Open QA for Icelandic and English
Question Answering (QA) is the automated task of providing an answer to a question posed in human language. Whether through search engines or speech-controlled home assistants, it has become a tightly integrated part of many people's daily routines at work and at home. In recent years, QA methods have improved greatly for widely spoken languages such as English. This can be attributed almost wholly to advances in sequence modeling using deep neural networks, an increase in computing power, and the creation of large data sets suitable for training.
In this thesis, such QA methods are described, implemented, and evaluated for Icelandic. The methods applied are a statistical approach based on term frequency, a current standard-practice approach using a neural language model for Icelandic, and a modern variant using pre-encoded phrase lookup. A new QA corpus and Icelandic language models are also presented.
The result is a baseline for extractive QA in Icelandic, where an answer is highlighted in a single document or in a larger corpus. Finally, a cross-lingual extension of the phrase lookup method is investigated and adapted for Icelandic QA. In this system, questions can be asked in Icelandic and are answered with segments from the English Wikipedia. This system is then adapted to answer Icelandic questions in Icelandic using segments from the Icelandic Wikipedia, taking advantage of a bilingual language model.
- Dataset for question answering in Icelandic - Natural Questions in Icelandic (NQiI) - https://repository.clarin.is/repository/xmlui/handle/20.500.12537/143
- The Icelandic Common Crawl Corpus (IC3), collected, cleaned up and deduplicated. Available upon request.
- Icelandic language model - IceBERT, trained using Fairseq and exported to be compatible with Hugging Face transformers
- IceBERT-QA - a model fine-tuned for extractive QA in Icelandic on NQiI and on machine-translated versions of the NewsQA and SQuAD datasets
- XLMR-ENIS - cross-lingual Icelandic and English language model trained with Fairseq and ported to Hugging Face transformers. The English training data used is Books3 (160GB).
- An Icelandic DensePhrases model that can be used for Open QA.
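
As an illustration of the statistical approach based on term frequency mentioned in the abstract, the following is a minimal sketch of TF-IDF passage ranking in plain Python. It is not the implementation used in the thesis: the function names (`tokenize`, `build_index`, `rank`), the smoothed IDF formula, and the three example passages are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on word characters; in Python 3, \w also
    # matches Icelandic letters such as á, ð, þ and ö.
    return re.findall(r"\w+", text.lower())


def build_index(passages):
    # Compute a smoothed inverse document frequency per term and a
    # sparse TF-IDF vector (dict: term -> weight) per passage.
    tokenized = [tokenize(p) for p in passages]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return idf, vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(question, idf, vectors):
    # Question terms that never occur in the passages are ignored.
    q_tf = Counter(tokenize(question))
    q_vec = {t: c * idf[t] for t, c in q_tf.items() if t in idf}
    scores = [cosine(q_vec, v) for v in vectors]
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical example passages and question (not taken from NQiI).
passages = [
    "Reykjavík er höfuðborg Íslands.",
    "Hekla er eldfjall á Suðurlandi.",
    "Akureyri er bær á Norðurlandi.",
]
idf, vectors = build_index(passages)
order, scores = rank("Hver er höfuðborg Íslands?", idf, vectors)
best_passage = passages[order[0]]
```

A retriever of this kind only selects candidate passages. An extractive reader such as IceBERT-QA would then be needed to highlight the answer span within the top-ranked passage.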