Icelandic is spoken by around 350,000 people. As with other small languages, strong language resources are important to its survival in the digital age. IceBERT is a neural language model for Icelandic, trained using the RoBERTa architecture, that achieves state-of-the-art performance on a variety of downstream tasks in Icelandic, including part-of-speech tagging, named entity recognition, constituency parsing, question answering and summarization.
To train the model, we create a new corpus of Icelandic text, the Icelandic Common Crawl Corpus (IC3), a collection of texts found online and acquired by efficient methods. Several other public data sources are also collected, for a total of 16 GB of Icelandic text. To further evaluate the model and raise the bar for Icelandic baselines, we translate and adapt the WinoGrande dataset for coreference resolution. Through these efforts we demonstrate that state-of-the-art NLP applications are within reach for smaller, lower-resource languages. (Icelandic is the 54th least represented and most represented language in the Common Crawl data.)
- The Icelandic Common Crawl Corpus (IC3), collected, cleaned up and deduplicated. Available upon request.
- The Icelandic language model IceBERT, trained using fairseq and exported to be compatible with Hugging Face Transformers.
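
The deduplication step mentioned for IC3 can be illustrated with a minimal sketch. This is not the pipeline used to build the corpus, which is not described here. It assumes exact-match deduplication at paragraph level after light normalization (Unicode NFC, whitespace collapsing, case folding). The function names `normalize` and `deduplicate` and the sample sentences are hypothetical and chosen only for illustration.

```python
import hashlib
import re
import unicodedata


def normalize(paragraph: str) -> str:
    """Canonicalize a paragraph so trivially different copies compare equal."""
    text = unicodedata.normalize("NFC", paragraph)
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text.casefold()


def deduplicate(paragraphs):
    """Yield each paragraph the first time its normalized form is seen."""
    seen = set()
    for paragraph in paragraphs:
        key = normalize(paragraph)
        if not key:
            continue  # drop empty or whitespace-only paragraphs
        # Store a fixed-size digest instead of the full text to save memory.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield paragraph


# Hypothetical sample: the second entry differs only in case and spacing.
sample = [
    "Ísland er eyja í Norður-Atlantshafi.",
    "ísland  er eyja í norður-atlantshafi.",
    "",
    "Reykjavík er höfuðborg Íslands.",
]
unique = list(deduplicate(sample))
```

Here `unique` keeps the first and last sample paragraphs, in their original form. Exact matching of this kind only catches verbatim repeats. Near-duplicate detection (for example MinHash) would be a separate design choice for web-crawled text.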