A Warm Start and a Clean Crawled Corpus -- A Recipe for Good Language
Models
- URL: http://arxiv.org/abs/2201.05601v2
- Date: Tue, 18 Jan 2022 09:38:47 GMT
- Title: A Warm Start and a Clean Crawled Corpus -- A Recipe for Good Language
Models
- Authors: Vésteinn Snæbjarnarson, Haukur Barri Símonarson, Pétur Orri
Ragnarsson, Svanhvít Lilja Ingólfsdóttir, Haukur Páll Jónsson,
Vilhjálmur Þorsteinsson, Hafsteinn Einarsson
- Abstract summary: We train several language models for Icelandic, including IceBERT, that achieve state-of-the-art performance in a variety of downstream tasks.
We introduce a new corpus of Icelandic text, the Icelandic Common Crawl Corpus (IC3), a collection of high-quality texts found online by targeting the Icelandic top-level domain (TLD).
We show that a properly cleaned crawled corpus is sufficient to achieve state-of-the-art results in NLP applications for low- to medium-resource languages.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We train several language models for Icelandic, including IceBERT, that
achieve state-of-the-art performance in a variety of downstream tasks,
including part-of-speech tagging, named entity recognition, grammatical error
detection and constituency parsing. To train the models we introduce a new
corpus of Icelandic text, the Icelandic Common Crawl Corpus (IC3), a collection
of high-quality texts found online by targeting the Icelandic top-level domain
(TLD). Several other public data sources are also collected, for a total of 16GB
of Icelandic text. To enhance the evaluation of model performance and to raise
the bar in baselines for Icelandic, we translate and adapt the WinoGrande
dataset for co-reference resolution. Through these efforts we demonstrate that
a properly cleaned crawled corpus is sufficient to achieve state-of-the-art
results in NLP applications for low- to medium-resource languages, by comparison
with models trained on a curated corpus. We further show that initializing
models using existing multilingual models can lead to state-of-the-art results
for some downstream tasks.
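The two ingredients named in the title can be illustrated with short sketches. The first is a minimal sketch of the corpus-construction idea only, not the authors' actual pipeline: it assumes crawl records arrive as (URL, text) pairs and keeps the text of records whose host falls under the Icelandic top-level domain (.is); the example records are hypothetical.

```python
# Minimal sketch (assumed interface, not the paper's pipeline): keep only
# crawl records whose URL host falls under the Icelandic TLD (.is).
from urllib.parse import urlparse

def is_icelandic_tld(url: str) -> bool:
    """Return True if the URL's host ends with the .is top-level domain."""
    host = urlparse(url).hostname or ""
    return host.endswith(".is")

# Hypothetical (url, text) records extracted from a Common Crawl dump.
records = [
    ("https://example.is/frett", "Texti á íslensku ..."),
    ("https://example.com/news", "English text ..."),
]
icelandic_docs = [text for url, text in records if is_icelandic_tld(url)]
```

The "warm start" finding, initializing from an existing multilingual checkpoint and continuing masked-language-model pretraining on Icelandic text, can likewise be sketched with the Hugging Face transformers and datasets libraries. The checkpoint name xlm-roberta-base, the toy dataset, and the hyperparameters below are illustrative assumptions, not the configuration reported in the paper.

```python
# Sketch of a warm start: load a multilingual checkpoint and continue
# masked-language-model pretraining on Icelandic text.
# NOTE: the checkpoint, data, and hyperparameters are illustrative assumptions.
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "xlm-roberta-base"  # assumed multilingual starting point
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)  # warm start

# Stand-in for the cleaned Icelandic corpus (e.g. IC3).
texts = ["Reykjavík er höfuðborg Íslands.", "Þetta er setning á íslensku."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="icelandic-warm-start",
        num_train_epochs=1,
        per_device_train_batch_size=8,
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()  # continued pretraining on the Icelandic text
```

After such continued pretraining, the model would be fine-tuned per downstream task (e.g. PoS tagging or NER) in the usual way; the abstract reports that warm-started models reach state-of-the-art results on some of these tasks.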
Related papers
- SWEb: A Large Web Dataset for the Scandinavian Languages [11.41086713693524]
This paper presents the largest pretraining dataset for the Scandinavian languages: the Scandinavian WEb (SWEb).
We introduce a novel model-based text extractor that significantly reduces complexity in comparison with rule-based approaches.
We also introduce a new cloze-style benchmark for evaluating language models in Swedish, and use this test to compare models trained on the SWEb data to models trained on FineWeb, with competitive results.
arXiv Detail & Related papers (2024-10-06T11:55:15Z)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese [4.4681678689625715]
We analyse the effect of pre-training with monolingual data for a low-resource language.
We present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance.
We compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu).
arXiv Detail & Related papers (2022-05-21T06:44:59Z)
- Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan [0.05277024349608833]
This work focuses on Catalan, with the aim of exploring to what extent a medium-sized monolingual language model is competitive with state-of-the-art large multilingual models.
We build a clean, high-quality textual Catalan corpus (CaText), train a Transformer-based language model for Catalan (BERTa), and devise a thorough evaluation in a diversity of settings.
The result is a new benchmark, the Catalan Language Understanding Benchmark (CLUB), which we publish as an open resource.
arXiv Detail & Related papers (2021-07-16T13:52:01Z)
- Operationalizing a National Digital Library: The Case for a Norwegian Transformer Model [0.0]
We show the process of building a large-scale training set from digital and digitized collections at a national library.
The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models in several token and sequence classification tasks.
arXiv Detail & Related papers (2021-04-19T20:36:24Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.