Does Corpus Quality Really Matter for Low-Resource Languages?
- URL: http://arxiv.org/abs/2203.08111v1
- Date: Tue, 15 Mar 2022 17:40:27 GMT
- Title: Does Corpus Quality Really Matter for Low-Resource Languages?
- Authors: Mikel Artetxe, Itziar Aldabe, Rodrigo Agerri, Olatz Perez-de-Viñaspre, Aitor Soroa
- Abstract summary: The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl.
Taking Basque as a case study, we explore tailored crawling (manually identifying and scraping websites with high-quality content) as an alternative to filtering CommonCrawl.
Our new corpus, called EusCrawl, is similar in size to the Basque portion of popular multilingual corpora like CC100 and mC4.
- Score: 27.315905109092466
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The vast majority of non-English corpora are derived from automatically
filtered versions of CommonCrawl. While prior work has identified major issues
with the quality of these datasets (Kreutzer et al., 2021), it is not clear how
this impacts downstream performance. Taking Basque as a case study, we explore
tailored crawling (manually identifying and scraping websites with high-quality
content) as an alternative to filtering CommonCrawl. Our new corpus, called
EusCrawl, is similar in size to the Basque portion of popular multilingual
corpora like CC100 and mC4, yet it has a much higher quality according to
native annotators. For instance, 66% of documents are rated as high-quality for
EusCrawl, in contrast with <33% for both mC4 and CC100. Nevertheless, we obtain
similar results on downstream tasks regardless of the corpus used for
pre-training. Our work suggests that NLU performance in low-resource languages
is primarily constrained by the quantity rather than the quality of the data,
prompting for methods to exploit more diverse data sources.
Related papers
- Separating the Wheat from the Chaff with BREAD: An open-source benchmark
and metrics to detect redundancy in text [9.484323358958706]
We create and release BREAD, a human-labeled benchmark on repetitive boilerplate vs. plausible linguistic content.
We release several baseline CRED (Character REDundancy) scores along with it, and evaluate their effectiveness on BREAD.
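The exact CRED formulations from the paper are not given in this summary; one plausible character-level redundancy proxy (an illustrative assumption, not the paper's actual metric) is a compression ratio, since repetitive boilerplate compresses far better than diverse linguistic content:

```python
import zlib

def char_redundancy_score(text: str) -> float:
    """Character-redundancy proxy: 1 - (compressed size / raw size).

    Repetitive boilerplate compresses well, so its score approaches 1;
    diverse natural-language prose compresses poorly, so its score stays low.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    compressed = zlib.compress(raw, level=9)
    # Clamp at 0 in case compression overhead exceeds the savings.
    return max(0.0, 1.0 - len(compressed) / len(raw))

boilerplate = "click here | share | log in | " * 20
prose = ("Taking Basque as a case study, we explore tailored crawling "
         "as an alternative to filtering CommonCrawl.")
assert char_redundancy_score(boilerplate) > char_redundancy_score(prose)
```

A benchmark like BREAD could then be used to check how well such a score separates boilerplate from plausible content.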
arXiv Detail & Related papers (2023-11-11T00:11:50Z)
- ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with
Effective Evaluation Model [40.23569361268597]
We propose EvalWeb, a complete tool chain to extract clean Chinese texts from noisy web data.
We release ChineseWebText, the largest and latest large-scale high-quality Chinese web text corpus, which comprises 1.42 TB of data, with each text assigned a quality score.
arXiv Detail & Related papers (2023-11-02T11:13:51Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- esCorpius: A Massive Spanish Crawling Corpus [2.262838186547612]
esCorpius is a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data.
It is the most extensive Spanish corpus with this level of quality in the extraction, purification, and deduplication of web textual content.
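The summary does not specify how esCorpius deduplicates; a minimal sketch of one common approach (exact-match deduplication of normalized paragraphs across documents, an assumption for illustration rather than the paper's pipeline) looks like this:

```python
import hashlib

def dedup_paragraphs(docs: list[str]) -> list[str]:
    """Remove paragraphs already seen in earlier documents.

    Whitespace and case are normalized before hashing, so trivially
    reformatted copies of the same boilerplate are also caught.
    """
    seen: set[str] = set()
    cleaned = []
    for doc in docs:
        kept = []
        for para in doc.split("\n"):
            normalized = " ".join(para.lower().split())
            key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
            if para.strip() and key not in seen:
                seen.add(key)
                kept.append(para)
        cleaned.append("\n".join(kept))
    return cleaned

# The repeated cookie banner survives only in the first document.
docs = ["Cookie notice\nContent A", "cookie   notice\nContent B"]
print(dedup_paragraphs(docs))  # → ['Cookie notice\nContent A', 'Content B']
```

Production pipelines typically replace the exact hash with near-duplicate methods such as MinHash to also catch slightly edited copies.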
arXiv Detail & Related papers (2022-06-30T09:29:18Z)
- Whose Language Counts as High Quality? Measuring Language Ideologies in
Text Data Selection [83.3580786484122]
We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes are more likely to be classified as high quality.
We argue that privileging any corpus as high quality entails a language ideology.
arXiv Detail & Related papers (2022-01-25T17:20:04Z)
- Intent Classification Using Pre-Trained Embeddings For Low Resource
Languages [67.40810139354028]
Building Spoken Language Understanding systems that do not rely on language-specific Automatic Speech Recognition is an important yet underexplored problem in language processing.
We present a comparative study aimed at employing a pre-trained acoustic model to perform Spoken Language Understanding in low resource scenarios.
We perform experiments across three different languages: English, Sinhala, and Tamil, each with a different data size to simulate high-, medium-, and low-resource scenarios.
arXiv Detail & Related papers (2021-10-18T13:06:59Z) - Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets [21.375943264243144]
We manually audit the quality of 205 language-specific corpora released with five major public datasets.
We find that at least 15 corpora are completely erroneous, and a significant fraction contains fewer than 50% of sentences of acceptable quality.
We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses.
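The audit's own automatic analyses are not detailed in this summary; a minimal sketch of the kind of language-agnostic heuristic that could flag non-linguistic lines even without knowing the language (the thresholds and checks below are illustrative assumptions) might be:

```python
def looks_like_text(line: str) -> bool:
    """Heuristic check that a line plausibly contains natural language.

    Flags lines that are too short, dominated by digits/punctuation/markup,
    or made of implausibly short or long tokens.
    """
    tokens = line.split()
    if len(tokens) < 3:
        return False  # too short to be a sentence
    alpha_ratio = sum(c.isalpha() for c in line) / len(line)
    if alpha_ratio < 0.6:
        return False  # mostly digits, punctuation, or markup residue
    avg_token_len = sum(len(t) for t in tokens) / len(tokens)
    return 2 <= avg_token_len <= 15  # implausible token lengths otherwise

assert looks_like_text("The vast majority of corpora are derived from CommonCrawl data.")
assert not looks_like_text("404 403 500 !!!")
```

Such checks complement, rather than replace, human judgements: they catch obvious debris but cannot assess fluency or factuality.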
arXiv Detail & Related papers (2021-03-22T17:30:33Z) - XL-WiC: A Multilingual Benchmark for Evaluating Semantic
Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z) - Practical Comparable Data Collection for Low-Resource Languages via
Images [126.64069379167975]
We propose a method of curating high-quality comparable training data for low-resource languages with monolingual annotators.
Our method involves using a carefully selected set of images as a pivot between the source and target languages by getting captions for such images in both languages independently.
Human evaluations on the English-Hindi comparable corpora created with our method show that 81.1% of the pairs are acceptable translations, and only 2.47% of the pairs are not translations at all.
arXiv Detail & Related papers (2020-04-24T19:30:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.