Does Corpus Quality Really Matter for Low-Resource Languages?
- URL: http://arxiv.org/abs/2203.08111v1
- Date: Tue, 15 Mar 2022 17:40:27 GMT
- Title: Does Corpus Quality Really Matter for Low-Resource Languages?
- Authors: Mikel Artetxe, Itziar Aldabe, Rodrigo Agerri, Olatz
Perez-de-Viñaspre, Aitor Soroa
- Abstract summary: The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl.
Taking Basque as a case study, we explore tailored crawling (manually identifying and scraping websites with high-quality content) as an alternative to filtering CommonCrawl.
Our new corpus, called EusCrawl, is similar in size to the Basque portion of popular multilingual corpora like CC100 and mC4.
- Score: 27.315905109092466
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The vast majority of non-English corpora are derived from automatically
filtered versions of CommonCrawl. While prior work has identified major issues
with the quality of these datasets (Kreutzer et al., 2021), it is not clear how
this impacts downstream performance. Taking Basque as a case study, we explore
tailored crawling (manually identifying and scraping websites with high-quality
content) as an alternative to filtering CommonCrawl. Our new corpus, called
EusCrawl, is similar in size to the Basque portion of popular multilingual
corpora like CC100 and mC4, yet it has a much higher quality according to
native annotators. For instance, 66% of documents are rated as high-quality for
EusCrawl, in contrast with <33% for both mC4 and CC100. Nevertheless, we obtain
similar results on downstream tasks regardless of the corpus used for
pre-training. Our work suggests that NLU performance in low-resource languages
is primarily constrained by the quantity rather than the quality of the data,
prompting for methods to exploit more diverse data sources.
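
As a rough illustration of the contrast drawn in the abstract, the sketch below builds a tiny corpus by tailored crawling: it fetches pages only from a hand-curated seed list and keeps documents that pass a crude quality heuristic. The seed URLs, helper functions, and thresholds are illustrative assumptions; they do not reproduce the actual EusCrawl pipeline or the CC100/mC4 filters.

```python
# Minimal sketch of tailored crawling: collect text only from hand-picked,
# high-quality sites instead of filtering a generic CommonCrawl dump.
# Seed list, heuristics, and thresholds are illustrative assumptions only.
import requests
from bs4 import BeautifulSoup

SEED_SITES = [
    "https://example-basque-news.eus",      # hypothetical curated domains
    "https://example-basque-magazine.eus",
]

def extract_text(html: str) -> str:
    """Strip markup and return the visible text of a page."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

def looks_high_quality(text: str, min_words: int = 200) -> bool:
    """Crude stand-in for a document-level quality check."""
    words = text.split()
    if len(words) < min_words:
        return False
    # Pages dominated by short repeated tokens (menus, tag clouds) score low.
    return len(set(words)) / len(words) > 0.3

corpus = []
for url in SEED_SITES:
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue
    text = extract_text(html)
    if looks_high_quality(text):
        corpus.append({"url": url, "text": text})

print(f"kept {len(corpus)} documents")
```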
Related papers
- How Good is Your Wikipedia? [13.814955569390207]
This paper critically examines the data quality of Wikipedia in a non-English setting by subjecting it to various quality filtering techniques.
We find that data quality pruning is an effective means for resource-efficient training without hurting performance.
arXiv Detail & Related papers (2024-11-08T12:35:58Z)
- Separating the Wheat from the Chaff with BREAD: An open-source benchmark and metrics to detect redundancy in text [9.484323358958706]
We create and release BREAD, a human-labeled benchmark on repetitive boilerplate vs. plausible linguistic content.
We release several baseline CRED (Character REDundancy) scores along with it, and evaluate their effectiveness on BREAD (a rough character-redundancy sketch appears after this list).
arXiv Detail & Related papers (2023-11-11T00:11:50Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- esCorpius: A Massive Spanish Crawling Corpus [2.262838186547612]
esCorpius is a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data.
It is the most extensive Spanish corpus with this level of quality in the extraction, purification, and deduplication of web textual content.
arXiv Detail & Related papers (2022-06-30T09:29:18Z)
- Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection [83.3580786484122]
We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes, are more likely to be classified as high quality.
We argue that privileging any corpus as high quality entails a language ideology.
arXiv Detail & Related papers (2022-01-25T17:20:04Z)
- Intent Classification Using Pre-Trained Embeddings For Low Resource Languages [67.40810139354028]
Building Spoken Language Understanding systems that do not rely on language-specific Automatic Speech Recognition is an important yet less explored problem in language processing.
We present a comparative study aimed at employing a pre-trained acoustic model to perform Spoken Language Understanding in low resource scenarios.
We perform experiments across three different languages: English, Sinhala, and Tamil, each with different data sizes to simulate high-, medium-, and low-resource scenarios.
arXiv Detail & Related papers (2021-10-18T13:06:59Z)
- Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets [21.375943264243144]
We manually audit the quality of 205 language-specific corpora released with five major public datasets.
We find that at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% of sentences of acceptable quality.
We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses.
arXiv Detail & Related papers (2021-03-22T17:30:33Z)
- XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
- Practical Comparable Data Collection for Low-Resource Languages via Images [126.64069379167975]
We propose a method of curating high-quality comparable training data for low-resource languages with monolingual annotators.
Our method involves using a carefully selected set of images as a pivot between the source and target languages by getting captions for such images in both languages independently.
Human evaluations on the English-Hindi comparable corpora created with our method show that 81.1% of the pairs are acceptable translations, and only 2.47% of the pairs are not translations at all.
arXiv Detail & Related papers (2020-04-24T19:30:38Z)
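
The BREAD entry above mentions CRED (Character REDundancy) scores for separating repetitive boilerplate from plausible linguistic content. The exact CRED definitions are not given here, so the snippet below is only a stand-in: it uses a compression ratio as a character-level redundancy proxy, with an arbitrary threshold chosen purely for illustration.

```python
# Illustrative character-redundancy proxy, not the actual CRED metrics from
# the BREAD paper: repetitive boilerplate compresses far better than natural
# running text, so a low compression ratio suggests redundancy.
import zlib

def redundancy_score(text: str) -> float:
    """Return compressed size / raw size; lower values mean more repetition."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0
    return len(zlib.compress(raw)) / len(raw)

boilerplate = "home | about | contact | login | sitemap " * 40
prose = (
    "Tailored crawling collects documents from hand-picked websites, while "
    "automatic filtering tries to clean a generic web dump after the fact; "
    "both strategies trade coverage against precision."
)

print(f"boilerplate: {redundancy_score(boilerplate):.2f}")   # close to 0.0
print(f"prose:       {redundancy_score(prose):.2f}")         # much higher
# Arbitrary decision rule for flagging boilerplate (threshold is an assumption):
is_boilerplate = redundancy_score(boilerplate) < 0.2
```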
This list is automatically generated from the titles and abstracts of the papers on this site.