Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
- URL: http://arxiv.org/abs/2103.12028v1
- Date: Mon, 22 Mar 2021 17:30:33 GMT
- Title: Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
- Authors: Isaac Caswell, Julia Kreutzer, Lisa Wang, Ahsan Wahab, Daan van Esch,
Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov,
Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb,
Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey
Osei, Pedro Javier Ortiz Suárez, Iroro Orife, Kelechi Ogueji, Rubungo Andre
Niyongabo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen
Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov,
Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine
Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile
Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman,
Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe
Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia,
Sweta Agrawal, Mofetoluwa Adeyemi
- Abstract summary: We manually audit the quality of 205 language-specific corpora released with five major public datasets.
We find that at least 15 corpora are completely erroneous, and that in a significant fraction fewer than 50% of the sentences are of acceptable quality.
We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses.
- Score: 21.375943264243144
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the success of large-scale pre-training and multilingual modeling in
Natural Language Processing (NLP), recent years have seen a proliferation of
large, web-mined text datasets covering hundreds of languages. However, to date
there has been no systematic analysis of the quality of these publicly
available datasets, or whether the datasets actually contain content in the
languages they claim to represent. In this work, we manually audit the quality
of 205 language-specific corpora released with five major public datasets
(CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of
language codes in a sixth (JW300). We find that lower-resource corpora have
systematic issues: at least 15 corpora are completely erroneous, and in a
significant fraction fewer than 50% of the sentences are of acceptable quality.
Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous
language codes. We demonstrate that these issues are easy to detect even for
non-speakers of the languages in question, and supplement the human judgements
with automatic analyses. Inspired by our analysis, we recommend techniques to
evaluate and improve multilingual corpora and discuss the risks that come with
low-quality data releases.
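As a rough illustration of the kind of automatic analysis the abstract alludes to, the sketch below samples lines from a corpus released under a given language code and checks how often an off-the-shelf LID model agrees with that code. This is a minimal sketch under stated assumptions, not the paper's pipeline: the fastText model file, the corpus path, and the sample size are all illustrative.

```python
# Minimal sketch of an automatic corpus audit, in the spirit of the
# paper's "automatic analyses" (not its actual pipeline). It estimates
# how much of a corpus is really in the language its code claims.
# Assumes the pretrained lid.176.bin model from
# https://fasttext.cc/docs/en/language-identification.html
import random

import fasttext


def lid_agreement(corpus_path: str, claimed_code: str,
                  sample_size: int = 100) -> float:
    """Fraction of sampled lines whose predicted language matches
    the corpus's claimed language code."""
    model = fasttext.load_model("lid.176.bin")
    with open(corpus_path, encoding="utf-8") as f:
        lines = [ln.strip() for ln in f if ln.strip()]
    sample = random.sample(lines, min(sample_size, len(lines)))
    hits = 0
    for sentence in sample:
        labels, _ = model.predict(sentence)  # e.g. ('__label__en',)
        hits += labels[0] == f"__label__{claimed_code}"
    return hits / len(sample)


# A corpus labeled "ha" (Hausa) scoring well below 0.5 here would be
# flagged for closer human inspection (path is hypothetical):
# print(lid_agreement("corpora/ha.txt", "ha"))
```

Since LID itself is unreliable for many low-resource languages (see the "Language ID in the Wild" entry below), a low agreement score is best treated as a trigger for human review rather than for automatic deletion.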
Related papers
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with Effective Evaluation Model [40.23569361268597]
We propose EvalWeb, a complete tool-chain for extracting clean Chinese text from noisy web data.
We release ChineseWebText, the largest and most recent large-scale high-quality Chinese web text corpus, comprising 1.42 TB of text in which each document carries a quality score.
arXiv Detail & Related papers (2023-11-02T11:13:51Z)
- What's In My Big Data? [67.04525616289949]
We propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora.
WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node.
Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content.
arXiv Detail & Related papers (2023-10-31T17:59:38Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Massively Multilingual Corpus of Sentiment Datasets and Multi-faceted Sentiment Classification Benchmark [7.888702613862612]
This work presents the most extensive open massively multilingual corpus of datasets for training sentiment models.
The corpus consists of 79 manually selected datasets from over 350 datasets reported in the scientific literature.
We present a multi-faceted sentiment classification benchmark summarizing hundreds of experiments conducted on different base models, training objectives, dataset collections, and fine-tuning strategies.
arXiv Detail & Related papers (2023-06-13T16:54:13Z)
- CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus [15.807197703827818]
We train LangID models on up to 1,629 languages with comparable quality on held-out test sets.
We find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages.
We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters and transformer-based semi-supervised LangID models; a rough sketch of the wordlist approach follows this list.
arXiv Detail & Related papers (2020-10-27T19:29:17Z)
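The wordlist-based tunable-precision filters mentioned in the last entry can be sketched as follows: keep a sentence only when a large enough fraction of its tokens occurs in a trusted wordlist for the target language, with the threshold controlling the precision/recall trade-off. The function name, tokenization, and default threshold below are illustrative assumptions, not the cited paper's implementation.

```python
# Illustrative sketch of a wordlist-based tunable-precision filter
# (an assumption-laden reconstruction, not the cited paper's code).
# Raising min_in_vocab trades recall for precision.
from typing import Iterable, Iterator


def wordlist_filter(sentences: Iterable[str], wordlist: set[str],
                    min_in_vocab: float = 0.6) -> Iterator[str]:
    """Yield sentences whose in-vocabulary token ratio meets the threshold.

    wordlist: known-good words for the target language, e.g. taken from
    a vetted dictionary or a clean seed corpus (hypothetical source).
    """
    vocab = {w.lower() for w in wordlist}
    for sentence in sentences:
        tokens = [t.strip(".,!?;:\"'") for t in sentence.lower().split()]
        tokens = [t for t in tokens if t]
        if not tokens:
            continue
        in_vocab = sum(t in vocab for t in tokens) / len(tokens)
        if in_vocab >= min_in_vocab:
            yield sentence


# Toy usage: with a tiny Hausa wordlist, only the first sentence passes.
# seed = {"sannu", "ina", "kwana", "lafiya"}
# kept = list(wordlist_filter(["Sannu, ina kwana?", "hello world"], seed))
```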