Do Language Models Care About Text Quality? Evaluating Web-Crawled
Corpora Across 11 Languages
- URL: http://arxiv.org/abs/2403.08693v1
- Date: Wed, 13 Mar 2024 16:56:33 GMT
- Title: Do Language Models Care About Text Quality? Evaluating Web-Crawled
Corpora Across 11 Languages
- Authors: Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić,
Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral
- Abstract summary: We compare four of the most relevant large, web-crawled corpora across eleven lower-resourced European languages.
We find that there are clear differences in the quality of the corpora, with MaCoCu and OSCAR obtaining the best results.
We conclude that, in our experiments, the quality of the web-crawled corpora does not seem to play a significant role when training LMs.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large, curated, web-crawled corpora play a vital role in training language
models (LMs). They form the lion's share of the training data in virtually all
recent LMs, such as the well-known GPT, LLaMA and XLM-RoBERTa models. However,
despite this importance, relatively little attention has been given to the
quality of these corpora. In this paper, we compare four of the currently most
relevant large, web-crawled corpora (CC100, MaCoCu, mC4 and OSCAR) across
eleven lower-resourced European languages. Our approach is two-fold: first, we
perform an intrinsic evaluation through a human assessment of the quality of
samples taken from different corpora; then, we assess the practical impact
of the qualitative differences by training specific LMs on each of the corpora
and evaluating their performance on downstream tasks. We find that there are
clear differences in the quality of the corpora, with MaCoCu and OSCAR obtaining
the best results. However, during the extrinsic evaluation, we actually find
that the CC100 corpus achieves the highest scores. We conclude that, in our
experiments, the quality of the web-crawled corpora does not seem to play a
significant role when training LMs.
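The paper's two-fold setup lends itself to a small-scale approximation. Below is a minimal sketch of the intrinsic side: streaming small samples from two of the compared corpora with the Hugging Face datasets library and computing crude surface statistics. The dataset identifiers, config names and the choice of Icelandic ("is") are assumptions based on the public releases (mC4 in particular has since been folded into allenai/c4), not the authors' exact pipeline.

```python
# Hedged sketch: draw small streamed samples from two of the compared corpora
# (OSCAR and mC4) and compute crude surface-quality statistics. Dataset names,
# configs and the language code are assumptions and may have changed upstream.
from itertools import islice
from datasets import load_dataset

def surface_stats(texts):
    """Crude quality proxies: average document length and share of very short docs."""
    lengths = [len(t.split()) for t in texts]
    short = sum(1 for n in lengths if n < 20)
    return {
        "docs": len(lengths),
        "avg_words": sum(lengths) / max(len(lengths), 1),
        "share_short": short / max(len(lengths), 1),
    }

corpora = {
    "oscar": load_dataset("oscar", "unshuffled_deduplicated_is",
                          split="train", streaming=True),
    "mc4": load_dataset("mc4", "is", split="train", streaming=True),
}

for name, stream in corpora.items():
    sample = [doc["text"] for doc in islice(stream, 500)]  # tiny sample only
    print(name, surface_stats(sample))
```

A human evaluation would of course inspect the sampled documents directly; the statistics here only hint at where fluency or boilerplate problems might concentrate.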
Related papers
- DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models [78.51470038301436]
We introduce DecorateLM, a data engineering method designed to refine the pretraining corpus through data rating, tagging and editing.
We then apply DecorateLM to enhance 100 billion tokens of the training corpus, selecting 45 billion tokens that exemplify high quality and diversity for the further training of another 1.2 billion parameter LLM.
Our results demonstrate that employing such high-quality data can significantly boost model performance, showcasing a powerful approach to enhance the quality of the pretraining corpus.
arXiv Detail & Related papers (2024-10-08T02:42:56Z)
- QuRating: Selecting High-Quality Data for Training Language Models [64.83332850645074]
We introduce QuRating, a method for selecting pre-training data that can capture human intuitions about data quality.
In this paper, we investigate four qualities - writing style, required expertise, facts & trivia, and educational value.
We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B training corpus with quality ratings for each of the four criteria.
arXiv Detail & Related papers (2024-02-15T06:36:07Z)
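The rater training described above can be approximated with a standard pairwise-to-scalar (Bradley-Terry style) objective. The sketch below uses a tiny MLP over precomputed document embeddings purely as a placeholder; the paper's actual QuRater models are not reproduced here.

```python
# Generic sketch: learn a scalar quality score from pairwise judgments
# (Bradley-Terry style). The MLP over precomputed embeddings is a stand-in.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScalarRater(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x):               # x: (batch, dim) document embeddings
        return self.mlp(x).squeeze(-1)  # one scalar quality score per document

rater = ScalarRater()
opt = torch.optim.AdamW(rater.parameters(), lr=1e-4)

# Embeddings of the preferred and rejected document in each judged pair
# (random tensors here, purely as placeholders for real data).
emb_pref, emb_rej = torch.randn(32, 768), torch.randn(32, 768)

loss = -F.logsigmoid(rater(emb_pref) - rater(emb_rej)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

Once trained, such a rater can be run over a corpus and its scores used to select or reweight documents, in the spirit of the DecorateLM and QuRating pipelines summarized above.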
- Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora [1.0995326465245927]
We show that there are significant quality differences between different portions of web-mined corpora.
We also show that, for some web-mined datasets, Neural Machine Translation (NMT) models trained with their highest-ranked 25k portion can be on par with human-curated datasets.
arXiv Detail & Related papers (2024-02-12T07:03:14Z)
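A hedged sketch of the kind of ranking-and-selection this entry refers to: score web-mined sentence pairs with a multilingual sentence encoder and keep only the highest-ranked portion. LaBSE cosine similarity is an assumed, common scorer here, not necessarily the ranking used in the paper.

```python
# Hedged sketch: rank parallel sentence pairs by cross-lingual similarity and
# keep only the top-k (e.g. 25k) pairs. LaBSE is one common choice of scorer.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def top_k_pairs(src_sents, tgt_sents, k=25_000):
    """Return indices of the k pairs with the highest cosine similarity."""
    e_src = model.encode(src_sents, normalize_embeddings=True)
    e_tgt = model.encode(tgt_sents, normalize_embeddings=True)
    scores = (e_src * e_tgt).sum(axis=1)  # cosine, since embeddings are unit-norm
    return np.argsort(-scores)[:k]

# Toy usage with placeholder pairs; real web-mined corpora have millions of lines.
src = ["frase de ejemplo uno", "texto sin relación alguna"]
tgt = ["example sentence one", "buy now, limited offer"]
print(top_k_pairs(src, tgt, k=1))
```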
- LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and instruction following to a non-English language.
We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer.
We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
- Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)
- Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood.
We find that instruction tuning, not model size, is the key to LLMs' zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
- Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study [86.62171568318716]
Large generative language models such as GPT-2 are well-known for their ability to generate text.
We show that unsupervised predictors of "page quality" emerge, able to detect low quality content without any training.
We conduct extensive qualitative and quantitative analysis over 500 million web articles, making this the largest-scale study ever conducted on the topic.
arXiv Detail & Related papers (2020-08-17T07:13:24Z)
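The finding above is often operationalised by using a pretrained LM's likelihood as an unsupervised quality signal. The sketch below scores documents by GPT-2 perplexity; it is a simple stand-in for that family of approaches, not the detector-based scorer studied in the paper.

```python
# Stand-in sketch: use GPT-2 perplexity as an unsupervised text-quality proxy.
# This is not the paper's detector-based setup, just a related simple baseline.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt", truncation=True, max_length=512).input_ids
    loss = model(ids, labels=ids).loss  # mean per-token cross-entropy
    return float(torch.exp(loss))

docs = [
    "A short, well-formed paragraph about language models and their training data.",
    "zzz qwerty asdf 77 free clck hre winn prze nowv",
]
for d in docs:
    print(round(perplexity(d), 1), d[:50])
```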
- El Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks [0.12183405753834563]
Pre-training large-scale language models (LMs) requires huge amounts of text corpora.
We study the caveats of applying directly translated corpora for fine-tuning LMs for downstream natural language processing tasks.
We show that careful curation along with post-processing leads to improved performance and overall LM robustness.
arXiv Detail & Related papers (2020-07-03T22:22:44Z)