What's In My Big Data?
- URL: http://arxiv.org/abs/2310.20707v2
- Date: Tue, 5 Mar 2024 20:02:31 GMT
- Title: What's In My Big Data?
- Authors: Yanai Elazar, Akshita Bhagia, Ian Magnusson, Abhilasha Ravichander,
Dustin Schwenk, Alane Suhr, Pete Walsh, Dirk Groeneveld, Luca Soldaini,
Sameer Singh, Hanna Hajishirzi, Noah A. Smith, Jesse Dodge
- Abstract summary: We propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora.
WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node.
Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content.
- Score: 67.04525616289949
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large text corpora are the backbone of language models. However, we have a
limited understanding of the content of these corpora, including general
statistics, quality, social factors, and inclusion of evaluation data
(contamination). In this work, we propose What's In My Big Data? (WIMBD), a
platform and a set of sixteen analyses that allow us to reveal and compare the
contents of large text corpora. WIMBD builds on two basic capabilities -- count
and search -- at scale, which allows us to analyze more than 35 terabytes on a
standard compute node. We apply WIMBD to ten different corpora used to train
popular language models, including C4, The Pile, and RedPajama. Our analysis
uncovers several surprising and previously undocumented findings about these
corpora, including the high prevalence of duplicate, synthetic, and low-quality
content, personally identifiable information, toxic language, and benchmark
contamination. For instance, we find that about 50% of the documents in
RedPajama and LAION-2B-en are duplicates. In addition, several datasets used
for benchmarking models trained on such corpora are contaminated with respect
to important benchmarks, including the Winograd Schema Challenge and parts of
GLUE and SuperGLUE. We open-source WIMBD's code and artifacts to provide a
standard set of evaluations for new text-based corpora and to encourage more
analyses and transparency around them.
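WIMBD's count primitive is simple to illustrate in miniature. The following Python sketch is a toy version, not the released implementation: it approximates the paper's exact-duplicate analysis by hashing each whitespace-normalized document across JSONL shards and tallying collisions. The shard directory and the `text` field name are hypothetical stand-ins.

```python
import hashlib
import json
from collections import Counter
from pathlib import Path

def doc_hash(text: str) -> str:
    # Normalize whitespace so trivially reformatted copies still collide.
    return hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()

def duplicate_stats(shard_dir: str, text_key: str = "text") -> tuple[int, int]:
    # Tally a content hash for every document across all JSONL shards.
    counts: Counter = Counter()
    for shard in sorted(Path(shard_dir).glob("*.jsonl")):
        with shard.open(encoding="utf-8") as f:
            for line in f:
                counts[doc_hash(json.loads(line)[text_key])] += 1
    total = sum(counts.values())
    duplicates = total - len(counts)  # copies beyond the first of each hash
    return total, duplicates

if __name__ == "__main__":
    total, dups = duplicate_stats("corpus_shards")  # hypothetical directory
    print(f"{dups}/{total} documents ({dups / max(total, 1):.1%}) are exact duplicates")
```

In this toy setting the same hash table doubles as a crude search index: a contamination check reduces to looking up the hashes of benchmark test instances in the corpus tally, while the released tool scales both primitives to the multi-terabyte corpora described above.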
Related papers
- UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis [7.952225508086861]
In academic-literature and financial question answering, data often appear as raw text and tables in HTML or PDF documents.
We introduce a benchmark suite, namely Unstructured Document Analysis (UDA), that involves 2,965 real-world documents and 29,590 expert-annotated Q&A pairs.
arXiv Detail & Related papers (2024-06-21T14:29:39Z)
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset.
Our dataset is 15 times larger in scale than existing ones while maintaining good data quality.
We hope this could provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z)
- Separating the Wheat from the Chaff with BREAD: An open-source benchmark and metrics to detect redundancy in text [9.484323358958706]
We create and release BREAD, a human-labeled benchmark on repetitive boilerplate vs. plausible linguistic content.
We release several baseline CRED (Character REDundancy) scores along with it and evaluate their effectiveness on BREAD; a generic redundancy proxy is sketched after this entry.
arXiv Detail & Related papers (2023-11-11T00:11:50Z)
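The summary above names CRED scores but not their formulas, so the sketch below substitutes a generic compression-based proxy, an assumption rather than the paper's definition: repetitive boilerplate compresses well, so the fraction of bytes removed by DEFLATE acts as a crude character-redundancy score.

```python
import zlib

def compression_redundancy(text: str) -> float:
    # Fraction of bytes DEFLATE removes; repetitive text scores near 1.0.
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return max(0.0, 1.0 - len(zlib.compress(raw, 9)) / len(raw))

# Repetitive boilerplate vs. ordinary prose (illustrative inputs only).
print(compression_redundancy("home | about | contact | login | " * 50))
print(compression_redundancy("Grammars vary widely across the world's languages."))
```

A real metric of this kind would be thresholded and validated against BREAD's human labels rather than used with fixed cutoffs.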
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- Documenting the English Colossal Clean Crawled Corpus [28.008953329187648]
This work provides the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl.
We begin with a high-level summary of the data, including distributions of where the text came from and when it was written.
We then give more detailed analysis on salient parts of this data, including the most frequent sources of text.
arXiv Detail & Related papers (2021-04-18T07:42:52Z)
- Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets [21.375943264243144]
We manually audit the quality of 205 language-specific corpora released with five major public datasets.
We find that at least 15 corpora are completely erroneous, and in a significant fraction fewer than 50% of the sentences are of acceptable quality.
We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses.
arXiv Detail & Related papers (2021-03-22T17:30:33Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.