Documenting the English Colossal Clean Crawled Corpus
- URL: http://arxiv.org/abs/2104.08758v1
- Date: Sun, 18 Apr 2021 07:42:52 GMT
- Title: Documenting the English Colossal Clean Crawled Corpus
- Authors: Jesse Dodge, Maarten Sap, Ana Marasovic, William Agnew, Gabriel
Ilharco, Dirk Groeneveld, Matt Gardner
- Abstract summary: This work provides the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl.
We begin with a high-level summary of the data, including distributions of where the text came from and when it was written.
We then give more detailed analysis on salient parts of this data, including the most frequent sources of text.
- Score: 28.008953329187648
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As language models are trained on ever more text, researchers are turning to
some of the largest corpora available. Unlike most other types of datasets in
NLP, large unlabeled text corpora are often presented with minimal
documentation, and best practices for documenting them have not been
established. In this work we provide the first documentation for the Colossal
Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a
set of filters to a single snapshot of Common Crawl. We begin with a high-level
summary of the data, including distributions of where the text came from and
when it was written. We then give more detailed analysis on salient parts of
this data, including the most frequent sources of text (e.g.,
patents.google.com, which contains a significant percentage of machine
translated and/or OCR'd text), the effect that the filters had on the data
(they disproportionately remove text in AAE), and evidence that some other
benchmark NLP dataset examples are contained in the text. We release a web
interface to an interactive, indexed copy of this dataset, encouraging the
community to continuously explore and report additional findings.
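As a rough illustration of what the abstract describes, creating C4 by applying heuristic cleaning filters to a single Common Crawl snapshot, the sketch below applies filters of the kind Raffel et al. (2020) report: a terminal-punctuation check and minimum line length, a minimum amount of prose per page, and a small blocklist. The constants, the placeholder blocklist, and the helper name `clean_page` are assumptions for illustration only; the real C4 pipeline also includes English language identification and span-level deduplication, which are omitted here.

```python
# Illustrative sketch of C4-style cleaning heuristics, NOT the official pipeline.
# The thresholds and the blocklist below are placeholders; Raffel et al. (2020)
# describe the actual rules, which also include English language identification
# and three-sentence deduplication (omitted here for brevity).
from typing import Optional

MIN_WORDS_PER_LINE = 5        # drop very short lines
MIN_SENTENCES_PER_PAGE = 3    # drop pages with too little prose
TERMINAL_PUNCT = (".", "!", "?", '"')
BLOCKLIST = {"lorem ipsum"}   # placeholder for the real "bad words" list


def clean_page(text: str) -> Optional[str]:
    """Return cleaned page text, or None if the whole page is filtered out."""
    lowered = text.lower()
    if any(term in lowered for term in BLOCKLIST):
        return None
    if "{" in text:  # crude heuristic for code/markup rather than natural language
        return None

    kept_lines = []
    for line in text.splitlines():
        line = line.strip()
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue
        if not line.endswith(TERMINAL_PUNCT):
            continue  # keep only lines ending in terminal punctuation
        kept_lines.append(line)

    cleaned = "\n".join(kept_lines)
    sentence_proxy = sum(cleaned.count(p) for p in (".", "!", "?"))
    if sentence_proxy < MIN_SENTENCES_PER_PAGE:
        return None  # crude sentence-count proxy
    return cleaned
```

Page-level heuristics of this kind are exactly what the paper's filter analysis examines, for example its finding that the filters disproportionately remove text in African American English.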
Related papers
- LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z) - DELINE8K: A Synthetic Data Pipeline for the Semantic Segmentation of Historical Documents [0.0]
Document semantic segmentation can facilitate document analysis tasks, including OCR, form classification, and document editing.
Several synthetic datasets have been developed to distinguish handwriting from printed text, but they fall short in class variety and document diversity.
We propose the most comprehensive document semantic segmentation pipeline to date, incorporating preprinted text, handwriting, and document backgrounds from over 10 sources.
Our customized dataset exhibits superior performance on the NAFSS benchmark, demonstrating it to be a promising tool for further research.
arXiv Detail & Related papers (2024-04-30T04:53:10Z) - UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity [50.91030850662369]
Existing text-based person retrieval datasets often have relatively coarse-grained text annotations.
This hinders models from comprehending the fine-grained semantics of query texts in real scenarios.
We contribute a new benchmark named UFineBench for text-based person retrieval with ultra-fine granularity.
arXiv Detail & Related papers (2023-12-06T11:50:14Z) - What's In My Big Data? [67.04525616289949]
We propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora.
WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node (see the count-and-search sketch after this list).
Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content.
arXiv Detail & Related papers (2023-10-31T17:59:38Z) - Not Just Plain Text! Fuel Document-Level Relation Extraction with
Explicit Syntax Refinement and Subsentence Modeling [3.9436257406798925]
We propose the expLicit syntAx Refinement and Subsentence mOdeliNg based framework (LARSON).
By introducing extra syntactic information, LARSON can model subsentences of arbitrary granularity and efficiently screen instructive ones.
Experimental results on three benchmark datasets (DocRED, CDR, and GDA) demonstrate that LARSON significantly outperforms existing methods.
arXiv Detail & Related papers (2022-11-10T05:06:37Z) - Towards End-to-End Unified Scene Text Detection and Layout Analysis [60.68100769639923]
We introduce the task of unified scene text detection and layout analysis.
The first hierarchical scene text dataset is introduced to enable this novel research task.
We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way.
arXiv Detail & Related papers (2022-03-28T23:35:45Z) - SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z) - Topic Modeling Based Extractive Text Summarization [0.0]
We propose a novel method to summarize a text document by clustering its contents based on latent topics.
We utilize the lesser-used and challenging WikiHow dataset in our approach to text summarization.
arXiv Detail & Related papers (2021-06-29T12:28:19Z) - Rethinking Text Segmentation: A Novel Dataset and A Text-Specific
Refinement Approach [34.63444886780274]
Text segmentation is a prerequisite in real-world text-related tasks.
We introduce Text Refinement Network (TexRNet), a novel text segmentation approach.
TexRNet consistently improves text segmentation performance by nearly 2% compared to other state-of-the-art segmentation methods.
arXiv Detail & Related papers (2020-11-27T22:50:09Z) - GLEAKE: Global and Local Embedding Automatic Keyphrase Extraction [1.0681288493631977]
We introduce Global and Local Embedding Automatic Keyphrase Extractor (GLEAKE) for the task of automatic keyphrase extraction.
GLEAKE uses single and multi-word embedding techniques to explore the syntactic and semantic aspects of the candidate phrases.
It refines the most significant phrases as a final set of keyphrases.
arXiv Detail & Related papers (2020-05-19T20:24:02Z) - Learning to Select Bi-Aspect Information for Document-Scale Text Content
Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset, in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
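Both the interactive, indexed copy of C4 released with this paper and the WIMBD platform listed above are built around two corpus primitives: counting and searching. The toy sketch below illustrates those primitives over a tiny in-memory corpus; the example documents and the names `term_counts`, `inverted_index`, and `search` are assumptions for illustration and do not reflect either system's actual implementation, which must scale to terabytes.

```python
# Toy illustration of the "count" and "search" primitives over a tiny corpus.
# Real systems (WIMBD, the C4 web index) rely on scalable indexing
# infrastructure; only the idea is shown here.
from collections import Counter, defaultdict

documents = [
    "patents are frequently machine translated",
    "common crawl contains duplicate pages",
    "duplicate pages inflate token counts",
]

# COUNT: corpus-level term frequencies.
term_counts = Counter(token for doc in documents for token in doc.split())

# SEARCH: an inverted index mapping each term to the documents containing it.
inverted_index = defaultdict(set)
for doc_id, doc in enumerate(documents):
    for token in set(doc.split()):
        inverted_index[token].add(doc_id)


def search(term: str) -> list:
    """Return the documents that contain the given term."""
    return [documents[i] for i in sorted(inverted_index.get(term, set()))]


if __name__ == "__main__":
    print(term_counts.most_common(3))  # most frequent terms in the corpus
    print(search("duplicate"))         # documents mentioning "duplicate"
```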
This list is automatically generated from the titles and abstracts of the papers in this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.