What's in the Box? An Analysis of Undesirable Content in the Common
Crawl Corpus
- URL: http://arxiv.org/abs/2105.02732v2
- Date: Fri, 7 May 2021 17:28:38 GMT
- Title: What's in the Box? An Analysis of Undesirable Content in the Common
Crawl Corpus
- Authors: Alexandra Sasha Luccioni, Joseph D. Viviano
- Abstract summary: We analyze the Common Crawl, a colossal web corpus extensively used for training language models.
We find that it contains a significant amount of undesirable content, including hate speech and sexually explicit content, even after filtering procedures.
- Score: 77.34726150561087
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Whereas much of the success of the current generation of neural language
models has been driven by increasingly large training corpora, relatively
little research has been dedicated to analyzing these massive sources of
textual data. In this exploratory analysis, we delve deeper into the Common
Crawl, a colossal web corpus that is extensively used for training language
models. We find that it contains a significant amount of undesirable content,
including hate speech and sexually explicit content, even after filtering
procedures. We conclude with a discussion of the potential impacts of this
content on language models and call for a more mindful approach to corpus
collection and analysis.
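The filtering the abstract refers to is typically blocklist-style: a document is dropped when too many of its tokens match a curated list of explicit or hateful terms. The sketch below is a minimal illustration of that general pattern, not the authors' pipeline; the placeholder terms, tokenizer, and threshold are all assumptions.
```python
# Minimal sketch of a blocklist-style filter over plain-text web documents.
# NOTE: the blocklist contents, the tokenizer, and the rejection threshold are
# illustrative assumptions, not the filtering pipeline the paper audits.
import re

BLOCKLIST = {"badword1", "badword2"}  # placeholder terms; real filters use curated lists
TOKEN_RE = re.compile(r"[a-z']+")

def blocklisted_fraction(text: str) -> float:
    """Return the fraction of tokens that appear in the blocklist."""
    tokens = TOKEN_RE.findall(text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in BLOCKLIST)
    return hits / len(tokens)

def keep_document(text: str, max_fraction: float = 0.01) -> bool:
    """Keep a document only if blocklisted tokens stay below the threshold."""
    return blocklisted_fraction(text) <= max_fraction

documents = ["an innocuous paragraph of web text", "badword1 repeated badword1 text"]
kept = [doc for doc in documents if keep_document(doc)]
print(len(kept), "of", len(documents), "documents kept")
```
Filters of this kind operate on surface forms only, which is consistent with the paper's finding that hate speech and sexually explicit content survive filtering.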
Related papers
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B.
We found that hate content increased by nearly 12% with dataset scale, in both qualitative and quantitative measurements.
We also found that filtering dataset contents by Not Safe For Work (NSFW) scores computed from images alone does not exclude all the harmful content in the accompanying alt-text.
arXiv Detail & Related papers (2023-11-06T19:00:05Z)
- Sentiment Analysis Using Aligned Word Embeddings for Uralic Languages [1.0312968200748118]
We present an approach for translating word embeddings from a majority language into 4 minority languages.
Furthermore, we present a novel neural network model that is trained on English data to conduct sentiment analysis.
Our research shows that state-of-the-art neural models can be used with endangered languages.
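The summary does not say how the embeddings are translated between languages; one common approach is to learn an orthogonal (Procrustes) mapping from a small bilingual dictionary. The numpy sketch below illustrates that assumed approach on toy data, not the paper's actual procedure.
```python
# Sketch of orthogonal (Procrustes) alignment between two embedding spaces.
# ASSUMPTION: the paper's exact alignment method may differ; this shows the
# standard dictionary-based mapping of a source space onto a target space.
import numpy as np

def learn_orthogonal_map(src_vecs: np.ndarray, tgt_vecs: np.ndarray) -> np.ndarray:
    """Solve min_W ||src_vecs @ W - tgt_vecs|| with W orthogonal (Procrustes)."""
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt

rng = np.random.default_rng(0)
dim, n_pairs = 50, 200
majority = rng.normal(size=(n_pairs, dim))           # dictionary-word embeddings, majority language
true_rotation = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
minority = majority @ true_rotation                  # toy "minority language" space

W = learn_orthogonal_map(majority, minority)
print("alignment error:", np.linalg.norm(majority @ W - minority))  # ~0 on this toy data
```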
arXiv Detail & Related papers (2023-05-24T17:40:20Z)
- Natural Language Decompositions of Implicit Content Enable Better Text Representations [56.85319224208865]
We introduce a method for the analysis of text that takes implicitly communicated content explicitly into account.
We use a large language model to produce sets of propositions that are inferentially related to the text that has been observed.
Our results suggest that modeling the meanings behind observed language, rather than the literal text alone, is a valuable direction for NLP.
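As a rough sketch of how such a decomposition step could be set up (the prompt wording and the `generate_fn` interface are placeholders rather than the paper's prompts or model):
```python
# Sketch of decomposing a text into inferentially related propositions with an LLM.
# ASSUMPTION: the prompt and the `generate_fn` callable are illustrative placeholders;
# the paper's prompting setup and model are not reproduced here.
from typing import Callable, List

PROMPT_TEMPLATE = (
    "List short propositions that a reader could reasonably infer from the "
    "following text, one per line:\n\n{text}\n\nPropositions:\n"
)

def decompose(text: str, generate_fn: Callable[[str], str]) -> List[str]:
    """Ask a language model for implicit propositions and return them as a list."""
    completion = generate_fn(PROMPT_TEMPLATE.format(text=text))
    return [line.strip("- ").strip() for line in completion.splitlines() if line.strip()]

# Usage with any text-generation function, e.g. a local model or an API wrapper:
# propositions = decompose("We can't afford another four years of this.", my_generate_fn)
```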
arXiv Detail & Related papers (2023-05-23T23:45:20Z)
- Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data [0.0]
We explore different methods for detecting adult and harmful content in multilingual heterogeneous web data.
We train solely on adult and harmful textual data, and then select the documents having a perplexity value above a given threshold.
This approach effectively clusters the documents into two distinct groups, which greatly facilitates the choice of the perplexity threshold.
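A minimal sketch of the thresholding idea is shown below; a toy add-one-smoothed unigram model stands in for the paper's language models trained on adult and harmful text, and the threshold value is illustrative.
```python
# Sketch of perplexity-based filtering: score documents under a model trained
# only on adult/harmful text and keep those whose perplexity exceeds a threshold
# (i.e. documents that look unlike the harmful training data).
# ASSUMPTION: a simple add-one-smoothed unigram model stands in for the
# paper's actual language models.
import math
from collections import Counter

class UnigramLM:
    def __init__(self, training_docs):
        tokens = [tok for doc in training_docs for tok in doc.lower().split()]
        self.counts = Counter(tokens)
        self.total = len(tokens)
        self.vocab = len(self.counts) + 1  # +1 for unseen tokens

    def perplexity(self, doc: str) -> float:
        tokens = doc.lower().split()
        log_prob = sum(
            math.log((self.counts[tok] + 1) / (self.total + self.vocab))
            for tok in tokens
        )
        return math.exp(-log_prob / max(len(tokens), 1))

harmful_lm = UnigramLM(["example harmful training text", "more harmful training text"])
threshold = 10.0  # toy value; in practice chosen from the gap between the two score clusters
docs = ["more harmful training text", "an ordinary news article about weather"]
for doc in docs:
    print(round(harmful_lm.perplexity(doc), 1), doc)
kept = [doc for doc in docs if harmful_lm.perplexity(doc) > threshold]
```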
arXiv Detail & Related papers (2022-12-20T17:14:45Z)
- Assessing the impact of contextual information in hate speech detection [0.48369513656026514]
We provide a novel corpus for contextualized hate speech detection based on user responses to news posts from media outlets on Twitter.
This corpus was collected in the Rioplatense dialectal variety of Spanish and focuses on hate speech associated with the COVID-19 pandemic.
arXiv Detail & Related papers (2022-10-02T09:04:47Z)
- On the Effect of Pretraining Corpora on In-context Learning by a Large-scale Language Model [56.82120834538467]
We investigate the effects of the source and size of the pretraining corpus on in-context learning in a Korean-centric GPT-3 model.
We find that in-context learning performance heavily depends on the corpus domain source, and the size of the pretraining corpus does not necessarily determine the emergence of in-context learning.
arXiv Detail & Related papers (2022-04-28T13:59:54Z)
- Data Expansion using Back Translation and Paraphrasing for Hate Speech Detection [1.192436948211501]
We present a new deep learning-based method that fuses back translation and paraphrasing techniques for data augmentation.
We evaluate our proposal on five publicly available datasets: the AskFm corpus, the Formspring dataset, the Warner and Waseem dataset, OLID, and the Wikipedia toxic comments dataset.
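A hedged sketch of the back-translation half of this augmentation is shown below; it assumes the Hugging Face transformers translation pipeline with MarianMT checkpoints, which may differ from the models used in the paper, and it omits the paraphrasing component.
```python
# Sketch of back-translation data augmentation: translate English text to a pivot
# language and back to obtain a paraphrase-like variant for training.
# ASSUMPTION: the MarianMT checkpoints below are a common choice, not necessarily
# the models used in the paper.
from transformers import pipeline

en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(text: str) -> str:
    """Return an English variant obtained by round-tripping through French."""
    pivot = en_to_fr(text)[0]["translation_text"]
    return fr_to_en(pivot)[0]["translation_text"]

augmented = back_translate("You are not welcome in this community.")
print(augmented)
```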
arXiv Detail & Related papers (2021-05-25T09:52:42Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
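A hedged PyTorch sketch of the general idea follows: output word vectors are composed from characters on the fly, so the softmax layer stores no vocabulary-sized embedding matrix. The bag-of-characters encoder here is an assumption, not the paper's grounded compositional architecture.
```python
# Sketch of a compositional output embedding: each word's output vector is built
# from its characters, so no V x d output matrix is stored.
# ASSUMPTION: a bag-of-character-embeddings encoder stands in for the paper's
# grounded compositional encoder.
import torch
import torch.nn as nn

class CompositionalOutput(nn.Module):
    def __init__(self, hidden_dim: int, num_chars: int = 128):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, hidden_dim)
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def word_vector(self, word: str) -> torch.Tensor:
        ids = torch.tensor([min(ord(c), 127) for c in word])
        return self.proj(self.char_emb(ids).mean(dim=0))

    def forward(self, hidden: torch.Tensor, candidate_words) -> torch.Tensor:
        # Logits over an arbitrary candidate vocabulary, built on the fly.
        emb = torch.stack([self.word_vector(w) for w in candidate_words])  # (V, d)
        return hidden @ emb.T                                              # (batch, V)

layer = CompositionalOutput(hidden_dim=32)
hidden_states = torch.randn(4, 32)                       # e.g. final LM hidden states
logits = layer(hidden_states, ["cat", "dog", "catnip"])  # works for any word list
print(logits.shape)  # torch.Size([4, 3])
```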
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- Know thy corpus! Robust methods for digital curation of Web corpora [0.0]
This paper proposes a novel framework for digital curation of Web corpora.
It provides robust estimation of their parameters, such as their composition and the lexicon.
arXiv Detail & Related papers (2020-03-13T17:21:57Z)