Related papers: Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection

URL: http://arxiv.org/abs/2201.10474v2
Date: Wed, 26 Jan 2022 18:46:26 GMT
Title: Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
Authors: Suchin Gururangan, Dallas Card, Sarah K. Dreier, Emily K. Gade, Leroy Z. Wang, Zeyu Wang, Luke Zettlemoyer, Noah A. Smith
Abstract summary: We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes are more likely to be classified as high quality. We argue that privileging any corpus as high quality entails a language ideology.
Score: 83.3580786484122
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Language models increasingly rely on massive web dumps for diverse text data. However, these sources are rife with undesirable content. As such, resources like Wikipedia, books, and newswire often serve as anchors for automatically selecting web text most suitable for language modeling, a process typically referred to as quality filtering. Using a new dataset of U.S. high school newspaper articles -- written by students from across the country -- we investigate whose language is preferred by the quality filter used for GPT-3. We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes are more likely to be classified as high quality. We then demonstrate that the filter's measurement of quality is unaligned with other sensible metrics, such as factuality or literary acclaim. We argue that privileging any corpus as high quality entails a language ideology, and more care is needed to construct training corpora for language models, with better transparency and justification for the inclusion or exclusion of various texts.

Related papers

WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages [62.1053122134059]
The paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages. We have developed a systematic data processing framework tailored for low-resource languages.
arXiv Detail & Related papers (2025-01-24T14:06:29Z)
How Good is Your Wikipedia? [13.814955569390207]
This paper critically examines the data quality of Wikipedia in a non-English setting by subjecting it to various quality filtering techniques. We find that data quality pruning is an effective means for resource-efficient training without hurting performance.
arXiv Detail & Related papers (2024-11-08T12:35:58Z)
Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings. An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts). This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with Effective Evaluation Model [40.23569361268597]
We propose EvalWeb, a new complete tool-chain for extracting clean Chinese text from noisy web data. We release ChineseWebText, the largest and latest large-scale high-quality Chinese web text dataset, which totals 1.42 TB; each text is associated with a quality score.
arXiv Detail & Related papers (2023-11-02T11:13:51Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Evaluating and Modeling Attribution for Cross-Lingual Question Answering [80.4807682093432]
This work is the first to study attribution for cross-lingual question answering. We collect data in 5 languages to assess the attribution level of a state-of-the-art cross-lingual QA system. We find that a substantial portion of the answers is not attributable to any retrieved passages.
arXiv Detail & Related papers (2023-05-23T17:57:46Z)
Does Corpus Quality Really Matter for Low-Resource Languages? [27.315905109092466]
The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl. Taking Basque as a case study, we explore tailored crawling (manually identifying and scraping websites with high-quality content) as an alternative to filtering CommonCrawl. Our new corpus, called EusCrawl, is similar in size to the Basque portion of popular multilingual corpora like CC100 and mC4.
arXiv Detail & Related papers (2022-03-15T17:40:27Z)
Text Style Transfer for Bias Mitigation using Masked Language Modeling [9.350763916068026]
We present a text style transfer model that can be used to automatically debias textual data. Our model solves such issues by combining latent content encoding with explicit keyword replacement.
arXiv Detail & Related papers (2022-01-21T11:06:33Z)
Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtask experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages. We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata. The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
Multilingual Answer Sentence Reranking via Automatically Translated Data [97.98885151955467]
We present a study on the design of multilingual Answer Sentence Selection (AS2) models, which are a core component of modern Question Answering (QA) systems. The main idea is to transfer data created from one resource-rich language, e.g., English, to other, less resource-rich languages.
arXiv Detail & Related papers (2021-02-20T03:52:08Z)
Improving Yorùbá Diacritic Restoration [3.301896537513352]
Yorùbá is a widely spoken West African language with a writing system rich in orthographic and tonal diacritics. Diacritic marks are commonly excluded from electronic texts due to limited device and application support as well as general education on proper usage. All pre-trained models, datasets and source-code have been released as an open-source project to advance efforts on Yorùbá language technology.
arXiv Detail & Related papers (2020-03-23T22:07:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.