Whose Language Counts as High Quality? Measuring Language Ideologies in
Text Data Selection
- URL: http://arxiv.org/abs/2201.10474v2
- Date: Wed, 26 Jan 2022 18:46:26 GMT
- Title: Whose Language Counts as High Quality? Measuring Language Ideologies in
Text Data Selection
- Authors: Suchin Gururangan, Dallas Card, Sarah K. Dreier, Emily K. Gade, Leroy
Z. Wang, Zeyu Wang, Luke Zettlemoyer, Noah A. Smith
- Abstract summary: We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes are more likely to be classified as high quality.
We argue that privileging any corpus as high quality entails a language ideology.
- Score: 83.3580786484122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models increasingly rely on massive web dumps for diverse text data.
However, these sources are rife with undesirable content. As such, resources
like Wikipedia, books, and newswire often serve as anchors for automatically
selecting web text most suitable for language modeling, a process typically
referred to as quality filtering. Using a new dataset of U.S. high school
newspaper articles -- written by students from across the country -- we
investigate whose language is preferred by the quality filter used for GPT-3.
We find that newspapers from larger schools, located in wealthier, educated,
and urban ZIP codes are more likely to be classified as high quality. We then
demonstrate that the filter's measurement of quality is unaligned with other
sensible metrics, such as factuality or literary acclaim. We argue that
privileging any corpus as high quality entails a language ideology, and more
care is needed to construct training corpora for language models, with better
transparency and justification for the inclusion or exclusion of various texts.
Related papers
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
(2024-03-26T12:47:39Z)
- A New Korean Text Classification Benchmark for Recognizing the Political Intents in Online Newspapers [6.633601941627045]
We present a novel Korean text classification dataset that contains various articles.
Our dataset contains 12,000 news articles that may contain political intentions, from the politics section of six of the most representative newspaper organizations in South Korea.
To the best of our knowledge, ours is the largest Korean news dataset that contains long text and addresses multi-task classification problems.
(2023-11-03T04:59:55Z)
- ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with Effective Evaluation Model [40.23569361268597]
We propose a new complete tool-chain EvalWeb to extract Chinese clean texts from noisy web data.
We release ChineseWebText, the largest and latest large-scale high-quality Chinese web text dataset, which totals 1.42 TB and associates each text with a quality score.
(2023-11-02T11:13:51Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
(2023-09-19T14:42:33Z)
- Evaluating and Modeling Attribution for Cross-Lingual Question Answering [80.4807682093432]
This work is the first to study attribution for cross-lingual question answering.
We collect data in 5 languages to assess the attribution level of a state-of-the-art cross-lingual QA system.
We find that a substantial portion of the answers is not attributable to any retrieved passages.
(2023-05-23T17:57:46Z)
- Does Corpus Quality Really Matter for Low-Resource Languages? [27.315905109092466]
The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl.
Taking Basque as a case study, we explore tailored crawling (manually identifying and scraping websites with high-quality content) as an alternative to filtering CommonCrawl.
Our new corpus, called EusCrawl, is similar in size to the Basque portion of popular multilingual corpora like CC100 and mC4.
(2022-03-15T17:40:27Z)
- Text Style Transfer for Bias Mitigation using Masked Language Modeling [9.350763916068026]
We present a text style transfer model that can be used to automatically debias textual data.
Our model solves such issues by combining latent content encoding with explicit keyword replacement.
(2022-01-21T11:06:33Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
(2021-09-20T10:06:46Z)
- Multilingual Answer Sentence Reranking via Automatically Translated Data [97.98885151955467]
We present a study on the design of multilingual Answer Sentence Selection (AS2) models, which are a core component of modern Question Answering (QA) systems.
The main idea is to transfer data created from one resource-rich language, e.g., English, to other languages that are less rich in resources.
(2021-02-20T03:52:08Z)
- Improving Yorùbá Diacritic Restoration [3.301896537513352]
Yorùbá is a widely spoken West African language with a writing system rich in orthographic and tonal diacritics.
Diacritic marks are commonly excluded from electronic texts due to limited device and application support as well as general education on proper usage.
All pre-trained models, datasets and source-code have been released as an open-source project to advance efforts on Yorùbá language technology.
(2020-03-23T22:07:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.