Data Caricatures: On the Representation of African American Language in Pretraining Corpora
- URL: http://arxiv.org/abs/2503.10789v1
- Date: Thu, 13 Mar 2025 18:31:10 GMT
- Title: Data Caricatures: On the Representation of African American Language in Pretraining Corpora
- Authors: Nicholas Deas, Blake Vente, Amith Ananthram, Jessica A. Grieser, Desmond Patton, Shana Kleiner, James Shepard, Kathleen McKeown
- Abstract summary: We evaluate the quantity and quality of African American Language (AAL) representation in 12 predominantly English, open-source pretraining corpora. We find that AAL is underrepresented in all evaluated pretraining corpora compared to US demographics, constituting as little as 0.007% of documents.
- Score: 8.238934128943123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With a combination of quantitative experiments, human judgments, and qualitative analyses, we evaluate the quantity and quality of African American Language (AAL) representation in 12 predominantly English, open-source pretraining corpora. We specifically focus on the sources, variation, and naturalness of included AAL texts representing the AAL-speaking community. We find that AAL is underrepresented in all evaluated pretraining corpora compared to US demographics, constituting as little as 0.007% of documents. We also find that more than 25% of AAL texts in C4 may be inappropriate for LLMs to generate, as they risk reinforcing harmful stereotypes. Finally, we find that most automated language, toxicity, and quality filters are more likely to conserve White Mainstream English (WME) texts over AAL in pretraining corpora.
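The filter finding lends itself to a concrete check. Below is a minimal sketch of the retention-rate comparison the abstract describes: apply a corpus filter to paired AAL and WME texts and compare keep rates. The toy filter and example sentences are illustrative stand-ins I made up, not the paper's actual filters or data.

```python
# Sketch of a filter-retention comparison over paired dialect texts.
# The filter below is a stand-in heuristic, NOT the paper's pipeline;
# the paper evaluates real language, toxicity, and quality filters.

def toy_quality_filter(text: str) -> bool:
    """Stand-in filter: drops text containing surface features a naive
    'quality' model might penalize. Real filters are learned classifiers."""
    flagged = {"finna", "workin", "ain't"}
    return not any(tok in flagged for tok in text.lower().split())

def retention_rate(texts: list[str]) -> float:
    """Fraction of texts the filter keeps."""
    return sum(toy_quality_filter(t) for t in texts) / len(texts)

# Hypothetical paired texts (not drawn from the paper's corpora).
aal_texts = ["He be workin late every night", "They was finna head out"]
wme_texts = ["He is usually working late every night",
             "They were about to head out"]

print(f"AAL retention: {retention_rate(aal_texts):.2f}")  # 0.00 here
print(f"WME retention: {retention_rate(wme_texts):.2f}")  # 1.00 here
```

A real replication would substitute the learned filters the paper evaluates and aggregate retention rates over aligned AAL/WME corpora rather than a pair of examples.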
Related papers
- Evaluating the Usage of African-American Vernacular English in Large Language Models [5.242425502046959]
We investigate how accurately large language models (LLMs) represent African American Vernacular English (AAVE). We compare their usage of AAVE to that of humans who are native AAVE speakers. We find that, in many cases, there are substantial differences between AAVE usage in LLMs and humans.
arXiv Detail & Related papers (2026-02-25T01:28:01Z)
- SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages? [37.04140252339949]
We develop SSA-COMET and SSA-COMET-QE, improved reference-based and reference-free evaluation metrics. Our experimental results show that SSA-COMET models significantly outperform AfriCOMET. All resources are released under open licenses to support future research.
arXiv Detail & Related papers (2025-06-05T02:16:56Z)
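For readers who want to try a learned MT metric of this family, here is a hedged sketch using the open-source `unbabel-comet` package. The checkpoint name is the generic public `Unbabel/wmt22-comet-da`, used as a stand-in since the released SSA-COMET checkpoint names are not verified here.

```python
# Learned MT metric sketch with the unbabel-comet package; the generic
# wmt22-comet-da checkpoint stands in for an SSA-COMET model.
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

data = [{
    "src": "<source sentence, e.g. in Hausa>",  # placeholders, not real data
    "mt": "<machine translation to score>",
    "ref": "<human reference translation>",     # QE variants use a different,
                                                # reference-free checkpoint
}]
output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 -> CPU
print(output.scores, output.system_score)
```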
- Rejected Dialects: Biases Against African American Language in Reward Models [15.888517781590398]
We introduce a framework for evaluating dialect biases in reward models. We conduct experiments comparing reward model preferences and behavior on paired White Mainstream English (WME) and AAL corpora, both machine-translated and human-written. We show that reward models are less aligned with human preferences when processing AAL texts than WME ones.
arXiv Detail & Related papers (2025-02-18T13:45:42Z)
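A minimal sketch of the paired-scoring setup this entry describes: score a WME response and an AAL counterpart with an open reward model and compare. The checkpoint and example pair are assumptions for illustration, not the paper's models or data.

```python
# Paired reward scoring sketch; checkpoint and examples are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # public reward model
tok = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForSequenceClassification.from_pretrained(NAME).eval()

def reward(prompt: str, response: str) -> float:
    """Scalar reward the model assigns to a (prompt, response) pair."""
    inputs = tok(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0, 0].item()

prompt = "How was your day?"
wme = "I am really tired; it has been a long day."
aal = "I'm real tired, it been a long day."  # illustrative AAL-like sentence

print("WME reward:", reward(prompt, wme))
print("AAL reward:", reward(prompt, aal))
# A consistent gap over many aligned pairs, not one example, would be the
# signal of dialect bias the paper measures.
```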
- Investigating the Impact of Language-Adaptive Fine-Tuning on Sentiment Analysis in Hausa Language Using AfriBERTa [2.5055584842618175]
Sentiment analysis (SA) plays a vital role in Natural Language Processing (NLP) by identifying sentiments expressed in text.
This study investigates the effectiveness of Language-Adaptive Fine-Tuning (LAFT) to improve SA performance in Hausa.
arXiv Detail & Related papers (2025-01-19T11:52:46Z)
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A predicts human judgements of quality better than traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
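A hedged sketch of the LLM-as-judge pattern in the spirit of CLAIR-A (not the authors' exact prompt or model): ask an LLM for a 0-100 score plus a one-sentence rationale in JSON.

```python
# LLM-as-judge sketch; prompt, model name, and JSON schema are assumptions,
# not the authors' released implementation.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

PROMPT = (
    "You are judging an audio caption against a reference.\n"
    "Candidate: {cand}\nReference: {ref}\n"
    'Reply with JSON: {{"score": <integer 0-100>, "reason": "<one sentence>"}}'
)

def judge(cand: str, ref: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user",
                   "content": PROMPT.format(cand=cand, ref=ref)}],
        response_format={"type": "json_object"},  # force parseable JSON
    )
    return json.loads(resp.choices[0].message.content)

print(judge("a dog barks twice", "two dog barks, far away"))
```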
- Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages [11.512925610019474]
We compare four of the most relevant large, web-crawled corpora across eleven lower-resourced European languages.
We find that there are clear differences in quality of the corpora, with MaCoCu and OSCAR obtaining the best results.
We conclude that, in our experiments, the quality of the web-crawled corpora does not seem to play a significant role when training LMs.
arXiv Detail & Related papers (2024-03-13T16:56:33Z)
- LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer language generation and instruction-following capabilities to a non-English language.
We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer.
We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
- Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages.
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba).
We examine cross-lingual reranking with queries in English and passages in the African languages.
arXiv Detail & Related papers (2023-12-26T18:38:54Z)
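A minimal listwise-reranking sketch of the kind of setup this entry describes, with an English query and passages in an African language; the model, prompt wording, and output parsing are assumptions, not the paper's implementation.

```python
# Listwise LLM reranking sketch (RankGPT-style); all names are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def rerank(query: str, passages: list[str]) -> list[int]:
    """Ask an LLM to order passage indices by relevance to the query."""
    listing = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages))
    prompt = (
        f"Query (English): {query}\n"
        f"Passages (e.g. Swahili):\n{listing}\n"
        "Order the passage indices from most to least relevant. "
        "Answer with a comma-separated list of indices only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
    )
    # Brittle parse for illustration; real code would validate the output.
    return [int(tok) for tok in resp.choices[0].message.content.split(",")]
```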
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Evaluation of African American Language Bias in Natural Language Generation [9.823804049740916]
We evaluate how well LLMs understand African American Language (AAL) in comparison to their performance on White Mainstream English (WME).
Our contributions include: (1) evaluation of six pre-trained, large language models on the two language generation tasks; (2) a novel dataset of AAL text from multiple contexts with human-annotated counterparts in WME; and (3) documentation of model performance gaps that suggest bias and identification of trends in lack of understanding of AAL features.
arXiv Detail & Related papers (2023-05-23T17:34:37Z)
- BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric [66.73705349465207]
End-to-end speech-to-speech translation (S2ST) is generally evaluated with text-based metrics.
We propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems.
arXiv Detail & Related papers (2022-12-16T14:00:26Z)
- Sign Language to Text Conversion in Real Time using Transfer Learning [0.0]
We propose a deep learning model trained on American Sign Language.
Accuracy improves from 94% with a CNN to 98.7% with transfer learning.
arXiv Detail & Related papers (2022-11-13T17:20:19Z)
- Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
- Intent Classification Using Pre-Trained Embeddings For Low Resource Languages [67.40810139354028]
Building Spoken Language Understanding systems that do not rely on language-specific Automatic Speech Recognition is an important yet less explored problem in language processing.
We present a comparative study aimed at employing a pre-trained acoustic model to perform Spoken Language Understanding in low resource scenarios.
We perform experiments across three different languages: English, Sinhala, and Tamil each with different data sizes to simulate high, medium, and low resource scenarios.
arXiv Detail & Related papers (2021-10-18T13:06:59Z)
- AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages [75.08199398141744]
We present AmericasNLI, an extension of XNLI (Conneau et al.), to 10 indigenous languages of the Americas.
We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches.
We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
arXiv Detail & Related papers (2021-04-18T05:32:28Z)
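A short sketch of the zero-shot NLI setup the AmericasNLI entry describes, using a public XNLI-fine-tuned XLM-R checkpoint (an assumption; the paper tests several zero-shot and translation-based approaches, and its test data covers 10 indigenous languages rather than the placeholder pair below).

```python
# Zero-shot NLI sketch; checkpoint and premise/hypothesis pair are
# illustrative assumptions, not AmericasNLI data.
from transformers import pipeline

nli = pipeline("text-classification", model="joeddav/xlm-roberta-large-xnli")

pair = {"text": "Una casa muy grande.",      # premise (placeholder)
        "text_pair": "La casa es pequeña."}  # hypothesis (placeholder)
print(nli(pair))  # e.g. [{'label': 'contradiction', 'score': ...}]
```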
This list is automatically generated from the titles and abstracts of the papers on this site.