Related papers: Evaluation of African American Language Bias in Natural Language Generation

Evaluation of African American Language Bias in Natural Language Generation

URL: http://arxiv.org/abs/2305.14291v2
Date: Mon, 13 Nov 2023 01:41:43 GMT
Title: Evaluation of African American Language Bias in Natural Language Generation
Authors: Nicholas Deas and Jessi Grieser and Shana Kleiner and Desmond Patton and Elsbeth Turcan and Kathleen McKeown
Abstract summary: We evaluate how well LLMs understand African American Language (AAL) in comparison to their performance on White Mainstream English (WME) Our contributions include: (1) evaluation of six pre-trained, large language models on the two language generation tasks; (2) a novel dataset of AAL text from multiple contexts with human-annotated counterparts in WME; and (3) documentation of model performance gaps that suggest bias and identification of trends in lack of understanding of AAL features.
Score: 9.823804049740916
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We evaluate how well LLMs understand African American Language (AAL) in comparison to their performance on White Mainstream English (WME), the encouraged "standard" form of English taught in American classrooms. We measure LLM performance using automatic metrics and human judgments for two tasks: a counterpart generation task, where a model generates AAL (or WME) given WME (or AAL), and a masked span prediction (MSP) task, where models predict a phrase that was removed from their input. Our contributions include: (1) evaluation of six pre-trained, large language models on the two language generation tasks; (2) a novel dataset of AAL text from multiple contexts (social media, hip-hop lyrics, focus groups, and linguistic interviews) with human-annotated counterparts in WME; and (3) documentation of model performance gaps that suggest bias and identification of trends in lack of understanding of AAL features.

Related papers

Benchmarking Open-Source Large Language Models for Persian in Zero-Shot and Few-Shot Learning [0.0]
This paper presents a benchmark of several open-source Large Language Models (LLMs) for Persian Natural Language Processing tasks.<n>We evaluate models across a range of tasks including sentiment analysis, named entity recognition, reading comprehension, and question answering.<n>The results reveal that Gemma 2 consistently outperforms other models across nearly all tasks in both learning paradigms.
arXiv Detail & Related papers (2025-10-05T10:10:04Z)
Comparing LLM Text Annotation Skills: A Study on Human Rights Violations in Social Media Data [2.812898346527047]
This study investigates the capabilities of large language models (LLMs) for zero-shot and few-shot annotation of social media posts in Russian and Ukrainian.<n>To evaluate the effectiveness of these models, their annotations are compared against a gold standard set of human double-annotated labels.<n>The study explores the unique patterns of errors and disagreements exhibited by each model, offering insights into their strengths, limitations, and cross-linguistic adaptability.
arXiv Detail & Related papers (2025-05-15T13:10:47Z)
Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English [66.97110551643722]
We investigate dialectal disparities in Large Language Models (LLMs) reasoning tasks. We find that LLMs produce less accurate responses and simpler reasoning chains and explanations for AAE inputs. These findings highlight systematic differences in how LLMs process and reason about different language varieties.
arXiv Detail & Related papers (2025-03-06T05:15:34Z)
Rejected Dialects: Biases Against African American Language in Reward Models [15.888517781590398]
We introduce a framework for evaluating dialect biases in reward models. We conduct experiments comparing reward model preferences and behavior on paired White Mainstream English (WME) and both machine-translated and human-written AAL corpora. We show that reward models are less aligned with human preferences when processing AAL texts vs. WME ones.
arXiv Detail & Related papers (2025-02-18T13:45:42Z)
One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [68.33068005789116]
We present the first study aimed at objectively assessing the fairness and robustness of Large Language Models (LLMs) in handling dialects in canonical reasoning tasks. We hire AAVE speakers, including experts with computer science backgrounds, to rewrite seven popular benchmarks, such as HumanEval and GSM8K. Our findings reveal that textbfalmost all of these widely used models show significant brittleness and unfairness to queries in AAVE.
arXiv Detail & Related papers (2024-10-14T18:44:23Z)
Generating bilingual example sentences with large language models as lexicography assistants [2.6550899846546527]
We present a study of LLMs' performance in generating and rating example sentences for bilingual dictionaries across languages with varying resource levels. We evaluate the quality of LLM-generated examples against the GDEX (Good Dictionary EXample) criteria: typicality, informativeness, and intelligibility.
arXiv Detail & Related papers (2024-10-04T06:45:48Z)
Open Generative Large Language Models for Galician [1.3049334790726996]
Large language models (LLMs) have transformed natural language processing. Yet, their predominantly English-centric training has led to biases and performance disparities across languages. This imbalance marginalizes minoritized languages, making equitable access to NLP technologies more difficult for languages with lower resources, such as Galician. We present the first two generative LLMs focused on Galician to bridge this gap.
arXiv Detail & Related papers (2024-06-19T23:49:56Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score. Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
Walia-LLM: Enhancing Amharic-LLaMA by Integrating Task-Specific and Generative Datasets [2.8123257987021058]
We focus on enhancing the LLaMA-2-Amharic model by integrating task-specific and generative datasets. We compile an Amharic instruction fine-tuning dataset and fine-tuned LLaMA-2-Amharic model. The fine-tuned model shows promising results in different NLP tasks.
arXiv Detail & Related papers (2024-02-12T19:25:11Z)
YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters. YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline. The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale. Our empirical study shows CoAnnotating to be an effective means to allocate work from results on different datasets, with up to 21% performance improvement over random baseline.
arXiv Detail & Related papers (2023-10-24T08:56:49Z)
Pre-Trained Language-Meaning Models for Multilingual Parsing and Generation [14.309869321407522]
We introduce multilingual pre-trained language-meaning models based on Discourse Representation Structures (DRSs) Since DRSs are language neutral, cross-lingual transfer learning is adopted to further improve the performance of non-English tasks. automatic evaluation results show that our approach achieves the best performance on both the multilingual DRS parsing and DRS-to-text generation tasks.
arXiv Detail & Related papers (2023-05-31T19:00:33Z)
LERT: A Linguistically-motivated Pre-trained Language Model [67.65651497173998]
We propose LERT, a pre-trained language model that is trained on three types of linguistic features along with the original pre-training task. We carried out extensive experiments on ten Chinese NLU tasks, and the experimental results show that LERT could bring significant improvements.
arXiv Detail & Related papers (2022-11-10T05:09:16Z)
Underspecification in Language Modeling Tasks: A Causality-Informed Study of Gendered Pronoun Resolution [0.0]
We introduce a simple causal mechanism to describe the role underspecification plays in the generation of spurious correlations. Despite its simplicity, our causal model directly informs the development of two lightweight black-box evaluation methods.
arXiv Detail & Related papers (2022-09-30T23:10:11Z)
Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes. With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech. We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context. It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts. Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.