Related papers: Analyzing Dialectical Biases in LLMs for Knowledge and Reasoning Benchmarks

URL: http://arxiv.org/abs/2510.00962v1
Date: Wed, 01 Oct 2025 14:35:16 GMT
Title: Analyzing Dialectical Biases in LLMs for Knowledge and Reasoning Benchmarks
Authors: Eileen Pan, Anna Seo Gyeong Choi, Maartje ter Hoeve, Skyler Seto, Allison Koenecke
Abstract summary: We analyze the effects of typifying "standard" American English language questions as non-"standard" dialectal variants on multiple choice question answering tasks. We find that individual grammatical rules have varied effects on performance, but some are more consequential than others.
Score: 13.576753089930499
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are ubiquitous in modern-day natural language processing. However, previous work has shown degraded LLM performance for under-represented English dialects. We analyze the effects of typifying "standard" American English language questions as non-"standard" dialectal variants on multiple choice question answering tasks and find up to a 20% reduction in accuracy. Additionally, we investigate the grammatical basis of under-performance in non-"standard" English questions. We find that individual grammatical rules have varied effects on performance, but some are more consequential than others: three specific grammar rules (existential "it", zero copula, and y'all) can explain the majority of performance degradation observed in multiple dialects. We call for future work to investigate bias mitigation methods focused on individual, high-impact grammatical structures.

Related papers

LingGym: How Far Are LLMs from Thinking Like Field Linguists? [20.482844306874743]
This paper introduces LingGym, a new benchmark that evaluates LLMs' capacity for meta-linguistic reasoning. We present a controlled evaluation task: Word-Gloss Inference, in which the model must infer a missing word and gloss from context. Our results show that incorporating structured linguistic cues leads to consistent improvements in reasoning performance across all models.
arXiv Detail & Related papers (2025-11-01T00:59:13Z)
Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar [27.3347020320559]
We measure the perplexity of language models when confronted with the "first person psych predicate restriction" grammar point in Japanese. We show in further experiments that language models will use alternative grammar patterns in order to produce grammatical sentences when tokenization issues prevent the most natural sentence from being output.
arXiv Detail & Related papers (2025-05-26T07:08:47Z)
Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models? [59.970391602080205]
Despite multilingual training, LRMs tend to default to reasoning in high-resource languages at test time. Cultural reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior.
arXiv Detail & Related papers (2025-05-23T02:46:18Z)
Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English [66.97110551643722]
We investigate dialectal disparities in Large Language Models (LLMs) reasoning tasks. We find that LLMs produce less accurate responses and simpler reasoning chains and explanations for AAE inputs. These findings highlight systematic differences in how LLMs process and reason about different language varieties.
arXiv Detail & Related papers (2025-03-06T05:15:34Z)
Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [68.33068005789116]
We introduce ReDial, a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE. We evaluate widely used models, including GPT, Claude, Llama, Mistral, and the Phi model families. Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries.
arXiv Detail & Related papers (2024-10-14T18:44:23Z)
Language models align with human judgments on key grammatical constructions [24.187439110055404]
We re-evaluate large language models' (LLMs) performance using well-established practices. We find that models achieve high accuracy overall, but also capture fine-grained variation in human linguistic judgments.
arXiv Detail & Related papers (2024-01-19T19:36:54Z)
Multi-VALUE: A Framework for Cross-Dialectal English NLP [49.55176102659081]
Multi-VALUE is a controllable rule-based translation system spanning 50 English dialects. Stress tests reveal significant performance disparities for leading models on non-standard dialects. We partner with native speakers of Chicano and Indian English to release new gold-standard variants of the popular CoQA task.
arXiv Detail & Related papers (2022-12-15T18:17:01Z)
Local Structure Matters Most in Most Languages [15.870989191524094]
We replicate a study on the importance of local structure, and the relative unimportance of global structure, in a multilingual setting. We find that the phenomenon observed in English broadly translates to over 120 languages.
arXiv Detail & Related papers (2022-11-09T16:58:44Z)
CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts. CLSE covers 74 different semantic types to support various applications from airline ticketing to video games. We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z)
VALUE: Understanding Dialect Disparity in NLU [50.35526025326337]
We construct rules for 11 features of African American Vernacular English (AAVE). We recruit fluent AAVE speakers to validate each feature transformation via linguistic acceptability judgments. Experiments show that these new dialectal features can lead to a drop in model performance.
arXiv Detail & Related papers (2022-04-06T18:30:56Z)
Word Frequency Does Not Predict Grammatical Knowledge in Language Models [2.1984302611206537]
We investigate whether there are systematic sources of variation in the language models' accuracy. We find that certain nouns are systematically understood better than others, an effect which is robust across grammatical tasks and different language models. We find that a novel noun's grammatical properties can be few-shot learned from various types of training data.
arXiv Detail & Related papers (2020-10-26T19:51:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.