Should LLMs, $\textit{like}$, Generate How Users Talk? Building Dialect-Accurate Dialog[ue]s Beyond the American Default with MDial
- URL: http://arxiv.org/abs/2601.22888v1
- Date: Fri, 30 Jan 2026 12:08:08 GMT
- Title: Should LLMs, $\textit{like}$, Generate How Users Talk? Building Dialect-Accurate Dialog[ue]s Beyond the American Default with MDial
- Authors: Jio Oh, Paul Vicinanza, Thomas Butler, Steven Euijong Whang, Dezhi Hong, Amani Namboori,
- Abstract summary: More than 80% of the 1.6 billion English speakers do not use Standard American English. We introduce $\textbf{MDial}$, the first large-scale framework for generating multi-dialectal conversational data.
- Score: 13.016574005932311
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: More than 80% of the 1.6 billion English speakers do not use Standard American English (SAE) and experience higher failure rates and stereotyped responses when interacting with LLMs as a result. Yet multi-dialectal performance remains underexplored. We introduce $\textbf{MDial}$, the first large-scale framework for generating multi-dialectal conversational data encompassing the three pillars of written dialect -- lexical (vocabulary), orthographic (spelling), and morphosyntactic (grammar) features -- for nine English dialects. Partnering with native linguists, we design an annotated and scalable rule-based LLM transformation to ensure precision. Our approach challenges the assumption that models should mirror users' morphosyntactic features, showing that up to 90% of the grammatical features of a dialect should not be reproduced by models. Independent evaluations confirm data quality, with annotators preferring MDial outputs over prior methods in 98% of pairwise comparisons for dialect naturalness. Using this pipeline, we construct the dialect-parallel $\textbf{MDialBench}$mark with 50k+ dialogs, resulting in 97k+ QA pairs, and evaluate 17 LLMs on dialect identification and response generation tasks. Even frontier models achieve under 70% accuracy, fail to reach 50% for Canadian English, and systematically misclassify non-SAE dialects as American or British. As dialect identification underpins natural language understanding, these errors risk cascading failures into downstream tasks.
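The abstract's "annotated and scalable rule-based LLM transformation" and its finding that most morphosyntactic features should not be mirrored can be illustrated with a toy sketch. The rules below are invented for illustration only (the actual MDial rules are authored with native linguists and target nine dialects); the `mirror` flag stands in for the paper's distinction between features a model may reproduce in its responses and those it should not.

```python
import re

# Hypothetical rule set illustrating the three pillars named in the abstract.
# Each rule: (pillar, SAE pattern, dialect form, mirror-in-response?)
RULES = [
    ("lexical",         r"\btruck\b",  "lorry",  True),
    ("orthographic",    r"\bcolor\b",  "colour", True),
    # Per the abstract, up to 90% of morphosyntactic features should NOT
    # be reproduced by models; such rules are flagged mirror=False.
    ("morphosyntactic", r"\bisn't\b",  "ain't",  False),
]

def to_dialect(text: str, mirror_only: bool = False) -> str:
    """Apply rule-based dialect transformations.

    With mirror_only=True, restrict to features a model may safely
    reproduce in its own responses."""
    for _pillar, pattern, replacement, mirror in RULES:
        if mirror_only and not mirror:
            continue
        text = re.sub(pattern, replacement, text)
    return text

print(to_dialect("The color of the truck isn't right."))
# → The colour of the lorry ain't right.
print(to_dialect("The color of the truck isn't right.", mirror_only=True))
# → The colour of the lorry isn't right.
```

The split between transformation rules and a "mirror" policy is the key design point: the same rule base can render user-side utterances in full dialect while constraining model-side responses to the safe subset.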
Related papers
- Making Large Language Models Speak Tulu: Structured Prompting for an Extremely Low-Resource Language [1.0742675209112622]
We examine whether structured prompts alone can elicit basic conversational ability under controlled prompting. We combine explicit grammar documentation, negative constraints to suppress high-probability tokens from related languages, romanization standardization, and quality-controlled synthetic data generation via self-play. Our approach reduces vocabulary contamination from 80% to 5% while achieving 85% grammatical accuracy.
arXiv Detail & Related papers (2026-02-17T06:20:09Z)
- DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation [111.94720088481614]
Can multimodal generative models effectively produce content given dialectal textual input? We construct a new large-scale benchmark spanning six common English dialects and design a general encoder-based mitigation strategy for multimodal generative models.
arXiv Detail & Related papers (2025-10-16T17:56:55Z)
- Vuyko Mistral: Adapting LLMs for Low-Resource Dialectal Translation [0.0]
This paper introduces the first effort to adapt large language models to the Ukrainian dialect Hutsul. We created a parallel corpus of 9852 dialect-to-standard Ukrainian sentence pairs and a dictionary of 7320 dialectal word mappings.
arXiv Detail & Related papers (2025-06-09T10:30:35Z)
- LLM-Based Evaluation of Low-Resource Machine Translation: A Reference-less Dialect Guided Approach with a Refined Sylheti-English Benchmark [1.3927943269211591]
We propose a comprehensive framework that enhances Large Language Model (LLM)-based machine translation evaluation. We extend the ONUBAD dataset by incorporating Sylheti-English sentence pairs, corresponding machine translations, and Direct Assessment (DA) scores annotated by native speakers. Our evaluation shows that the proposed pipeline consistently outperforms existing methods, achieving the highest gain of +0.1083 in Spearman correlation.
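As a note on the metric reported here: Spearman correlation is the Pearson correlation computed on rank vectors, which is why it suits comparing automatic scores against human DA judgments. A minimal stdlib sketch, assuming all scores are distinct (no tie handling):

```python
def ranks(xs):
    # Rank values 1..n by sorted order; assumes all values are distinct.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    # Pearson correlation on the rank vectors.
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

print(spearman([1, 2, 3, 4], [2, 1, 4, 3]))  # ≈ 0.6
```

Production evaluations would instead use a library routine with proper tie handling (average ranks), but the computation above is the core of the metric.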
arXiv Detail & Related papers (2025-05-18T07:24:13Z)
- Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [68.33068005789116]
We introduce ReDial, a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE. We evaluate widely used models, including the GPT, Claude, Llama, Mistral, and Phi model families. Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries.
arXiv Detail & Related papers (2024-10-14T18:44:23Z)
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly created English and multilingual prompts. We find that Llama Instruct and Mistral models exhibit high degrees of language confusion, and that language confusion can be partially mitigated via few-shot prompting, multilingual SFT, and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
- Evaluating Dialect Robustness of Language Models via Conversation Understanding [2.8514881296685113]
We use English-language (US English or Indian English) conversations between humans who play the word-guessing game of 'taboo'. We formulate two evaluative tasks: target word prediction (TWP) ($\textit{i.e.}$, predict the masked target word in a conversation) and target word selection (TWS) ($\textit{i.e.}$, select the most likely masked target word in a conversation). We create two subsets: en-MV (where en-US is transformed to include dialectal information) and en-TR (where dialectal information is
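The TWP/TWS distinction above can be made concrete with a toy sketch: TWP scores a single candidate against a conversation, while TWS selects the best candidate from a fixed set. The keyword associations and scoring function below are invented for this sketch; the paper itself evaluates language models on real 'taboo' transcripts.

```python
# Invented word-to-clue associations standing in for a model's knowledge.
ASSOCIATIONS = {
    "dog": {"barks", "wags", "tail"},
    "cat": {"purrs", "tail"},
    "car": {"engine", "wheels"},
}

def twp_score(conversation: str, candidate: str) -> int:
    """Target word prediction: how well does one candidate fit the clues?"""
    tokens = set(conversation.lower().split())
    return len(tokens & ASSOCIATIONS.get(candidate, set()))

def tws(conversation: str, candidates: list[str]) -> str:
    """Target word selection: pick the most likely masked word."""
    return max(candidates, key=lambda c: twp_score(conversation, c))

print(tws("it barks and wags its tail", ["dog", "cat", "car"]))  # → dog
```

The point of the two task framings is that TWS (selection among candidates) is easier than TWP (open prediction), so comparing a model's performance on both across dialect variants isolates how much dialectal surface forms degrade its understanding.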
arXiv Detail & Related papers (2024-05-09T11:38:23Z)
- Task-Agnostic Low-Rank Adapters for Unseen English Dialects [52.88554155235167]
Large Language Models (LLMs) are trained on corpora disproportionately weighted in favor of Standard American English.
By disentangling dialect-specific and cross-dialectal information, HyperLoRA improves generalization to unseen dialects in a task-agnostic fashion.
arXiv Detail & Related papers (2023-11-02T01:17:29Z)
- Towards spoken dialect identification of Irish [5.1121440213561335]
The Irish language is rich in its diversity of dialects and accents.
A recent study investigating dialect bias in Irish ASR found that performance for the Ulster dialect was consistently worse than for the Connacht or Munster dialects.
The present experiments investigate spoken dialect identification of Irish, with a view to incorporating such a system into the speech recognition pipeline.
arXiv Detail & Related papers (2023-07-14T16:03:09Z)
- Multi-VALUE: A Framework for Cross-Dialectal English NLP [49.55176102659081]
Multi-VALUE is a controllable rule-based translation system spanning 50 English dialects.
Stress tests reveal significant performance disparities for leading models on non-standard dialects.
We partner with native speakers of Chicano and Indian English to release new gold-standard variants of the popular CoQA task.
arXiv Detail & Related papers (2022-12-15T18:17:01Z)
- VALUE: Understanding Dialect Disparity in NLU [50.35526025326337]
We construct rules for 11 features of African American Vernacular English (AAVE).
We recruit fluent AAVE speakers to validate each feature transformation via linguistic acceptability judgments.
Experiments show that these new dialectal features can lead to a drop in model performance.
arXiv Detail & Related papers (2022-04-06T18:30:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.