Make Every Letter Count: Building Dialect Variation Dictionaries from Monolingual Corpora
- URL: http://arxiv.org/abs/2509.17855v1
- Date: Mon, 22 Sep 2025 14:49:08 GMT
- Title: Make Every Letter Count: Building Dialect Variation Dictionaries from Monolingual Corpora
- Authors: Robert Litschko, Verena Blaschke, Diana Burkhardt, Barbara Plank, Diego Frassinelli
- Abstract summary: We use Bavarian as a case study and investigate the lexical dialect understanding capability of Large Language Models (LLMs). We introduce DiaLemma, a novel annotation framework for creating dialect variation dictionaries from monolingual data only. We evaluate how well nine state-of-the-art LLMs can judge Bavarian terms as dialect translations, inflected variants, or unrelated forms of a given German lemma.
- Score: 38.54622638611305
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dialects exhibit a substantial degree of variation due to the lack of a standard orthography. At the same time, the ability of Large Language Models (LLMs) to process dialects remains largely understudied. To address this gap, we use Bavarian as a case study and investigate the lexical dialect understanding capability of LLMs by examining how well they recognize and translate dialectal terms across different parts-of-speech. To this end, we introduce DiaLemma, a novel annotation framework for creating dialect variation dictionaries from monolingual data only, and use it to compile a ground truth dataset consisting of 100K human-annotated German-Bavarian word pairs. We evaluate how well nine state-of-the-art LLMs can judge Bavarian terms as dialect translations, inflected variants, or unrelated forms of a given German lemma. Our results show that LLMs perform best on nouns and lexically similar word pairs, and struggle most in distinguishing between direct translations and inflected variants. Interestingly, providing additional context in the form of example usages improves the translation performance, but reduces their ability to recognize dialect variants. This study highlights the limitations of LLMs in dealing with orthographic dialect variation and emphasizes the need for future work on adapting LLMs to dialects.
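To make the evaluation setup concrete, here is a minimal sketch of the three-way judgment task described in the abstract. The prompt wording and the `call_llm` helper are illustrative assumptions, not the paper's actual prompts or models.

```python
# Minimal sketch (not the paper's exact prompt) of judging whether a Bavarian
# term is a dialect translation, an inflected variant, or unrelated to a
# given German lemma. `call_llm` is a hypothetical stand-in for any LLM client.

LABELS = {"translation", "inflected variant", "unrelated"}

PROMPT_TEMPLATE = (
    "German lemma: {lemma}\n"
    "Bavarian term: {term}\n"
    "Is the Bavarian term a dialect translation of the lemma, "
    "an inflected variant of it, or unrelated? "
    "Answer with exactly one of: translation, inflected variant, unrelated."
)

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's chat API."""
    raise NotImplementedError

def judge_pair(lemma: str, term: str) -> str:
    """Return the model's label for a German-Bavarian word pair."""
    answer = call_llm(PROMPT_TEMPLATE.format(lemma=lemma, term=term)).strip().lower()
    # Fall back to 'unrelated' if the model does not follow the label set.
    return answer if answer in LABELS else "unrelated"

# Example: judge_pair("Mädchen", "Madl") should ideally return "translation".
```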
Related papers
- Steering LLMs toward Korean Local Speech: Iterative Refinement Framework for Faithful Dialect Translation [17.99472063920348]
DIA-REFINE is a framework for goal-directed, inclusive dialect translation. We introduce the dialect fidelity score (DFS) to quantify linguistic shift and the target dialect ratio (TDR) to measure the success of dialect translation. Our work establishes a robust framework for goal-directed, inclusive dialect translation.
arXiv Detail & Related papers (2025-11-10T03:52:24Z) - LingGym: How Far Are LLMs from Thinking Like Field Linguists? [20.482844306874743]
This paper introduces LingGym, a new benchmark that evaluates LLMs' capacity for meta-linguistic reasoning. We present a controlled evaluation task: Word-Gloss Inference, in which the model must infer a missing word and gloss from context. Our results show that incorporating structured linguistic cues leads to consistent improvements in reasoning performance across all models.
arXiv Detail & Related papers (2025-11-01T00:59:13Z) - Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically [58.019484208091534]
Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. It remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech.
arXiv Detail & Related papers (2025-05-26T07:21:20Z) - Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English [66.97110551643722]
We investigate dialectal disparities in Large Language Model (LLM) reasoning tasks, using African American English (AAE) as a case study. We find that LLMs produce less accurate responses and simpler reasoning chains and explanations for AAE inputs. These findings highlight systematic differences in how LLMs process and reason about different language varieties.
arXiv Detail & Related papers (2025-03-06T05:15:34Z) - Multilingual LLMs Struggle to Link Orthography and Semantics in Bilingual Word Processing [19.6191088446367]
We evaluate how multilingual Large Language Models (LLMs) handle lexical ambiguity in bilingual word processing, focusing on English-Spanish, English-French, and English-German cognates, non-cognates, and interlingual homographs. We find that models opt for different strategies when interpreting English and non-English homographs, highlighting the lack of a unified approach to handling cross-lingual ambiguities.
arXiv Detail & Related papers (2025-01-15T20:22:35Z) - Randomly Sampled Language Reasoning Problems Elucidate Limitations of In-Context Learning [9.75748930802634]
We study the power of in-context learning to improve machine learning performance. We consider an extremely simple domain: next-token prediction on simple language tasks. We find that LLMs uniformly underperform n-gram models on this task.
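For illustration, a minimal bigram baseline of the kind referenced above can be written in a few lines; the actual tasks and n-gram orders used in the paper are not specified here.

```python
# Toy bigram model: count token bigrams in a training sequence and predict the
# most frequent successor of the current token. Purely illustrative.
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Map each token to a Counter of its observed successors."""
    succ = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        succ[prev][nxt] += 1
    return succ

def predict_next(succ, token, default="<unk>"):
    """Predict the most frequent successor seen in training."""
    return succ[token].most_common(1)[0][0] if succ[token] else default

corpus = "a b a b a c a b".split()
model = train_bigram(corpus)
print(predict_next(model, "a"))  # -> "b" (seen 3 times vs. "c" once)
```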
arXiv Detail & Related papers (2025-01-06T07:57:51Z) - Dialectal Toxicity Detection: Evaluating LLM-as-a-Judge Consistency Across Language Varieties [23.777874316083984]
There has been little systematic study on how dialectal differences affect toxicity detection by modern LLMs.
We create a multi-dialect dataset through synthetic transformations and human-assisted translations, covering 10 language clusters and 60 varieties.
We then evaluate three LLMs on their ability to assess toxicity, measuring multilingual, dialectal, and LLM-human consistency.
arXiv Detail & Related papers (2024-11-17T03:53:24Z) - The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT) inspired by restricting embedding entries to the language of interest to bolster time and memory efficiency.
We apply two language heuristics to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - to different language families and sizes.
We find that VT reduces the memory usage of small models by nearly 50% and yields up to a 25% improvement in generation speed.
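A minimal sketch of the Unicode-based script filtering heuristic, assuming a plain token-to-id vocabulary dict; the paper's exact filtering rules may differ.

```python
# Keep only vocabulary entries whose alphabetic characters belong to the target
# script, so the embedding matrix can be trimmed to the language of interest.
import unicodedata

def in_script(token: str, script_prefix: str = "LATIN") -> bool:
    """True if every alphabetic character's Unicode name starts with the script prefix."""
    letters = [ch for ch in token if ch.isalpha()]
    return all(unicodedata.name(ch, "").startswith(script_prefix) for ch in letters)

def trim_vocab(vocab: dict, script_prefix: str = "LATIN") -> dict:
    """Keep special tokens (here marked with '<') and tokens written in the target script."""
    return {
        tok: idx
        for tok, idx in vocab.items()
        if tok.startswith("<") or in_script(tok, script_prefix)
    }

toy_vocab = {"<pad>": 0, "Haus": 1, "дом": 2, "maison": 3}
print(trim_vocab(toy_vocab))  # keeps "<pad>", "Haus", "maison"; drops the Cyrillic entry
```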
arXiv Detail & Related papers (2023-11-16T09:35:50Z) - Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models [67.19567060894563]
Pretrained Language Models (PLMs) learn rich cross-lingual knowledge and can be finetuned to perform well on diverse tasks.
We present a new study investigating how well PLMs capture cross-lingual word sense with Contextual Word-Level Translation (C-WLT).
We find that as the model size increases, PLMs encode more cross-lingual word sense knowledge and better use context to improve WLT performance.
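A hedged sketch of what a C-WLT-style query could look like; the prompt template and the `call_plm` helper are assumptions, not the paper's exact setup.

```python
# Contextual word-level translation: show the model a sentence and ask it to
# translate one word as used in that context.

C_WLT_TEMPLATE = (
    "Sentence: {sentence}\n"
    'Translate the word "{word}" as used in this sentence into {target_lang}. '
    "Answer with a single word."
)

def call_plm(prompt: str) -> str:
    """Placeholder for a pretrained LM's generate/chat call."""
    raise NotImplementedError

def contextual_word_translation(sentence: str, word: str, target_lang: str) -> str:
    prompt = C_WLT_TEMPLATE.format(sentence=sentence, word=word, target_lang=target_lang)
    return call_plm(prompt).strip()

# Example: the ambiguous word "bank" should translate differently by context, e.g.
# contextual_word_translation("She sat on the bank of the river.", "bank", "German")
# should ideally yield "Ufer" rather than "Bank".
```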
arXiv Detail & Related papers (2023-04-26T19:55:52Z) - Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z) - Does Transliteration Help Multilingual Language Modeling? [0.0]
We empirically measure the effect of transliteration on Multilingual Language Models.
We focus on the Indic languages, which have the highest script diversity in the world.
We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages.
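As a simplified illustration of mapping related scripts onto a shared one (not necessarily the paper's method), many Indic Unicode blocks are laid out in parallel, so a fixed codepoint offset can approximate transliteration between them:

```python
# Dependency-free toy transliteration: shift Bengali codepoints into the
# parallel Devanagari block so related languages share a script before
# tokenization. Real pipelines use proper transliteration libraries.

BENGALI_START, BENGALI_END = 0x0980, 0x09FF
OFFSET_TO_DEVANAGARI = -0x80  # the Devanagari block starts at U+0900

def bengali_to_devanagari(text: str) -> str:
    """Shift Bengali codepoints into the Devanagari block; leave other characters as-is."""
    return "".join(
        chr(ord(ch) + OFFSET_TO_DEVANAGARI) if BENGALI_START <= ord(ch) <= BENGALI_END else ch
        for ch in text
    )

print(bengali_to_devanagari("বাংলা"))  # the Bengali word rendered with Devanagari codepoints
```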
arXiv Detail & Related papers (2022-01-29T05:48:42Z) - Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM that operates on linguistic units such as syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
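A minimal PyTorch sketch of an LSTM language model over discrete sub-word units such as phonemes or syllables; the layer sizes and unit inventory are illustrative, not the paper's configuration.

```python
# Toy LSTM language model over discrete sub-word units (phoneme/syllable ids).
import torch
import torch.nn as nn

class UnitLSTMLM(nn.Module):
    def __init__(self, n_units: int, emb_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(n_units, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_units)

    def forward(self, unit_ids: torch.Tensor) -> torch.Tensor:
        """unit_ids: (batch, seq_len) -> logits over the next unit at each step."""
        h, _ = self.lstm(self.embed(unit_ids))
        return self.out(h)

# Example: score a toy unit sequence (ids are arbitrary here).
model = UnitLSTMLM(n_units=50)
logits = model(torch.randint(0, 50, (2, 10)))  # shape (2, 10, 50)
```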