Related papers: Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models

Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models

URL: http://arxiv.org/abs/2505.16538v2
Date: Wed, 17 Sep 2025 23:03:35 GMT
Title: Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models
Authors: Ercong Nie, Helmut Schmid, Hinrich Schütze,
Abstract summary: We present the first mechanistic interpretability study of language confusion.<n>We show that confusion points (CPs) are central to this phenomenon.<n>We show that editing a small set of critical neurons, identified via comparative analysis with a multilingual-tuned counterpart, substantially mitigates confusion.
Score: 56.61984030508691
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language confusion -- where large language models (LLMs) generate unintended languages against the user's need -- remains a critical challenge, especially for English-centric models. We present the first mechanistic interpretability (MI) study of language confusion, combining behavioral benchmarking with neuron-level analysis. Using the Language Confusion Benchmark (LCB), we show that confusion points (CPs) -- specific positions where language switches occur -- are central to this phenomenon. Through layer-wise analysis with TunedLens and targeted neuron attribution, we reveal that transition failures in the final layers drive confusion. We further demonstrate that editing a small set of critical neurons, identified via comparative analysis with a multilingual-tuned counterpart, substantially mitigates confusion while largely preserving general competence and fluency. Our approach matches multilingual alignment in confusion reduction for many languages and yields cleaner, higher-quality outputs. These findings provide new insights into the internal dynamics of LLMs and highlight neuron-level interventions as a promising direction for robust, interpretable multilingual language modeling. Code and data are available at: https://github.com/ercong21/lang_confusion.

Related papers

When Meanings Meet: Investigating the Emergence and Quality of Shared Concept Spaces during Multilingual Language Model Training [57.230355403478995]
We investigate the development of language-agnostic concept spaces during pretraining of EuroLLM.<n>We find that shared concept spaces emerge early and continue to refine, but that alignment with them is language-dependent.<n>In contrast to prior work, our fine-grained manual analysis reveals that some apparent gains in translation quality reflect shifts in behavior.
arXiv Detail & Related papers (2026-01-30T11:23:01Z)
Revisiting Modality Invariance in a Multilingual Speech-Text Model via Neuron-Level Analysis [15.638379666159127]
We investigate where language and modality information is encoded, how selective neurons causally influence decoding, and how concentrated this influence is across the network.<n>We identify language- and modality-selective neurons using average-precision ranking, investigate their functional role via median-replacement interventions at inference time, and analyze activation-magnitude inequality across languages and modalities.
arXiv Detail & Related papers (2026-01-24T09:22:18Z)
SteerEval: Inference-time Interventions Strengthen Multilingual Generalization in Neural Summarization Metrics [33.30877107523988]
A major empirical bottleneck in this area is the shortage of accurate and robust evaluation metrics for many languages.<n>Recent studies suggest that multilingual language models often use English as an internal pivot language.<n>Motivated by the hypothesis that this mismatch could also apply to multilingual neural metrics, we ask whether steering their activations toward an English pivot can improve correlation with human judgments.
arXiv Detail & Related papers (2026-01-22T09:49:29Z)
Parallel Universes, Parallel Languages: A Comprehensive Study on LLM-based Multilingual Counterfactual Example Generation [49.2073409243885]
Large language models (LLMs) excel at generating English counterfactuals and demonstrate multilingual proficiency.<n>We conduct automatic evaluations on both directly generated counterfactuals in the target languages and those derived via English translation across six languages.<n>We identify and categorize four main types of errors that consistently appear in the generated counterfactuals across languages.
arXiv Detail & Related papers (2026-01-01T08:53:49Z)
Language Surgery in Multilingual Large Language Models [32.77326546076424]
Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across tasks and languages.<n>This paper investigates the naturally emerging representation alignment in LLMs, particularly in the middle layers.<n>We propose Inference-Time Language Control (ITLC) to enable precise cross-lingual language control and mitigate language confusion.
arXiv Detail & Related papers (2025-06-14T11:09:50Z)
The Emergence of Abstract Thought in Large Language Models Beyond Any Language [95.50197866832772]
Large language models (LLMs) function effectively across a diverse range of languages.<n>Preliminary studies observe that the hidden activations of LLMs often resemble English, even when responding to non-English prompts.<n>Recent results show strong multilingual performance, even surpassing English performance on specific tasks in other languages.
arXiv Detail & Related papers (2025-06-11T16:00:54Z)
When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners [111.50503126693444]
We show that language-specific ablation consistently boosts multilingual reasoning performance.<n>Compared to post-training, our training-free ablation achieves comparable or superior results with minimal computational overhead.
arXiv Detail & Related papers (2025-05-21T08:35:05Z)
Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis [5.029635172046762]
Language Confusion is a phenomenon where Large Language Models (LLMs) generate text that is neither in the desired language, nor in a contextually appropriate language.<n>We introduce a novel metric, Language Confusion Entropy, designed to measure and quantify this confusion.
arXiv Detail & Related papers (2024-10-17T05:43:30Z)
Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.<n>We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.<n>We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
Probing the Emergence of Cross-lingual Alignment during LLM Training [10.053333786023089]
Multilingual Large Language Models (LLMs) achieve remarkable levels of zero-shot cross-lingual transfer performance. We study how such cross-lingual alignment emerges during pre-training of LLMs. We observe a high correlation between neuron overlap and downstream performance.
arXiv Detail & Related papers (2024-06-19T05:31:59Z)
Holmes: A Benchmark to Assess the Linguistic Competence of Language Models [59.627729608055006]
We introduce Holmes, a new benchmark designed to assess language models (LMs) linguistic competence. We use computation-based probing to examine LMs' internal representations regarding distinct linguistic phenomena. As a result, we meet recent calls to disentangle LMs' linguistic competence from other cognitive abilities.
arXiv Detail & Related papers (2024-04-29T17:58:36Z)
How do Large Language Models Handle Multilingualism? [81.15060972112563]
This study explores how large language models (LLMs) handle multilingualism. LLMs initially understand the query, converting multilingual inputs into English for task-solving. In the intermediate layers, they employ English for thinking and incorporate multilingual knowledge with self-attention and feed-forward structures.
arXiv Detail & Related papers (2024-02-29T02:55:26Z)
Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process. We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks. Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z)
Demystifying Neural Language Models' Insensitivity to Word-Order [7.72780997900827]
We investigate the insensitivity of natural language models to word-order by quantifying perturbations. We find that neural language models require local ordering more so than the global ordering of tokens.
arXiv Detail & Related papers (2021-07-29T13:34:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.