The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure
- URL: http://arxiv.org/abs/2506.22724v1
- Date: Sat, 28 Jun 2025 02:09:21 GMT
- Title: The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure
- Authors: Niyati Bafna, Tianjian Li, Kenton Murray, David R. Mortensen, David Yarowsky, Hale Sirin, Daniel Khashabi
- Abstract summary: We demonstrate the existence of an implicit task-solving → translation pipeline for generation. We test this hypothesis for a word translation task across 108 language pairs, and find that a significant portion of overall failures indeed stems from translation failure.
- Score: 24.960044405878286
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multilingual generation with large language models (LLMs) is often of poor quality for mid- to low-resource languages. Building on insights from interpretability, we demonstrate the existence of an implicit task-solving → translation pipeline for generation, whereby the model first solves the required task in a largely target-language-agnostic manner, and subsequently translates answer concepts into the intended target language. We hypothesize that the failure of the translation stage is an important culprit for the observed low quality of final outputs, and formalize this as the translation barrier hypothesis. We test this hypothesis for a word translation task across 108 language pairs, using logit lens to observe model processing in intermediate layers. We find that a significant portion of overall failures indeed stems from translation failure, or the model's inability to translate correctly solved intermediate concepts into the target language. This is especially true for low-resource target languages. Our results highlight an important hurdle for end-to-end multilingual generation, and lend guiding insights for future work seeking to improve multilinguality in LLMs.
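The logit-lens technique mentioned in the abstract can be sketched as follows. This is a minimal toy illustration of the idea, not the paper's code: it assumes you already have per-layer hidden states for the final token position and the model's unembedding matrix, and it uses simple L2 normalization as a stand-in for the model's final layer norm.

```python
import numpy as np

def logit_lens(hidden_states, W_U):
    """For each layer's hidden state (last-token position), project it
    through the unembedding matrix W_U and return the top-1 vocab index.
    A real implementation would apply the model's final layer norm first;
    L2 normalization stands in for it here."""
    picks = []
    for h in hidden_states:
        h = np.asarray(h, dtype=float)
        h = h / np.linalg.norm(h)  # stand-in for the final layer norm
        picks.append(int(np.argmax(h @ W_U)))
    return picks

# Toy vocabulary: index 0 = an English concept token, index 1 = its
# target-language translation. A mid-layer state favoring index 0 followed
# by a final-layer state favoring index 1 mirrors the task-solve-then-
# translate pipeline described above; a final layer still stuck on index 0
# would be a "translation failure" in the paper's terms.
layers = [[1.0, 0.1],   # middle layer: English concept dominates
          [0.2, 1.0]]   # final layer: target-language token dominates
print(logit_lens(layers, np.eye(2)))  # [0, 1]
```

Reading off the top token at every layer in this way is what lets one distinguish "the model never solved the task" from "the model solved it in an English-like latent space but failed to translate the answer."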
Related papers
- Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models [44.94287386776289]
We identify Cross-lingual Collapse, a systematic drift in which a multilingual language model reverts to its dominant pre-training language. Among the key findings: (i) GRPO rapidly amplifies pre-training language imbalances, leading to the erosion of low-resource languages within just a few hundred updates; (ii) a language-consistency reward mitigates this drift, but at the expense of an almost 5-10 pp drop in accuracy.
arXiv Detail & Related papers (2025-06-06T08:08:48Z)
- Paths Not Taken: Understanding and Mending the Multilingual Factual Recall Pipeline [36.2731426595852]
We find that multilingual large language models (LLMs) exhibit significantly better performance on factual recall tasks in English than in other languages. We identify two primary sources of error: insufficient engagement of the reliable English-centric mechanism for factual recall, and incorrect translation from English back into the target language. Our interventions combined increase recall accuracy by over 35 percent for the lowest-performing language.
arXiv Detail & Related papers (2025-05-26T22:20:45Z)
- Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models [55.14276067678253]
This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in Large Language Models (LLMs). Using this methodology, we construct a new dataset of over 6,000 bilingual pairs across 16 languages, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. Further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns.
arXiv Detail & Related papers (2025-05-24T12:31:27Z)
- Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models [49.09746599881631]
We present the first mechanistic interpretability study of language confusion. We show that confusion points (CPs) are central to this phenomenon, and that editing a small set of critical neurons, identified via comparative analysis with multilingual-tuned models, substantially mitigates confusion.
arXiv Detail & Related papers (2025-05-22T11:29:17Z)
- Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models [49.16690802656554]
We find that multilingual models struggle to provide consistent factual responses to semantically equivalent prompts in different languages. We propose a linear shortcut method that bypasses computations in the final layers, enhancing both prediction accuracy and cross-lingual consistency.
arXiv Detail & Related papers (2025-04-05T19:43:10Z)
- Scaling Test-time Compute for Low-resource Languages: Multilingual Reasoning in LLMs [3.9530780161144667]
We investigate the mechanism by which large language models internally operate in a latent space biased toward their dominant language. Given input in a low-resource language, we train models to generate the chain-of-thought (CoT) in English while outputting the final response in the target language. Our experiments demonstrate that this approach, named English-Pivoted CoT Training, outperforms other baselines, with up to a 28.33% improvement.
arXiv Detail & Related papers (2025-04-02T16:58:36Z)
- Transcending Language Boundaries: Harnessing LLMs for Low-Resource Language Translation [38.81102126876936]
This paper introduces a novel retrieval-based method that enhances translation quality for low-resource languages by focusing on key terms.
To evaluate the effectiveness of this method, we conducted experiments translating from English into three low-resource languages: Cherokee, a critically endangered indigenous language of North America; Tibetan, a historically and culturally significant language in Asia; and Manchu, a language with few remaining speakers.
Our comparison with the zero-shot performance of GPT-4o and LLaMA 3.1 405B highlights the significant challenges these models face when translating into low-resource languages.
arXiv Detail & Related papers (2024-11-18T05:41:27Z)
- The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experimental results show that it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze the representation space, generated responses, and data scales, revealing how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
- Fine-Tuning Large Language Models to Translate: Will a Touch of Noisy Data in Misaligned Languages Suffice? [33.376648335299116]
Large language models (LLMs) display strong translation capability after being fine-tuned on as few as 32 parallel sentences.
Fine-tuning on data with only English on the target side can lead to task misinterpretation, which hinders translation into non-English languages. Synthesized data in an under-represented language has a less pronounced effect.
arXiv Detail & Related papers (2024-04-22T12:21:12Z)
- Zero-shot Cross-lingual Transfer is Under-specified Optimization [49.3779328255767]
We show that any linearly interpolated model between the source-language monolingual model and the source + target bilingual model has equally low source-language generalization error.
We also show that the zero-shot solution lies in a non-flat region of the target-language generalization-error surface, causing high variance.
arXiv Detail & Related papers (2022-07-12T16:49:28Z)
- Delving Deeper into Cross-lingual Visual Question Answering [115.16614806717341]
We show that simple modifications to the standard training setup can substantially reduce the transfer gap to monolingual English performance.
We analyze cross-lingual VQA across different question types of varying complexity for different multilingual multimodal Transformers.
arXiv Detail & Related papers (2022-02-15T18:22:18Z)
- Modelling Latent Translations for Cross-Lingual Transfer [47.61502999819699]
We propose a new technique that integrates both steps of the traditional pipeline (translation and classification) into a single model.
We evaluate our novel latent translation-based model on a series of multilingual NLU tasks.
We report gains for both zero-shot and few-shot learning setups, up to 2.7 accuracy points on average.
arXiv Detail & Related papers (2021-07-23T17:11:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.