Do Language Models Reason Across Languages?
- URL: http://arxiv.org/abs/2601.06644v1
- Date: Sat, 10 Jan 2026 17:59:34 GMT
- Title: Do Language Models Reason Across Languages?
- Authors: Yan Meng, Wafaa Mohammed, Christof Monz
- Abstract summary: We find that language models are more sensitive to language variation in answer-span documents than in those providing bridging information. We propose a simple three-stage SUBQ prompting method to guide the multi-step reasoning with sub-questions.
- Score: 19.660512783888016
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world information sources are inherently multilingual, which naturally raises the question of whether language models can synthesize information across languages. In this paper, we introduce a simple two-hop question answering setting, where answering a question requires making inferences over two multilingual documents. We find that language models are more sensitive to language variation in answer-span documents than in those providing bridging information, despite the equal importance of both documents for answering a question. Under a step-by-step sub-question evaluation, we further show that in up to 33% of multilingual cases, models fail to infer the bridging information in the first step yet still answer the overall question correctly. This indicates that reasoning in language models, especially in multilingual settings, does not follow a faithful step-by-step decomposition. Subsequently, we show that the absence of reasoning decomposition leads to a composition failure rate of around 18%, where both sub-questions are answered correctly but the final two-hop question is not. To mitigate this, we propose a simple three-stage SUBQ prompting method to guide the multi-step reasoning with sub-questions, which boosts accuracy from 10.1% to 66.5%.
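The two failure patterns named in the abstract can be made concrete with a small bookkeeping sketch. This is not the authors' evaluation code: the `HopResult` record, the category names, and the choice to normalize by all examples are assumptions made for illustration, and the abstract does not state which denominators the paper uses for its 33% and 18% figures.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class HopResult:
    """Correctness flags for one two-hop example (hypothetical record layout)."""
    hop1_correct: bool     # sub-question 1: bridging information inferred
    hop2_correct: bool     # sub-question 2: answer found given the bridge
    two_hop_correct: bool  # full two-hop question answered correctly


def categorize(r: HopResult) -> str:
    """Assign one example to a category (names are illustrative, not the paper's)."""
    if r.two_hop_correct and not r.hop1_correct:
        # Final answer is right although the bridging step failed.
        return "unfaithful_success"
    if r.hop1_correct and r.hop2_correct and not r.two_hop_correct:
        # Both sub-questions are right, but the composed question is wrong.
        return "composition_failure"
    if r.two_hop_correct:
        # Final answer is right and the bridging step succeeded.
        return "success_with_bridge"
    return "other_failure"


def summarize(results: list[HopResult]) -> dict[str, float]:
    """Fraction of all examples per category (assumed denominator: all examples)."""
    if not results:
        return {}
    counts = Counter(categorize(r) for r in results)
    return {name: n / len(results) for name, n in counts.items()}


# Toy flags invented for illustration; they are not results from the paper.
toy = [
    HopResult(hop1_correct=False, hop2_correct=True, two_hop_correct=True),
    HopResult(hop1_correct=True, hop2_correct=True, two_hop_correct=False),
    HopResult(hop1_correct=True, hop2_correct=True, two_hop_correct=True),
    HopResult(hop1_correct=False, hop2_correct=False, two_hop_correct=False),
]
print(summarize(toy))
```

Keeping the three flags separate per example is what lets the two cases be told apart: an aggregate two-hop accuracy alone cannot show whether a correct answer went through the bridging step.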
Related papers
- Self-Improving Multilingual Long Reasoning via Translation-Reasoning Integrated Training [50.177839592528294]
Long reasoning models often struggle in multilingual settings. We propose TRIT (Translation-Reasoning Integrated Training), a self-improving framework that integrates the training of translation into multilingual reasoning.
arXiv Detail & Related papers (2026-02-05T17:55:09Z)
- Align to the Pivot: Dual Alignment with Self-Feedback for Multilingual Math Reasoning [71.4175109189942]
We present Pivot-Aligned Self-Feedback Multilingual Reasoning (PASMR). This approach designates the model's primary language as the pivot language. It establishes a cross-lingual self-feedback mechanism without relying on external correct answers or reward models.
arXiv Detail & Related papers (2026-01-25T03:20:00Z)
- Linguistic Nepotism: Trading-off Quality for Language Preference in Multilingual RAG [55.258582772528506]
We investigate whether the mixture of different document languages impacts generation and citation in unintended ways. Across eight languages and six open-weight models, we find that models preferentially cite English sources when queries are in English. We find that models sometimes trade off document relevance for language preference, indicating that citation choices are not always driven by informativeness alone.
arXiv Detail & Related papers (2025-09-17T12:58:18Z)
- Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning [39.03934159726098]
M2A is a novel method that combines multi-scale multilingual alignment with language-consistency rewards on machine-translated questions. We introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark together with reasoning traces in five languages. Our results show that M2A significantly enhances multilingual reasoning fidelity in both mathematical and factual reasoning tasks.
arXiv Detail & Related papers (2025-07-07T19:04:36Z)
- CaLMQA: Exploring culturally specific long-form question answering across 23 languages [58.18984409715615]
CaLMQA is a dataset of 51.7K culturally specific questions across 23 different languages. We evaluate factuality, relevance and surface-level quality of LLM-generated long-form answers.
arXiv Detail & Related papers (2024-06-25T17:45:26Z)
- Evaluating and Modeling Attribution for Cross-Lingual Question Answering [80.4807682093432]
This work is the first to study attribution for cross-lingual question answering.
We collect data in 5 languages to assess the attribution level of a state-of-the-art cross-lingual QA system.
We find that a substantial portion of the answers is not attributable to any retrieved passages.
arXiv Detail & Related papers (2023-05-23T17:57:46Z)
- Bridging the Language Gap: Knowledge Injected Multilingual Question Answering [19.768708263635176]
We propose a generalized cross-lingual transfer framework to enhance the model's ability to understand different languages.
Experimental results on the real-world MLQA dataset demonstrate that the proposed method can improve performance by a large margin.
arXiv Detail & Related papers (2023-04-06T15:41:25Z)
- Delving Deeper into Cross-lingual Visual Question Answering [115.16614806717341]
We show that simple modifications to the standard training setup can substantially reduce the transfer gap to monolingual English performance.
We analyze cross-lingual VQA across different question types of varying complexity for different multilingual multimodal Transformers.
arXiv Detail & Related papers (2022-02-15T18:22:18Z)
- Multilingual Answer Sentence Reranking via Automatically Translated Data [97.98885151955467]
We present a study on the design of multilingual Answer Sentence Selection (AS2) models, which are a core component of modern Question Answering (QA) systems.
The main idea is to transfer data created from one resource-rich language, e.g., English, to other languages with fewer resources.
arXiv Detail & Related papers (2021-02-20T03:52:08Z)
- TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages [27.588857710802113]
TyDi QA is a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs.
We present a quantitative analysis of the data quality and example-level qualitative linguistic analyses of observed language phenomena.
arXiv Detail & Related papers (2020-03-10T21:11:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.