Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages
- URL: http://arxiv.org/abs/2512.22712v1
- Date: Sat, 27 Dec 2025 21:55:21 GMT
- Authors: Anaelia Ovalle, Candace Ross, Sebastian Ruder, Adina Williams, Karen Ullrich, Mark Ibrahim, Levent Sagun
- Abstract summary: We analyze 65k reasoning traces from GlobalMMLU questions across 6 languages and 6 frontier models. Reasoning traces in non-Latin scripts show at least twice as much misalignment between their reasoning and conclusions as traces in Latin scripts.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models demonstrate strong reasoning capabilities through chain-of-thought prompting, but whether this reasoning quality transfers across languages remains underexplored. We introduce a human-validated framework to evaluate whether model-generated reasoning traces logically support their conclusions across languages. Analyzing 65k reasoning traces from GlobalMMLU questions across 6 languages and 6 frontier models, we uncover a critical blind spot: while models achieve high task accuracy, their reasoning can fail to support their conclusions. Reasoning traces in non-Latin scripts show at least twice as much misalignment between their reasoning and conclusions as traces in Latin scripts. We develop an error taxonomy through human annotation to characterize these failures, finding they stem primarily from evidential errors (unsupported claims, ambiguous facts), followed by illogical reasoning steps. Our findings demonstrate that current multilingual evaluation practices provide an incomplete picture of model reasoning capabilities and highlight the need for reasoning-aware evaluation frameworks.
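The evaluation the abstract describes reduces to a per-group aggregation: each reasoning trace receives a binary alignment label (does the reasoning logically support the conclusion?), and misalignment rates are compared across script families. A minimal sketch of that aggregation step, assuming hypothetical pre-labeled records (the field names `lang`, `script`, and `aligned` are illustrative, not from the paper's released artifacts):

```python
from collections import defaultdict

# Hypothetical records: each trace carries the question's language, its
# script family, and a judge's verdict on whether the reasoning logically
# supports the final answer (the paper's alignment criterion).
traces = [
    {"lang": "en", "script": "Latin", "aligned": True},
    {"lang": "es", "script": "Latin", "aligned": True},
    {"lang": "en", "script": "Latin", "aligned": True},
    {"lang": "hi", "script": "non-Latin", "aligned": False},
    {"lang": "hi", "script": "non-Latin", "aligned": True},
    {"lang": "zh", "script": "non-Latin", "aligned": False},
]

def misalignment_rate(traces, key="script"):
    """Fraction of traces whose reasoning fails to support the answer, per group."""
    totals, misaligned = defaultdict(int), defaultdict(int)
    for t in traces:
        totals[t[key]] += 1
        misaligned[t[key]] += not t["aligned"]
    return {group: misaligned[group] / totals[group] for group in totals}

print(misalignment_rate(traces))
```

Grouping by `key="lang"` instead would reproduce a per-language breakdown; the actual paper additionally relies on human-validated judgments rather than a single automated labeler.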
Related papers
- Align to the Pivot: Dual Alignment with Self-Feedback for Multilingual Math Reasoning [71.4175109189942]
We present Pivot-Aligned Self-Feedback Multilingual Reasoning (PASMR). This approach designates the model's primary language as the pivot language. It establishes a cross-lingual self-feedback mechanism without relying on external correct answers or reward models.
arXiv Detail & Related papers (2026-01-25T03:20:00Z) - Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners [48.68444770923683]
Large reasoning models (LRMs) achieve strong performance on mathematical reasoning tasks. LRMs often arrive at the correct answer before completing these textual reasoning steps. This phenomenon has been explored in English, but its multilingual behavior remains largely unknown.
arXiv Detail & Related papers (2026-01-06T13:20:17Z) - Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models? [59.970391602080205]
Despite multilingual training, LRMs tend to default to reasoning in high-resource languages at test time. Cultural reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior.
arXiv Detail & Related papers (2025-05-23T02:46:18Z) - LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation [1.2576388595811496]
We introduce LINGOLY-TOO, a challenging reasoning benchmark grounded in natural language. We permute reasoning problems written in real languages to generate numerous question variations. Experiments and analyses show that models can circumvent reasoning and answer from prior knowledge.
arXiv Detail & Related papers (2025-03-04T19:57:47Z) - Reasoning Elicitation in Language Models via Counterfactual Feedback [17.908819732623716]
We derive novel metrics that balance accuracy in factual and counterfactual questions. We propose several fine-tuning approaches that aim to elicit better reasoning mechanisms. We evaluate the performance of the fine-tuned language models in a variety of realistic scenarios.
arXiv Detail & Related papers (2024-10-02T15:33:30Z) - Conceptual and Unbiased Reasoning in Language Models [98.90677711523645]
We propose a novel conceptualization framework that forces models to perform conceptual reasoning on abstract questions.
We show that existing large language models fall short on conceptual reasoning, dropping 9% to 28% on various benchmarks.
We then discuss how models can improve since high-level abstract reasoning is key to unbiased and generalizable decision-making.
arXiv Detail & Related papers (2024-03-30T00:53:53Z) - A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z) - ALERT: Adapting Language Models to Reasoning Tasks [43.8679673685468]
ALERT is a benchmark and suite of analyses for assessing language models' reasoning ability.
ALERT provides a test bed to assess any language model on fine-grained reasoning skills.
We find that language models learn more reasoning skills during the finetuning stage than during pretraining.
arXiv Detail & Related papers (2022-12-16T05:15:41Z) - Chain of Thought Prompting Elicits Reasoning in Large Language Models [56.811278668446825]
This paper explores the ability of language models to generate a coherent chain of thought.
Experiments show that inducing a chain of thought via prompting can enable sufficiently large language models to better perform reasoning tasks.
arXiv Detail & Related papers (2022-01-28T02:33:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.