Related papers: A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages

URL: http://arxiv.org/abs/2510.09555v1
Date: Fri, 10 Oct 2025 17:06:50 GMT
Title: A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages
Authors: Raoyuan Zhao, Yihong Liu, Hinrich Schütze, Michael A. Hedderich
Abstract summary: We present the first comprehensive study of multilingual Chain-of-Thought (CoT) reasoning. We measure language compliance, answer accuracy, and answer consistency when LRMs are prompt-hacked to think in a target language. We find that the quality and effectiveness of thinking traces vary substantially depending on the prompt language.
Score: 48.68444770923683
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large reasoning models (LRMs) increasingly rely on step-by-step Chain-of-Thought (CoT) reasoning to improve task performance, particularly in high-resource languages such as English. While recent work has examined final-answer accuracy in multilingual settings, the thinking traces themselves, i.e., the intermediate steps that lead to the final answer, remain underexplored. In this paper, we present the first comprehensive study of multilingual CoT reasoning, evaluating three key dimensions: performance, consistency, and faithfulness. We begin by measuring language compliance, answer accuracy, and answer consistency when LRMs are explicitly instructed or prompt-hacked to think in a target language, revealing strong language preferences and divergent performance across languages. Next, we assess crosslingual consistency of thinking traces by interchanging them between languages. We find that the quality and effectiveness of thinking traces vary substantially depending on the prompt language. Finally, we adapt perturbation-based techniques -- i.e., truncation and error injection -- to probe the faithfulness of thinking traces across languages, showing that models rely on traces to varying degrees. We release our code and data to support future research.
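The abstract names answer consistency across languages and truncation of thinking traces without giving formulas here. The following is only a minimal sketch of how such quantities could be computed, not the authors' implementation: the function names `answer_consistency` and `truncate_trace`, the pairwise-agreement definition, and the newline-based step splitting are all assumptions made for illustration.

```python
from itertools import combinations
from math import floor


def answer_consistency(answers_by_lang):
    """Mean pairwise agreement of final answers across languages.

    answers_by_lang maps a language code to a list of final answers,
    aligned by question index. This pairwise definition is an assumption,
    not necessarily the metric used in the paper.
    """
    langs = sorted(answers_by_lang)
    if len(langs) < 2:
        raise ValueError("need answers in at least two languages")
    n = len(answers_by_lang[langs[0]])
    if n == 0 or any(len(answers_by_lang[lang]) != n for lang in langs):
        raise ValueError("answer lists must be non-empty and equally long")

    rates = []
    for a, b in combinations(langs, 2):
        # Fraction of questions on which the two languages give the same answer.
        agree = sum(x == y for x, y in zip(answers_by_lang[a], answers_by_lang[b]))
        rates.append(agree / n)
    return sum(rates) / len(rates)


def truncate_trace(trace, keep_fraction):
    """Keep only the leading fraction of reasoning steps in a trace.

    Steps are assumed (hypothetically) to be separated by newlines.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    steps = [s for s in trace.split("\n") if s.strip()]
    return "\n".join(steps[: floor(len(steps) * keep_fraction)])


if __name__ == "__main__":
    # Toy, made-up answers for three questions in three languages.
    toy = {"en": ["4", "7", "9"], "de": ["4", "7", "8"], "sw": ["4", "6", "8"]}
    print(answer_consistency(toy))
    print(truncate_trace("step 1\nstep 2\nstep 3\nstep 4", 0.5))
```

In a faithfulness probe of this kind, the truncated trace would be fed back to the model and the change in its final answer observed; a model whose answer does not change may be relying less on the trace.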

Related papers

When Meanings Meet: Investigating the Emergence and Quality of Shared Concept Spaces during Multilingual Language Model Training [57.230355403478995]
We investigate the development of language-agnostic concept spaces during pretraining of EuroLLM. We find that shared concept spaces emerge early and continue to refine, but that alignment with them is language-dependent. In contrast to prior work, our fine-grained manual analysis reveals that some apparent gains in translation quality reflect shifts in behavior.
arXiv Detail & Related papers (2026-01-30T11:23:01Z)
Parallel Scaling Law: Unveiling Reasoning Generalization through A Cross-Linguistic Perspective [52.452449102961225]
This study proposes a novel cross-linguistic perspective to investigate reasoning generalization. Our findings reveal that cross-lingual transferability varies significantly across initial model, target language, and training paradigm. Our study challenges the assumption that LRM reasoning mirrors human cognition, providing critical insights for the development of more language-agnostic LRMs.
arXiv Detail & Related papers (2025-10-02T17:49:49Z)
Long Chain-of-Thought Reasoning Across Languages [14.79632337642471]
We investigate four key stages of model development: scaling, pretraining, post-training, and inference. We find that scaling reasoning model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind. Given the scarcity of high-quality reasoning traces in languages other than English, we explore synthetic data curation approaches for post-training.
arXiv Detail & Related papers (2025-08-20T16:22:51Z)
Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models? [59.970391602080205]
Despite multilingual training, LRMs tend to default to reasoning in high-resource languages at test time. Cultural reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior.
arXiv Detail & Related papers (2025-05-23T02:46:18Z)
Demystifying Multilingual Chain-of-Thought in Process Reward Modeling [86.98098988779809]
We tackle the challenge of extending process reward models (PRMs) to multilingual settings. We train multilingual PRMs on a dataset spanning seven languages, which is translated from English. Our results highlight the sensitivity of multilingual PRMs to both the number of training languages and the volume of English data.
arXiv Detail & Related papers (2025-02-18T09:11:44Z)
AdaMCoT: Rethinking Cross-Lingual Factual Reasoning through Adaptive Multilingual Chain-of-Thought [40.16140566668239]
We introduce AdaMCoT, a framework that enhances multilingual factual reasoning. AdaMCoT dynamically routes thought processes through intermediary "thinking languages" before generating target-language responses. Our evaluation demonstrates substantial improvements in both factual reasoning quality and cross-lingual consistency.
arXiv Detail & Related papers (2025-01-27T15:48:57Z)
Evaluating Knowledge-based Cross-lingual Inconsistency in Large Language Models [16.942897938964638]
Large Language Models (LLMs) have shown exceptional performance in various Natural Language Processing (NLP) tasks. Despite their successes, these models often exhibit significant inconsistencies when processing the same concepts across different languages. This study focuses on three primary questions: the existence of cross-lingual inconsistencies in LLMs, the specific aspects in which these inconsistencies manifest, and the correlation between cross-lingual consistency and multilingual capabilities of LLMs.
arXiv Detail & Related papers (2024-07-01T15:11:37Z)
Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses [14.784624121891328]
We propose a novel paradigm for evaluating large language models (LLMs). We measure understanding not in terms of correctness but by evaluating consistency across multiple senses that are generated by the model itself. Our approach does not require any static evaluation corpora in languages other than English.
arXiv Detail & Related papers (2023-05-19T13:23:51Z)
Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process. We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks. Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z)
AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context. It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts. Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.