MMATH: A Multilingual Benchmark for Mathematical Reasoning
- URL: http://arxiv.org/abs/2505.19126v1
- Date: Sun, 25 May 2025 12:47:39 GMT
- Title: MMATH: A Multilingual Benchmark for Mathematical Reasoning
- Authors: Wenyang Luo, Wayne Xin Zhao, Jing Sha, Shijin Wang, Ji-Rong Wen
- Abstract summary: We introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. We observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models.
- Score: 94.05289799605957
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advent of large reasoning models, such as OpenAI o1 and DeepSeek R1, has significantly advanced complex reasoning tasks. However, their capabilities in multilingual complex reasoning remain underexplored, with existing efforts largely focused on simpler tasks like MGSM. To address this gap, we introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. Using MMATH, we observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. To address this, we explore strategies including prompting and training, demonstrating that reasoning in English and answering in target languages can simultaneously enhance performance and preserve target-language consistency. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models. Our code and data can be found at https://github.com/RUCAIBox/MMATH.
Related papers
- Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning [38.52080213211765]
We introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark with annotated reasoning traces in five languages. We propose BRIDGE, a novel training method that guides supervised fine-tuning and test-time reinforcement learning. Our results show that BRIDGE significantly enhances multilingual reasoning fidelity.
arXiv Detail & Related papers (2025-07-07T19:04:36Z)
- Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models [44.94287386776289]
We identify Cross-lingual Collapse, a systematic drift in which a multilingual language model reverts to its dominant pre-training language. Our experiments reveal three key findings: (i) GRPO rapidly amplifies pre-training language imbalances, leading to the erosion of low-resource languages within just a few hundred updates; (ii) a language consistency reward mitigates this drift, but at the expense of an almost 5-10 pp drop in accuracy.
arXiv Detail & Related papers (2025-06-06T08:08:48Z)
- Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models? [59.970391602080205]
Despite multilingual training, LRMs tend to default to reasoning in high-resource languages at test time. Cultural reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior.
arXiv Detail & Related papers (2025-05-23T02:46:18Z)
- Multilingual Retrieval-Augmented Generation for Knowledge-Intensive Task [73.35882908048423]
Retrieval-augmented generation (RAG) has become a cornerstone of contemporary NLP. This paper investigates the effectiveness of RAG across multiple languages by proposing novel approaches for multilingual open-domain question-answering.
arXiv Detail & Related papers (2025-04-04T17:35:43Z)
- Demystifying Multilingual Chain-of-Thought in Process Reward Modeling [71.12193680015622]
We tackle the challenge of extending process reward models (PRMs) to multilingual settings. We train multilingual PRMs on a dataset spanning seven languages, which is translated from English. Our results highlight the sensitivity of multilingual PRMs to both the number of training languages and the volume of English data.
arXiv Detail & Related papers (2025-02-18T09:11:44Z)
- mCoT: Multilingual Instruction Tuning for Reasoning Consistency in Language Models [21.616940026409818]
Large language models (LLMs) with Chain-of-thought (CoT) have recently emerged as a powerful technique for eliciting reasoning to improve downstream tasks.
We study multilingual reasoning consistency across multiple languages, using popular open-source LLMs.
We introduce multilingual CoT instruction tuning to boost reasoning capability across languages, thereby improving model consistency.
arXiv Detail & Related papers (2024-06-04T13:30:45Z)
- Eliciting Better Multilingual Structured Reasoning from LLMs through Code [17.870002864331322]
We introduce a multilingual structured reasoning and explanation dataset, termed xSTREET, that covers four tasks across six languages.
xSTREET exposes a gap in base LLM performance between English and non-English reasoning tasks.
We propose two methods to remedy this gap, building on the insight that LLMs trained on code are better reasoners.
arXiv Detail & Related papers (2024-03-05T00:48:56Z)
- Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.