MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning
- URL: http://arxiv.org/abs/2510.14305v1
- Date: Thu, 16 Oct 2025 04:59:52 GMT
- Title: MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning
- Authors: Mahbub E Sobhani, Md. Faiyaz Abdullah Sayeedi, Tasnim Mohiuddin, Md Mofijul Islam, Swakkhar Shatabda,
- Abstract summary: We introduce MathMist, a parallel multilingual benchmark for mathematical problem solving and reasoning.<n>MathMist encompasses over 21K aligned question-answer pairs across seven languages.<n>We systematically evaluate a diverse suite of models, including open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models.
- Score: 6.8892368960722346
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Mathematical reasoning remains one of the most challenging domains for large language models (LLMs), requiring not only linguistic understanding but also structured logical deduction and numerical precision. While recent LLMs demonstrate strong general-purpose reasoning abilities, their mathematical competence across diverse languages remains underexplored. Existing benchmarks primarily focus on English or a narrow subset of high-resource languages, leaving significant gaps in assessing multilingual and cross-lingual mathematical reasoning. To address this, we introduce MathMist, a parallel multilingual benchmark for mathematical problem solving and reasoning. MathMist encompasses over 21K aligned question-answer pairs across seven languages, representing a balanced coverage of high-, medium-, and low-resource linguistic settings. The dataset captures linguistic variety, multiple types of problem settings, and solution synthesizing capabilities. We systematically evaluate a diverse suite of models, including open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models, under zero-shot, chain-of-thought (CoT), and code-switched reasoning paradigms. Our results reveal persistent deficiencies in LLMs' ability to perform consistent and interpretable mathematical reasoning across languages, with pronounced degradation in low-resource settings. All the codes and data are available at GitHub: https://github.com/mahbubhimel/MathMist
Related papers
- Beyond Translation: Evaluating Mathematical Reasoning Capabilities of LLMs in Sinhala and Tamil [1.0499611180329804]
Large language models (LLMs) demonstrate strong mathematical reasoning in English.<n>But whether these capabilities reflect genuine multilingual reasoning or reliance on translation-based processing in low-resource languages like Sinhala and Tamil remains unclear.<n>We evaluate four prominent large language models using a taxonomy of six math problem types.
arXiv Detail & Related papers (2026-02-16T07:08:37Z) - Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning [39.03934159726098]
M2A is a novel method that combines multi-scale multilingual alignment with language-consistency rewards on machine-translated questions.<n>We introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark together with reasoning traces in five languages.<n>Our results show that M2A significantly enhances multilingual reasoning fidelity in both mathematical and factual reasoning tasks.
arXiv Detail & Related papers (2025-07-07T19:04:36Z) - MMATH: A Multilingual Benchmark for Mathematical Reasoning [94.05289799605957]
We introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages.<n>We observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue-generating responses in unintended languages.<n>Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models.
arXiv Detail & Related papers (2025-05-25T12:47:39Z) - Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models? [59.970391602080205]
Despite multilingual training, LRMs tend to default to reasoning in high-resource languages at test time.<n>Cultural reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior.
arXiv Detail & Related papers (2025-05-23T02:46:18Z) - Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems [36.45822358679126]
We introduce Spoken Math Question Answering (Spoken-MQA), a new benchmark designed to evaluate the mathematical reasoning capabilities of speech-based models.<n>Spoken-MQA covers a diverse set of math problems, including pure arithmetic, single-step and multi-step contextual reasoning, and knowledge-oriented reasoning problems.<n>We find that while some speech LLMs perform competitively on contextual reasoning tasks involving basic arithmetic, they still struggle with direct arithmetic problems.
arXiv Detail & Related papers (2025-05-21T01:07:00Z) - PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts [79.84059473102778]
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels.<n>Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation.
arXiv Detail & Related papers (2025-04-25T15:39:04Z) - Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs.<n>OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z) - RoMath: A Mathematical Reasoning Benchmark in Romanian [7.7559527224629266]
This paper introduces RoMath, a Romanian mathematical reasoning benchmark suite comprising three subsets: Baccalaureate, Competitions and Synthetic.<n>By focusing on Romanian, a low-resource language with unique linguistic features, RoMath addresses the limitations of Anglo-centric models.<n>We benchmark several open-weight language models, highlighting the importance of creating resources for underrepresented languages.
arXiv Detail & Related papers (2024-09-17T11:03:46Z) - GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently occurring evidence is that when the math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z) - Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs.
We construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.