Related papers: Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap

Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap

URL: http://arxiv.org/abs/2501.02448v1
Date: Sun, 05 Jan 2025 05:57:22 GMT
Title: Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap
Authors: Hyunwoo Ko, Guijin Son, Dasol Choi,
Abstract summary: Large language models (LLMs) demonstrate exceptional performance on complex reasoning tasks.<n>Despite strong reasoning capabilities in high-resource languages, a significant performance gap persists in other languages.<n>We propose UST (Understand, Solve, and Translate), a method that strategically uses English as an anchor for reasoning and solution generation.
Score: 0.0
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large language models (LLMs) demonstrate exceptional performance on complex reasoning tasks. However, despite their strong reasoning capabilities in high-resource languages (e.g., English and Chinese), a significant performance gap persists in other languages. To investigate this gap in Korean, we introduce HRM8K, a benchmark comprising 8,011 English-Korean parallel bilingual math problems. Through systematic analysis of model behaviors, we identify a key finding: these performance disparities stem primarily from difficulties in comprehending non-English inputs, rather than limitations in reasoning capabilities. Based on these findings, we propose UST (Understand, Solve, and Translate), a method that strategically uses English as an anchor for reasoning and solution generation. By fine-tuning the model on 130k synthetically generated data points, UST achieves a 10.91% improvement on the HRM8K benchmark and reduces the multilingual performance gap from 11.6% to 0.7%. Additionally, we show that improvements from UST generalize effectively to different Korean domains, demonstrating that capabilities acquired from machine-verifiable content can be generalized to other areas. We publicly release the benchmark, training dataset, and models.

Related papers

Translation as a Scalable Proxy for Multilingual Evaluation [41.29816736489213]
We evaluate the validity of a simpler alternative: can translation quality alone indicate a model's broader multilingual capabilities?<n>We find that translation performance is a good indicator of downstream task success.
arXiv Detail & Related papers (2026-01-16T21:01:40Z)
Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought [23.847410628315544]
We introduce **Language-Mixed CoT**, a reasoning schema that switches between English and a target language.<n>We train ninve models (4B-35B) across six families (Qwen2.5, Llama-3.1, Gemma-3, etc)<n>Our best model, **KO-REAson-35B**, achieves state-of-the-art performance, with the highest overall average score (64.0 pm 25)
arXiv Detail & Related papers (2025-10-05T14:39:41Z)
Evaluating LLMs' Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis [0.0]
Bengali is an underrepresented language in NLP research.<n>We systematically investigate the challenges that hinder Bengali NLP performance.<n>Our findings reveal consistent performance gaps for Bengali compared to English.
arXiv Detail & Related papers (2025-07-31T05:16:43Z)
MMATH: A Multilingual Benchmark for Mathematical Reasoning [94.05289799605957]
We introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages.<n>We observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue-generating responses in unintended languages.<n>Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models.
arXiv Detail & Related papers (2025-05-25T12:47:39Z)
Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models [55.14276067678253]
This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in Large Language Models (LLMs)<n>We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models.<n>Further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns.
arXiv Detail & Related papers (2025-05-24T12:31:27Z)
KoBALT: Korean Benchmark For Advanced Linguistic Tasks [0.6971903955510721]
KoBALT (Korean Benchmark for Advanced Linguistic Tasks) is a linguistically-motivated benchmark comprising 700 multiple-choice questions.<n>It is designed to advance the evaluation of large language models (LLMs) in Korean.<n>It introduces a suite of expert-curated, linguistically motivated questions with minimal n-gram overlap with standard Korean corpora.
arXiv Detail & Related papers (2025-05-22T02:03:07Z)
Cross-Lingual Consistency: A Novel Inference Framework for Advancing Reasoning in Large Language Models [10.231866835957538]
Chain-of-thought (CoT) has emerged as a critical mechanism for enhancing reasoning capabilities in large language models (LLMs) We propose the Cross-Lingual Consistency (CLC) framework, which integrates multilingual reasoning paths through majority voting to elevate LLMs' reasoning capabilities. Empirical evaluations on the CMATH dataset reveal CLC's superiority over the conventional self-consistency method.
arXiv Detail & Related papers (2025-04-02T16:09:39Z)
Demystifying Multilingual Chain-of-Thought in Process Reward Modeling [71.12193680015622]
We tackle the challenge of extending process reward models (PRMs) to multilingual settings. We train multilingual PRMs on a dataset spanning seven languages, which is translated from English. Our results highlight the sensitivity of multilingual PRMs to both the number of training languages and the volume of English data.
arXiv Detail & Related papers (2025-02-18T09:11:44Z)
Multi-Step Reasoning in Korean and the Emergent Mirage [0.0]
We introduce HRMCR (HAE-RAE Multi-Step Commonsense Reasoning), a benchmark designed to evaluate large language models' ability to perform multi-step reasoning in culturally specific contexts. The questions are automatically generated via templates and algorithms, requiring LLMs to integrate Korean cultural knowledge into sequential reasoning steps. Our experiments reveal that models trained on fewer than (2 cdot 1025) training FLOPs struggle to solve any questions, showing near-zero performance.
arXiv Detail & Related papers (2025-01-10T05:07:27Z)
SLAM: Towards Efficient Multilingual Reasoning via Selective Language Alignment [78.4550589538805]
We propose an efficient multilingual reasoning alignment approach that precisely identifies and fine-tunes the layers responsible for handling multilingualism. Experimental results show that our method, SLAM, only tunes 6 layers' feed-forward sub-layers including 6.5-8% of all parameters within 7B and 13B LLMs.
arXiv Detail & Related papers (2025-01-07T10:29:43Z)
LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages [8.754506364968394]
The LingOly benchmark is a novel benchmark for advanced reasoning abilities in large language models. We evaluate capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages. We assess performance with both direct accuracy and comparison to a no-context baseline to penalise memorisation.
arXiv Detail & Related papers (2024-06-10T11:50:29Z)
The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance. Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes. We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs. We construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages. By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
Making Large Language Models Better Reasoners with Step-Aware Verifier [49.16750018427259]
DIVERSE (Diverse Verifier on Reasoning Step) is a novel approach that further enhances the reasoning capability of language models. We evaluate DIVERSE on the latest language model code-davinci and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks.
arXiv Detail & Related papers (2022-06-06T03:38:36Z)
Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language. We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models. Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.