Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought
- URL: http://arxiv.org/abs/2510.04230v1
- Date: Sun, 05 Oct 2025 14:39:41 GMT
- Title: Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought
- Authors: Guijin Son, Donghun Yang, Hitesh Laxmichand Patel, Amit Agarwal, Hyunwoo Ko, Chanuk Lim, Srikant Panda, Minhyuk Kim, Nikunj Drolia, Dasol Choi, Kyong-Ha Lee, Youngjae Yu
- Abstract summary: We introduce **Language-Mixed CoT**, a reasoning schema that switches between English and a target language. We train nine models (4B-35B) across six families (Qwen2.5, Llama-3.1, Gemma-3, etc.). Our best model, **KO-REAson-35B**, achieves state-of-the-art performance, with the highest overall average score (64.0 ± 25).
- Score: 23.847410628315544
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent frontier models employ long chain-of-thought reasoning to explore solution spaces in context and achieve stronger performance. While many works study distillation to build smaller yet capable models, most focus on English, and little is known about language-specific reasoning. To bridge this gap, we first introduce **Language-Mixed CoT**, a reasoning schema that switches between English and a target language, using English as an anchor to excel in reasoning while minimizing translation artifacts. As a Korean case study, we curate **Yi-Sang**: 5.79M native-Korean prompts from web Q&A, exams, STEM, and code; 3.7M long reasoning traces generated from Qwen3-32B; and a targeted 260k high-yield subset. We train nine models (4B-35B) across six families (Qwen2.5, Llama-3.1, Gemma-3, etc.). Our best model, **KO-REAson-35B**, achieves state-of-the-art performance with the highest overall average score (64.0 ± 25), ranking first on 5/9 benchmarks and second on the remainder. Smaller and mid-sized models also benefit substantially, with an average improvement of +18.6 points across the nine evaluated benchmarks. Ablations show that **Language-Mixed CoT** is more effective than monolingual CoT, also yielding cross-lingual and multi-modal performance gains. We release our data-curation pipeline, evaluation system, datasets, and models to advance research on language-specific reasoning. Data and model collection: https://huggingface.co/KOREAson.
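As a rough illustration of the **Language-Mixed CoT** idea, a trace keeps the question and the final answer in the target language (Korean here) while most of the in-context reasoning stays anchored in English. The sketch below is a minimal, hedged example: the `<think>` tags, the Hangul-ratio heuristic, and the thresholds are illustrative assumptions, not the paper's specification.

```python
import re

# A toy language-mixed trace: English-anchored reasoning with Korean mixed in,
# and a Korean final answer (format is illustrative only).
LANGUAGE_MIXED_TRACE = """<think>
The question asks for the sum of the first 10 positive integers.
Using the formula n(n+1)/2 with n=10 gives 55.
공식 n(n+1)/2 를 적용하면 10*11/2 = 55.
</think>
정답은 55입니다."""

def hangul_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are Hangul syllables."""
    letters = [c for c in text if c.isalpha()]
    hangul = [c for c in letters if '\uac00' <= c <= '\ud7a3']
    return len(hangul) / max(len(letters), 1)

def is_language_mixed(trace: str, lo: float = 0.05, hi: float = 0.95) -> bool:
    """Keep traces that genuinely mix Korean with English reasoning,
    rejecting purely monolingual chains (thresholds are illustrative)."""
    thought = re.search(r"<think>(.*?)</think>", trace, re.S)
    ratio = hangul_ratio(thought.group(1) if thought else trace)
    return lo < ratio < hi

print(is_language_mixed(LANGUAGE_MIXED_TRACE))  # True for this toy trace
```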
Related papers
- mR3: Multilingual Rubric-Agnostic Reward Reasoning Models [16.953894896444403]
We introduce mR3, a massively multilingual, rubric-agnostic reward reasoning model trained on 72 languages.
We present a comprehensive study of data and curriculum selection for training to identify effective strategies and data sources for building high-quality reward models.
Our approach attains state-of-the-art performance on multilingual reward model benchmarks, surpassing much larger models.
arXiv Detail & Related papers (2025-10-01T17:36:59Z) - A Multi-Language Object-Oriented Programming Benchmark for Large Language Models [61.267115598083315]
A survey of 35 existing benchmarks uncovers three major imbalances.
85.7% focus on a single programming language.
94.3% target only function-level or statement-level tasks.
Over 80% include fewer than ten test cases on average.
arXiv Detail & Related papers (2025-09-30T11:30:08Z) - Long Chain-of-Thought Reasoning Across Languages [14.79632337642471]
We investigate four key stages of model development: scaling, pretraining, post-training, and inference.
We find that scaling reasoning model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind.
Given the scarcity of high-quality reasoning traces in languages other than English, we explore synthetic data curation approaches for post-training.
arXiv Detail & Related papers (2025-08-20T16:22:51Z) - LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation [2.9248916859490173]
We introduce a benchmark designed to evaluate state-of-the-art LMMs on a multilingual Visual Question Answering (VQA) task.
Our dataset comprises 6,875 image-text pairs spanning 11 languages and five social attributes.
We evaluate models using three key metrics: Bias, Answer Relevancy, and Faithfulness.
arXiv Detail & Related papers (2025-07-09T20:45:04Z) - IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models [18.083861654053585]
This paper introduces IrokoBench -- a human-translated benchmark dataset for 17 typologically-diverse low-resource African languages.
We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and 6 proprietary language models.
We observe a significant performance gap between open and proprietary models, with the best-performing open model, Gemma 2 27B, reaching only 63% of the performance of the best-performing proprietary model, GPT-4o.
arXiv Detail & Related papers (2024-06-05T15:23:08Z) - Multilingual Sentence-T5: Scalable Sentence Encoders for Multilingual Applications [4.240899165468488]
We introduce Multilingual Sentence T5 (m-ST5) as a larger-scale model for NLI-based multilingual sentence embedding.
By employing the low-rank adaptation (LoRA) technique, we scale the model to 5.7 billion parameters.
Notably, languages with fewer resources or with less linguistic similarity to English benefited more from the parameter increase.
arXiv Detail & Related papers (2024-03-26T09:31:55Z) - Machine Translation for Ge'ez Language [0.0]
Machine translation for low-resource languages such as Ge'ez faces challenges such as out-of-vocabulary words, domain mismatches, and lack of labeled training data.
We develop a multilingual neural machine translation (MNMT) model based on language relatedness.
We also experiment with using GPT-3.5, a state-of-the-art LLM, for few-shot translation with fuzzy matches.
arXiv Detail & Related papers (2023-11-24T14:55:23Z) - Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z) - Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets a new state of the art in few-shot learning on more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z) - From Good to Best: Two-Stage Training for Cross-lingual Machine Reading Comprehension [51.953428342923885]
We develop a two-stage approach to enhance the model performance.
The first stage targets recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer.
The second stage focuses on precision: an answer-aware contrastive learning mechanism is developed to learn the fine differences between the accurate answer and the other candidates (a hedged sketch of such a contrastive objective appears after this list).
arXiv Detail & Related papers (2021-12-09T07:31:15Z) - Understanding by Understanding Not: Modeling Negation in Language Models [81.21351681735973]
Negation is a core construction in natural language.
We propose to augment the language modeling objective with an unlikelihood objective that is based on negated generic sentences.
We reduce the mean top-1 error rate to 4% on the negated LAMA dataset.
arXiv Detail & Related papers (2021-05-07T21:58:35Z) - Harnessing Multilinguality in Unsupervised Machine Translation for Rare Languages [48.28540903568198]
We show that multilinguality is critical to making unsupervised systems practical for low-resource settings.
We present a single model for 5 low-resource languages (Gujarati, Kazakh, Nepali, Sinhala, and Turkish), translating to and from English.
We outperform all current state-of-the-art unsupervised baselines for these languages, achieving gains of up to 14.4 BLEU.
arXiv Detail & Related papers (2020-09-23T15:07:33Z)
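The answer-aware contrastive stage referenced in the two-stage cross-lingual MRC entry above can be illustrated with a small worked example. The sketch below is a hedged assumption: the function name, InfoNCE-style formulation, and temperature are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def answer_aware_contrastive_loss(candidate_scores: torch.Tensor,
                                  gold_index: int,
                                  temperature: float = 0.1) -> torch.Tensor:
    """candidate_scores: (k,) model scores for the top-k candidate answer spans;
    gold_index: position of the accurate answer among those candidates."""
    # Treat the gold span as the positive and the remaining candidates as
    # negatives; an InfoNCE-style objective pushes the gold score above them.
    logits = (candidate_scores / temperature).unsqueeze(0)  # shape (1, k)
    target = torch.tensor([gold_index])                     # shape (1,)
    return F.cross_entropy(logits, target)

# Toy usage: three candidate answer spans, the gold answer is candidate 0.
scores = torch.tensor([2.1, 1.8, 0.4], requires_grad=True)
loss = answer_aware_contrastive_loss(scores, gold_index=0)
loss.backward()
print(float(loss))
```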