Related papers: Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning

URL: http://arxiv.org/abs/2502.17407v2
Date: Fri, 01 Aug 2025 10:09:29 GMT
Title: Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning
Authors: Guijin Son, Jiwoo Hong, Hyunwoo Ko, James Thorne
Abstract summary: We introduce MCLM, a multilingual math benchmark featuring competition-level problems in 55 languages. We test three test-time scaling methods: Outcome Reward Modeling (ORM), Process Reward Modeling (PRM), and Budget Forcing (BF). Our experiments show that using Qwen2.5-1.5B Math with ORM achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2.
Score: 8.73181950200897
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Scaling pre-training compute has proven effective for achieving multilinguality, but does the same hold for test-time scaling? In this work, we introduce MCLM, a multilingual math benchmark featuring competition-level problems in 55 languages. We test three test-time scaling methods, Outcome Reward Modeling (ORM), Process Reward Modeling (PRM), and Budget Forcing (BF), on both Qwen2.5-1.5B Math and MR1-1.5B, a multilingual LLM we trained for extended reasoning. Our experiments show that using Qwen2.5-1.5B Math with ORM achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2. Although "thinking LLMs" have recently garnered significant attention, we find that their performance is comparable to traditional scaling methods like best-of-N once constrained to similar levels of inference FLOPs. Moreover, while BF yields a 20-point improvement on English AIME, it provides only a 1.94-point average gain across other languages (a pattern consistent across the other test-time scaling methods we studied), highlighting that test-time scaling may not generalize as effectively to multilingual tasks. To foster further research, we release MCLM, MR1-1.5B, and evaluation results.
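
As a rough illustration of the best-of-N baseline mentioned in the abstract, the sketch below samples several candidate answers and keeps the one an outcome reward model scores highest. It is a minimal sketch and not the authors' implementation: `best_of_n` and `toy_orm` are hypothetical names, and the scoring rule is a toy stand-in for a trained reward model.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward score (hypothetical helper)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    # max() keeps the first candidate among ties, so the result is deterministic.
    return max(candidates, key=score)


def toy_orm(answer: str) -> float:
    """Toy stand-in for an outcome reward model (assumption, not a real ORM):
    it simply rewards answers that end in "42"."""
    return 1.0 if answer.strip().endswith("42") else 0.0


# In practice the candidates would be N samples drawn from the same LLM.
samples = ["The answer is 41", "The answer is 42", "I am not sure"]
chosen = best_of_n(samples, toy_orm)
print(chosen)
```

In this setup the inference cost grows roughly linearly with the number of candidates, which is why the abstract compares methods at similar levels of inference FLOPs.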

Related papers

Relative Scaling Laws for LLMs [91.73497548097775]
Scaling laws describe how language models improve with additional data, parameters, and compute. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale. These results show that although scaling improves overall performance, it is not a universal equalizer.
arXiv Detail & Related papers (2025-10-28T16:55:22Z)
Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation [43.29267000439331]
Large language models (LLMs) have demonstrated impressive reasoning capabilities when provided with chain-of-thought exemplars. In this work, we introduce Prompting Test-Time Scaling (P-TTS), a simple yet effective inference-time data augmentation strategy.
arXiv Detail & Related papers (2025-10-10T17:57:04Z)
e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs [49.01449646799905]
We show that most existing reasoning models do not extrapolate well. Our recipe e3 produces the best known 1.7B model according to AIME'25 and HMMT'25 scores. Our e3-1.7B model not only attains high pass@1 scores, but also improves pass@k over the base model.
arXiv Detail & Related papers (2025-06-10T17:52:42Z)
T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models [9.674458633565111]
We investigate whether small language models (sLMs) can reliably self-verify their outputs under test-time scaling. We propose Tool-integrated self-verification (T1), which delegates memorization-heavy verification steps to external tools, such as a code interpreter. Our theoretical analysis shows that tool integration reduces memorization demands and improves test-time scaling performance.
arXiv Detail & Related papers (2025-04-07T04:01:17Z)
Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.49074603075032]
Recent studies have shown that making a model spend more time thinking through longer Chain of Thoughts (CoTs) enables it to gain significant improvements in complex reasoning tasks. We explore whether scaling with longer CoTs can indeed impair the reasoning performance of Large Language Models (LLMs) in certain domains.
arXiv Detail & Related papers (2025-02-25T10:48:05Z)
Scaling Test-Time Compute Without Verification or RL is Suboptimal [70.28430200655919]
We show that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. We corroborate our theory empirically on both didactic and math reasoning problems with 3/8B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
arXiv Detail & Related papers (2025-02-17T18:43:24Z)
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? [61.85289698610747]
We study whether o1-like large language models (LLMs) truly possess test-time scaling capabilities. We find that longer CoTs of these o1-like models do not consistently enhance accuracy. We propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics.
arXiv Detail & Related papers (2025-02-17T07:21:11Z)
TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking [6.070192392563392]
We present TituLLMs, the first large pretrained Bangla LLMs, available in 1b and 3b parameter sizes. To train TituLLMs, we collected a pretraining dataset of approximately 37 billion tokens. We extended the Llama-3.2 tokenizer to incorporate language- and culture-specific knowledge.
arXiv Detail & Related papers (2025-02-16T16:22:23Z)
s1: Simple test-time scaling [148.4204982041058]
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. We seek the simplest approach to achieve test-time scaling and strong reasoning performance.
arXiv Detail & Related papers (2025-01-31T18:48:08Z)
Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. We present T1 to scale reinforcement learning by encouraging exploration and understand inference scaling.
arXiv Detail & Related papers (2025-01-20T18:33:33Z)
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [27.656263126925815]
We study the scaling of inference-time computation in LLMs. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt.
arXiv Detail & Related papers (2024-08-06T17:35:05Z)
The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance. Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes. We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z)
Remember what you did so you know what to do next [10.526351131118096]
We create a plan for a simulated robot to achieve 30 classes of goals in ScienceWorld, a text game simulator for elementary science experiments. Our experiments show that performance varies widely across the 30 classes of actions, indicating that averaging over tasks can hide significant performance issues.
arXiv Detail & Related papers (2023-10-30T19:29:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.