Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation
- URL: http://arxiv.org/abs/2510.09599v1
- Date: Fri, 10 Oct 2025 17:57:04 GMT
- Title: Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation
- Authors: Sondos Mahmoud Bsharat, Zhiqiang Shen
- Abstract summary: Large language models (LLMs) have demonstrated impressive reasoning capabilities when provided with chain-of-thought exemplars. In this work, we introduce Prompting Test-Time Scaling (P-TTS), a simple yet effective inference-time data augmentation strategy.
- Score: 43.29267000439331
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated impressive reasoning capabilities when provided with chain-of-thought exemplars, but curating large reasoning datasets remains laborious and resource-intensive. In this work, we introduce Prompting Test-Time Scaling (P-TTS), a simple yet effective inference-time data augmentation strategy for enhancing LLM reasoning through finetuning. Rather than collecting thousands or even millions of examples, P-TTS leverages a small pool of only 90 manually selected reasoning instances and systematically varies exemplar augmentation through principled instruction prompting intensities at test time to synthesize diverse reasoning trajectory contexts. We then finetune Qwen-2.5 models of various sizes on P-TTS data. Across a suite of mathematical reasoning benchmarks (AIME 2024 & 2025, MATH500, and GPQA-Diamond), our P-TTS-7B and 32B models outperform prior competitive baselines such as S1 and S1.1 (1K-shot), achieving absolute accuracy gains of +26.66% and +30.00% on AIME'24 (7B), and +13.34% and +6.67% on AIME'25 (7B); P-TTS-32B yields gains of +23.33% and +16.63% on AIME'24, and +26.63% and +3.33% on AIME'25 (vs. S1 and S1.1, respectively), with comparable or better performance on MATH500 and GPQA-Diamond. We further show that P-TTS enhances zero-shot generalization accuracy on the out-of-domain reasoning benchmarks Gaokao, Kaoyan, OlympiadBench, AMC23, GradeSchoolMath, and Minerva. Our analysis suggests that test-time scaling effectively explores the latent space of reasoning patterns, amplifying LLM problem-solving with minimal annotation overhead, and further unlocking the reasoning potential and capabilities of LLMs. Prompting Test-Time Scaling offers a practical, low-cost way to elicit LLM reasoning in resource-constrained or rapidly evolving domains.
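The core augmentation idea from the abstract can be sketched as crossing a small seed pool of reasoning problems with several instruction prompting intensities, multiplying the number of distinct prompt contexts without new annotation. The intensity phrasings and seed problems below are illustrative assumptions, not the paper's actual templates or data.

```python
# Hedged sketch of P-TTS-style prompt augmentation: each seed problem
# is paired with every prompting intensity, so a pool of S seeds and
# I intensities yields S x I augmented prompts.
SEED_POOL = [
    "If 3x + 5 = 20, what is x?",
    "A train travels 120 km in 1.5 hours. What is its average speed?",
]

# Illustrative intensity levels, from minimal to strongly emphasized
# step-by-step reasoning (hypothetical wording).
INTENSITIES = [
    "Solve the problem.",
    "Think step by step, then solve the problem.",
    "Reason very carefully, verify each step, then solve the problem.",
]

def augment(seed_pool: list[str], intensities: list[str]) -> list[str]:
    """Cross every seed problem with every prompting intensity."""
    return [f"{instr}\n\nProblem: {q}" for q in seed_pool for instr in intensities]

prompts = augment(SEED_POOL, INTENSITIES)
print(len(prompts))  # 2 seeds x 3 intensities = 6 augmented prompts
```

In the paper's pipeline, each augmented prompt would then be answered by a strong model to produce reasoning trajectories used for finetuning; this sketch covers only the prompt-side augmentation.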
Related papers
- Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention [46.18660010248197]
Minimal Test-Time Intervention (MTI) is a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI yields consistent gains across general, coding, and STEM tasks, e.g., a +1.35% average improvement on eight benchmarks for Qwen3-8B-Base and +5% on AIME2024 using Qwen3-32B-Reasoning.
arXiv Detail & Related papers (2025-10-15T17:59:45Z) - Large Language Models Imitate Logical Reasoning, but at what Cost? [0.42970700836450487]
We present a study which evaluates the reasoning capability of frontier Large Language Models over an eighteen month period. We measured the accuracy of three leading models from December 2023, September 2024 and June 2025 on true or false questions. The improvement in performance from 2023 to 2024 can be attributed to hidden Chain of Thought prompting.
arXiv Detail & Related papers (2025-09-16T04:03:42Z) - Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach. We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding. As a model-free approach, STAND can be applied to any existing language model without additional training.
arXiv Detail & Related papers (2025-06-05T07:31:18Z) - First Finish Search: Efficient Test-Time Scaling in Large Language Models [20.62274005080048]
First Finish Search (FFS) is a training-free parallel decoding strategy that launches n independent samples and returns as soon as any one completes. FFS achieves 82.23% accuracy on the AIME datasets, a 15% improvement over DeepSeek-R1's standalone accuracy, nearly matching OpenAI's o4-mini performance.
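The FFS strategy described above is simple enough to sketch directly: launch n independent decoding runs and return the first that finishes. The `generate` function here is a stand-in for a real LLM sampling call, with artificial delays mimicking variable generation lengths.

```python
# Minimal sketch of First Finish Search (FFS): run n samples in
# parallel and return whichever completes first.
import concurrent.futures
import time

def generate(sample_id: int, delay: float) -> str:
    """Placeholder for one independent decoding run; the delay mimics
    a variable-length generation."""
    time.sleep(delay)
    return f"answer-from-sample-{sample_id}"

def first_finish_search(n: int, delays: list[float]) -> str:
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(generate, i, delays[i]) for i in range(n)]
        # Return as soon as any sample finishes; a real system would
        # also cancel the remaining runs to save compute.
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED
        )
        return next(iter(done)).result()

result = first_finish_search(3, [0.3, 0.05, 0.2])
print(result)  # the fastest sample (index 1) wins
```

The intuition behind FFS is that shorter completions often correspond to more direct, confident reasoning, so returning the first finisher trades little accuracy for a large latency win.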
arXiv Detail & Related papers (2025-05-23T17:57:43Z) - Dynamic Early Exit in Reasoning Models [21.30793518631921]
Overthinking in long chain-of-thought (CoT) generation not only slows problem solving but also risks accuracy loss. We propose a simple yet effective method that allows LLMs to self-truncate CoT sequences by exiting early during generation. Our method requires no additional training and can be seamlessly integrated into existing o1-like reasoning LLMs.
arXiv Detail & Related papers (2025-04-22T13:36:53Z) - Sample, Don't Search: Rethinking Test-Time Alignment for Language Models [55.2480439325792]
We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access.
arXiv Detail & Related papers (2025-04-04T00:41:40Z) - Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [108.07030347318624]
We show that scaling with longer Chains of Thought (CoTs) can indeed impair the reasoning performance of Large Language Models (LLMs) in certain domains. We propose a Thinking-Optimal Scaling strategy to teach models to adopt different reasoning efforts for deep thinking. Our self-improvement models built upon Qwen2.5-32B-Instruct outperform other distillation-based 32B o1-like models across various math benchmarks.
arXiv Detail & Related papers (2025-02-25T10:48:05Z) - Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning [8.73181950200897]
We introduce MCLM, a multilingual math benchmark featuring competition-level problems in 55 languages. We test three test-time scaling methods: Outcome Reward Modeling (ORM), Process Reward Modeling (PRM), and Budget Forcing (BF). Our experiments show that using Qwen2.5-1.5B Math with ORM achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2.
arXiv Detail & Related papers (2025-02-24T18:36:15Z) - LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters! [53.84130385074551]
Large reasoning models (LRMs) tackle complex reasoning problems by following long chains of thought (Long CoT). We find that a large language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA). With just 17k long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks.
arXiv Detail & Related papers (2025-02-11T08:48:48Z) - Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling [69.57918638435491]
Test-Time Scaling is an important method for improving the performance of Large Language Models. This paper focuses on a core question: What is the optimal approach to scale test-time computation across different policy models, PRMs, and problem difficulty levels? We show that with our compute-optimal TTS strategy, extremely small policy models can outperform larger models.
arXiv Detail & Related papers (2025-02-10T17:30:23Z) - T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. We present T1 to scale reinforcement learning by encouraging exploration and to understand inference scaling.
arXiv Detail & Related papers (2025-01-20T18:33:33Z) - Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [27.656263126925815]
We study the scaling of inference-time computation in LLMs.
We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt.
arXiv Detail & Related papers (2024-08-06T17:35:05Z) - Hint of Pseudo Code (HoPC): Zero-Shot Step by Step Pseudo Code Reasoning Prompting [28.103214021041097]
This paper introduces a novel Hint of Pseudo Code (HoPC) prompting technique. HoPC provides more powerful zero-shot problem decomposition and semantic code reasoning capabilities than zero-shot CoT.
arXiv Detail & Related papers (2023-05-19T06:30:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.