Related papers: Scaling Test-Time Compute Without Verification or RL is Suboptimal

Scaling Test-Time Compute Without Verification or RL is Suboptimal

URL: http://arxiv.org/abs/2502.12118v2
Date: Tue, 18 Feb 2025 18:54:12 GMT
Title: Scaling Test-Time Compute Without Verification or RL is Suboptimal
Authors: Amrith Setlur, Nived Rajaraman, Sergey Levine, Aviral Kumar,
Abstract summary: We show that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget.<n>We corroborate our theory empirically on both didactic and math reasoning problems with 3/8B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
Score: 70.28430200655919
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite substantial advances in scaling test-time compute, an ongoing debate in the community is how it should be scaled up to enable continued and efficient improvements with scaling. There are largely two approaches: first, distilling successful search or thinking traces; and second, using verification (e.g., 0/1 outcome rewards, reward models, or verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. Further, we show that as we scale test-time compute (measured as the output token length) and training data, suboptimality of VF methods scales poorly compared to VB when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (e.g., different lengths, styles, etc.) and admits a non-sharp distribution over rewards on traces sampled from it. We formalize this condition using anti-concentration [Erd\H{o}s, 1945]. This implies a stronger result that VB methods scale better asymptotically, with the performance gap between VB and VF methods widening as test-time budget grows. We corroborate our theory empirically on both didactic and math reasoning problems with 3/8/32B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.

Related papers

Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning [60.67176246634741]
We formalize the problem of optimizing test-time compute as a meta-reinforcement learning (RL) problem. We show that state-of-the-art models do not minimize regret, but one can do so by maximizing a dense reward bonus in conjunction with the outcome 0/1 reward RL.
arXiv Detail & Related papers (2025-03-10T17:40:43Z)
S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z)
Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning [32.45574194957491]
We show that training with cross-entropy loss can be misaligned with pass@N in that pass@N accuracy $it decreases$ with longer training.<n>We suggest a principled, modified training loss that is better aligned to pass@N by limiting model confidence and rescuing pass@N test performance.
arXiv Detail & Related papers (2025-02-11T00:33:31Z)
The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
We present the first systematic exploration of optimal sparse pre-training configurations for large language models.<n>We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss.<n>We propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training.
arXiv Detail & Related papers (2025-01-21T20:23:22Z)
Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks.<n>Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling.<n>We present T1 to scale reinforcement learning by encouraging exploration and understand inference scaling.
arXiv Detail & Related papers (2025-01-20T18:33:33Z)
A Bayesian Approach to Data Point Selection [24.98069363998565]
Data point selection (DPS) is becoming a critical topic in deep learning. Existing approaches to DPS are predominantly based on a bi-level optimisation (BLO) formulation. We propose a novel Bayesian approach to DPS.
arXiv Detail & Related papers (2024-11-06T09:04:13Z)
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [27.656263126925815]
We study the scaling of inference-time computation in LLMs. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt.
arXiv Detail & Related papers (2024-08-06T17:35:05Z)
Efficient Discrepancy Testing for Learning with Distribution Shift [17.472049019016524]
We provide the first set of provably efficient algorithms for testing localized discrepancy distance. Results imply a broad set of new, efficient learning algorithms in the recently introduced model of Testable Learning with Distribution Shift.
arXiv Detail & Related papers (2024-06-13T17:51:10Z)
Improve Mathematical Reasoning in Language Models by Automated Process Supervision [23.807288360423193]
We propose a novel divide-and-conquer style Monte Carlo Tree Search (MCTS) algorithm named textitOmegaPRM for the efficient collection of high-quality process supervision data.<n>We are able to collect over 1.5 million process supervision annotations to train Process Reward Models (PRMs)<n>This fully automated process supervision alongside the weighted self-consistency algorithm is able to enhance LLMs' math reasoning performances.
arXiv Detail & Related papers (2024-06-05T19:25:40Z)
Self-Evaluation Guided Beam Search for Reasoning [61.523627290397556]
We introduce a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of Large Language Model (LLM) We propose a decoding algorithm integrating the self-evaluation guidance via beam search. Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by $6.34%$, $9.56%$, and $5.46%$ on the GSM8K, AQuA, and StrategyQA.
arXiv Detail & Related papers (2023-05-01T02:37:59Z)
You Only Need End-to-End Training for Long-Tailed Recognition [8.789819609485225]
Cross-entropy loss tends to produce highly correlated features on imbalanced data. We propose two novel modules, Block-based Relatively Balanced Batch Sampler (B3RS) and Batch Embedded Training (BET) Experimental results on the long-tailed classification benchmarks, CIFAR-LT and ImageNet-LT, demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2021-12-11T11:44:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.