Value-Guided Search for Efficient Chain-of-Thought Reasoning
- URL: http://arxiv.org/abs/2505.17373v1
- Date: Fri, 23 May 2025 01:05:07 GMT
- Title: Value-Guided Search for Efficient Chain-of-Thought Reasoning
- Authors: Kaiwen Wang, Jin Peng Zhou, Jonathan Chang, Zhaolin Gao, Nathan Kallus, Kianté Brantley, Wen Sun,
- Abstract summary: We train a 1.5B token-level value model and apply it to DeepSeek models for improved performance with test-time compute scaling.<n>With an inference budget of 64 generations, VGS with DeepSeek-R1-Distill-1.5B achieves an average accuracy of 45.7% across four competition math benchmarks.
- Score: 43.99559903458839
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a simple and efficient method for value model training on long-context reasoning traces. Compared to existing process reward models (PRMs), our method does not require a fine-grained notion of "step," which is difficult to define for long-context reasoning models. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B token-level value model and apply it to DeepSeek models for improved performance with test-time compute scaling. We find that block-wise value-guided search (VGS) with a final weighted majority vote achieves better test-time scaling than standard methods such as majority voting or best-of-n. With an inference budget of 64 generations, VGS with DeepSeek-R1-Distill-1.5B achieves an average accuracy of 45.7% across four competition math benchmarks (AIME 2024 & 2025, HMMT Feb 2024 & 2025), reaching parity with o3-mini-medium. Moreover, VGS significantly reduces the inference FLOPs required to achieve the same performance of majority voting. Our dataset, model and codebase are open-sourced.
Related papers
- Logit Arithmetic Elicits Long Reasoning Capabilities Without Training [14.015546463427732]
Large reasoning models (LRMs) can do complex reasoning via long chain-of-thought (CoT) involving cognitive strategies such as backtracking and self-correction.<n>Recent studies suggest that some models inherently possess these long reasoning abilities, which may be unlocked via extra training.<n>We propose a decoding-time approach, ThinkLogit, to tune a target large LM for long reasoning using a substantially smaller model as guider.
arXiv Detail & Related papers (2025-07-17T03:31:36Z) - Kinetics: Rethinking Test-Time Scaling Laws [18.325591438335007]
Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a threshold than smaller ones.<n>Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples.
arXiv Detail & Related papers (2025-06-05T17:59:24Z) - Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach.<n>We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding.<n>As a model-free approach, STAND can be applied to any existing language model without additional training.
arXiv Detail & Related papers (2025-06-05T07:31:18Z) - TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression [55.37723860832064]
We propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations.<n>We validate our approach across models on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B and on a diverse set of benchmarks with varying difficulty levels.
arXiv Detail & Related papers (2025-06-03T09:23:41Z) - Reasoning-Based AI for Startup Evaluation (R.A.I.S.E.): A Memory-Augmented, Multi-Step Decision Framework [0.0]
We present a novel framework that bridges the gap between the interpretability of decision trees and the advanced reasoning capabilities of large language models (LLMs) to predict startup success.<n>Our approach leverages chain-of-thought prompting to generate detailed reasoning logs, which are subsequently distilled into structured, human-understandable logical rules.<n>Our method not only augments traditional decision-making processes but also facilitates expert intervention and continuous policy refinement.
arXiv Detail & Related papers (2025-04-16T13:53:42Z) - Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning [231.11339402237903]
We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding.<n>Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA.<n>It demonstrates excellent reasoning abilities in STEM and coding.
arXiv Detail & Related papers (2025-04-10T17:10:51Z) - Scaling Test-Time Compute Without Verification or RL is Suboptimal [70.28430200655919]
We show that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget.<n>We corroborate our theory empirically on both didactic and math reasoning problems with 3/8B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
arXiv Detail & Related papers (2025-02-17T18:43:24Z) - Min-K%++: Improved Baseline for Detecting Pre-Training Data from Large Language Models [15.50128790503447]
We propose a novel and theoretically motivated methodology for pre-training data detection, named Min-K%++.<n>Specifically, we present a key insight that training samples tend to be local maxima of the modeled distribution along each input dimension through likelihood training.
arXiv Detail & Related papers (2024-04-03T04:25:01Z) - Reducing Variance in Temporal-Difference Value Estimation via Ensemble
of Deep Networks [109.59988683444986]
MeanQ is a simple ensemble method that estimates target values as ensemble means.
We show that MeanQ shows remarkable sample efficiency in experiments on the Atari Learning Environment benchmark.
arXiv Detail & Related papers (2022-09-16T01:47:36Z) - When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
In order to achieve a better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the queries of decoder from the inputs, enabling the model to achieve as good accuracy as the ones with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z) - Efficiently Teaching an Effective Dense Retriever with Balanced Topic
Aware Sampling [37.01593605084575]
TAS-Balanced is an efficient topic-aware query and balanced margin sampling technique.
We show that our TAS-Balanced training method achieves state-of-the-art low-latency (64ms per query) results on two TREC Deep Learning Track query sets.
arXiv Detail & Related papers (2021-04-14T16:49:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.