Value-Guided Search for Efficient Chain-of-Thought Reasoning
- URL: http://arxiv.org/abs/2505.17373v2
- Date: Tue, 30 Sep 2025 13:12:37 GMT
- Title: Value-Guided Search for Efficient Chain-of-Thought Reasoning
- Authors: Kaiwen Wang, Jin Peng Zhou, Jonathan Chang, Zhaolin Gao, Nathan Kallus, Kianté Brantley, Wen Sun
- Abstract summary: We propose a simple and efficient method for value model training on long-context reasoning traces. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B token-level value model. We find that block-wise value-guided search (VGS) with a final weighted majority vote achieves better test-time scaling than standard methods.
- Score: 49.971608979012366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a simple and efficient method for value model training on long-context reasoning traces. Compared to existing process reward models (PRMs), our method does not require a fine-grained notion of "step," which is difficult to define for long-context reasoning models. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B token-level value model and apply it to DeepSeek models for improved performance with test-time compute scaling. We find that block-wise value-guided search (VGS) with a final weighted majority vote achieves better test-time scaling than standard methods such as majority voting or best-of-n. Moreover, VGS significantly reduces the inference FLOPs required to achieve the same performance as majority voting. Our dataset, model, and codebase are open-sourced.
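The abstract compares three ways of choosing a final answer from several sampled responses: majority voting, best-of-n, and a weighted majority vote. The sketch below illustrates only that aggregation step, assuming each response has already been reduced to a final answer string and a scalar score in [0, 1] from some value model. The function names, the toy answers, and the scores are hypothetical and are not taken from the paper's codebase; the block-wise search itself is not shown.

```python
from collections import defaultdict


def majority_vote(answers):
    """Return the most frequent answer, ignoring scores."""
    counts = defaultdict(int)
    for answer in answers:
        counts[answer] += 1
    return max(counts, key=counts.get)


def best_of_n(answers, scores):
    """Return the answer of the single highest-scoring response."""
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


def weighted_majority_vote(answers, scores):
    """Return the answer whose responses have the largest total score."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)


# Hypothetical final answers from six sampled responses, with made-up
# value-model scores (assumed to lie in [0, 1]).
answers = ["42", "42", "17", "17", "17", "8"]
scores = [0.6, 0.7, 0.3, 0.3, 0.3, 0.9]

print(majority_vote(answers))                   # prints 17
print(best_of_n(answers, scores))               # prints 8
print(weighted_majority_vote(answers, scores))  # prints 42
```

On this toy input the three rules disagree: majority voting follows the most common answer, best-of-n follows the single most confident response, and the weighted vote favors the answer with the most total score behind it.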
Related papers
- Advanced Black-Box Tuning of Large Language Models with Limited API Calls [20.29862533577494]
Black-box tuning is an emerging paradigm for adapting large language models (LLMs) to better achieve desired behaviors. We propose a novel advanced black-box tuning method for LLMs with limited API calls. Our approach elevates pre-trained language model accuracy from 55.92% to 86.85%, reducing the frequency of API queries to merely 1.38%.
arXiv Detail & Related papers (2025-11-13T11:32:08Z) - Logit Arithmetic Elicits Long Reasoning Capabilities Without Training [21.054461373109522]
We show that ThinkLogit can tune a target large non-reasoning model for long reasoning using a substantially smaller reasoning model as the guider. Experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve a relative improvement in average accuracy by 24.5% and 29.1%, respectively.
arXiv Detail & Related papers (2025-10-10T13:07:14Z) - Logit Arithmetic Elicits Long Reasoning Capabilities Without Training [14.015546463427732]
Large reasoning models (LRMs) can do complex reasoning via long chain-of-thought (CoT) involving cognitive strategies such as backtracking and self-correction. Recent studies suggest that some models inherently possess these long reasoning abilities, which may be unlocked via extra training. We propose a decoding-time approach, ThinkLogit, to tune a target large LM for long reasoning using a substantially smaller model as guider.
arXiv Detail & Related papers (2025-07-17T03:31:36Z) - Kinetics: Rethinking Test-Time Scaling Laws [18.325591438335007]
Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a threshold than smaller ones. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples.
arXiv Detail & Related papers (2025-06-05T17:59:24Z) - Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach. We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding. As a model-free approach, STAND can be applied to any existing language model without additional training.
arXiv Detail & Related papers (2025-06-05T07:31:18Z) - TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression [55.37723860832064]
We propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations. We validate our approach across models on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B and on a diverse set of benchmarks with varying difficulty levels.
arXiv Detail & Related papers (2025-06-03T09:23:41Z) - Reasoning-Based AI for Startup Evaluation (R.A.I.S.E.): A Memory-Augmented, Multi-Step Decision Framework [0.0]
We present a novel framework that bridges the gap between the interpretability of decision trees and the advanced reasoning capabilities of large language models (LLMs) to predict startup success. Our approach leverages chain-of-thought prompting to generate detailed reasoning logs, which are subsequently distilled into structured, human-understandable logical rules. Our method not only augments traditional decision-making processes but also facilitates expert intervention and continuous policy refinement.
arXiv Detail & Related papers (2025-04-16T13:53:42Z) - Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning [231.11339402237903]
We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA. It demonstrates excellent reasoning abilities in STEM and coding.
arXiv Detail & Related papers (2025-04-10T17:10:51Z) - Scaling Test-Time Compute Without Verification or RL is Suboptimal [70.28430200655919]
We show that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. We corroborate our theory empirically on both didactic and math reasoning problems with 3/8B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
arXiv Detail & Related papers (2025-02-17T18:43:24Z) - Step-level Value Preference Optimization for Mathematical Reasoning [6.318873143509028]
We introduce a novel algorithm called Step-level Value Preference Optimization (SVPO).
Our method achieves state-of-the-art performance on both in-domain and out-of-domain mathematical reasoning benchmarks.
arXiv Detail & Related papers (2024-06-16T09:06:17Z) - Min-K%++: Improved Baseline for Detecting Pre-Training Data from Large Language Models [15.50128790503447]
We propose a novel and theoretically motivated methodology for pre-training data detection, named Min-K%++. Specifically, we present a key insight that training samples tend to be local maxima of the modeled distribution along each input dimension through likelihood training.
arXiv Detail & Related papers (2024-04-03T04:25:01Z) - Reducing Variance in Temporal-Difference Value Estimation via Ensemble of Deep Networks [109.59988683444986]
MeanQ is a simple ensemble method that estimates target values as ensemble means.
We show that MeanQ achieves remarkable sample efficiency in experiments on the Atari Learning Environment benchmark.
arXiv Detail & Related papers (2022-09-16T01:47:36Z) - When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
To achieve better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the queries of the decoder from the inputs, enabling the model to achieve accuracy as good as that of models with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z) - Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling [37.01593605084575]
TAS-Balanced is an efficient topic-aware query and balanced margin sampling technique.
We show that our TAS-Balanced training method achieves state-of-the-art low-latency (64ms per query) results on two TREC Deep Learning Track query sets.
arXiv Detail & Related papers (2021-04-14T16:49:18Z) - The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit"
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.