Language Models can Self-Improve at State-Value Estimation for Better Search
- URL: http://arxiv.org/abs/2503.02878v2
- Date: Mon, 07 Jul 2025 16:20:17 GMT
- Title: Language Models can Self-Improve at State-Value Estimation for Better Search
- Authors: Ethan Mendes, Alan Ritter,
- Abstract summary: We present self-taught look (STL), a self-ahead method that leverages state-transition dynamics to improve a value model.<n>We find that specialized value models learned with STL can be deployed with computationally lightweight search algorithms, achieving performance that matches that of more expensive tree search methods.
- Score: 16.933525465335524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Collecting ground-truth rewards or human demonstrations for multi-step reasoning tasks is often prohibitively expensive and time consuming, especially in interactive domains like web tasks. To address this bottleneck, we present self-taught lookahead (STL), a self-supervised method that leverages state-transition dynamics to improve a value model capable of effectively guiding language model-controlled search without any labeled data. We find that moderately sized (8 billion parameters) open-weight value models improved with STL can match the performance of using a gpt-4o value model. Furthermore, we find that specialized value models learned with STL can be deployed with computationally lightweight search algorithms, achieving performance that matches that of more expensive tree search methods, while reducing costs by an order of magnitude.
Related papers
- Beyond Accuracy: Characterizing Code Comprehension Capabilities in (Large) Language Models [4.841487377596519]
This paper investigates whether Large Language Models' code-comprehension performance aligns with traditional human-centric software metrics.<n>We introduce a diagnostic framework that reframes code understanding as a binary input-output consistency task.
arXiv Detail & Related papers (2026-01-19T10:58:24Z) - Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores.<n>Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z) - Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning [20.41220110321494]
We propose Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning.<n>STEER is a domain-agnostic framework that performs fine-grained, step-level routing between smaller and larger language models.<n>Our results establish model-internal confidence as a robust, domain-agnostic signal for model routing.
arXiv Detail & Related papers (2025-11-09T02:33:08Z) - Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach [65.6966065843227]
Iterative Reweight-then-IRO is a framework that performs RL-style alignment of a frozen base model without touching its parameters.<n>At test time, the value functions are used to guide the base model generation via a search-based optimization process.<n> Notably, users can apply IRO to align a model on their own dataset, similar to OpenAI's reinforcement fine-tuning (RFT)
arXiv Detail & Related papers (2025-06-21T21:49:02Z) - Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning [71.3533541927459]
We propose a novel data selection paradigm termed Activation Reasoning Potential (RAP)<n>RAP identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning.<n>Our RAP method consistently achieves superior performance using only 9.3% of the training data, while reducing computational costs by over 43%.
arXiv Detail & Related papers (2025-06-05T08:40:24Z) - Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach.<n>We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding.<n>As a model-free approach, STAND can be applied to any existing language model without additional training.
arXiv Detail & Related papers (2025-06-05T07:31:18Z) - Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications.<n>One core challenge of evaluation in the large language model (LLM) era is the generalization issue.<n>We propose Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z) - EfficientLLaVA:Generalizable Auto-Pruning for Large Vision-language Models [64.18350535770357]
We propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning.
Our approach only leverages a small number of samples to search for the desired pruning policy.
We conduct extensive experiments on the ScienceQA, Vizwiz, MM-vet, and LLaVA-Bench datasets for the task of visual question answering.
arXiv Detail & Related papers (2025-03-19T16:07:04Z) - Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models.<n>We propose self-certainty, a novel and efficient metric to estimate response quality without requiring external reward models.<n>Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z) - S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference.<n>Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z) - Latent Thought Models with Variational Bayes Inference-Time Computation [52.63299874322121]
Latent Thought Models (LTMs) incorporate explicit latent thought vectors that follow an explicit prior model in latent space.<n>LTMs demonstrate superior sample and parameter efficiency compared to autoregressive models and discrete diffusion models.
arXiv Detail & Related papers (2025-02-03T17:50:34Z) - SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [88.29990536278167]
We introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs free from distractions.<n>Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
arXiv Detail & Related papers (2024-12-16T09:47:43Z) - Rational Metareasoning for Large Language Models [17.479428400594028]
Being prompted to engage in reasoning has emerged as a core technique for using large language models (LLMs)<n>This work introduces a novel approach based on computational models of metareasoning used in cognitive science.<n>We develop a reward function that incorporates the Value of Computation by penalizing unnecessary reasoning.
arXiv Detail & Related papers (2024-10-07T23:48:52Z) - Large Language Models Can Self-Improve At Web Agent Tasks [37.17001438055515]
Large language models (LLMs) have recently demonstrated some capability to navigate novel environments as agents in a zero-shot or few-shot fashion.
We explore the extent to which LLMs can self-improve their performance as agents in long-horizon tasks in a complex environment using the WebArena benchmark.
We achieve a 31% improvement in task completion rate over the base model on the WebArena benchmark through a self-improvement procedure.
arXiv Detail & Related papers (2024-05-30T17:52:36Z) - Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - AXOLOTL: Fairness through Assisted Self-Debiasing of Large Language
Model Outputs [20.772266479533776]
AXOLOTL is a novel post-processing framework that operates agnostically across tasks and models.
It identifies biases, proposes resolutions, and guides the model to self-debias its outputs.
This approach minimizes computational costs and preserves model performance.
arXiv Detail & Related papers (2024-03-01T00:02:37Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - OrchestraLLM: Efficient Orchestration of Language Models for Dialogue State Tracking [16.057622631156164]
Large language models (LLMs) have revolutionized the landscape of Natural Language Processing systems, but are computationally expensive.
Previous studies have explored various approaches to harness the potential of Small Language Models (SLMs) as cost-effective alternatives to their larger counterparts.
This work presents a novel SLM/LLM routing framework designed to improve computational efficiency and enhance task performance.
arXiv Detail & Related papers (2023-11-16T10:30:55Z) - GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling [0.0]
We develop GateLoop, a sequence model that generalizes linear recurrent models such as S4, S5, LRU and RetNet.
GateLoop empirically outperforms existing models for auto-regressive language modeling.
We prove that our approach can be interpreted as providing data-controlled relative-positional information to Attention.
arXiv Detail & Related papers (2023-11-03T14:08:39Z) - Teaching Language Models to Self-Improve through Interactive Demonstrations [83.9421355808174]
Self-improving ability of large language models has been shown to be absent and difficult to learn for smaller models.
We introduce TriPosT, a training algorithm that endows smaller models with such self-improvement ability.
We show that our approach can improve a LLaMA-7b's performance on math and reasoning tasks by up to 7.13%.
arXiv Detail & Related papers (2023-10-20T14:11:04Z) - Value function estimation using conditional diffusion models for control [62.27184818047923]
We propose a simple algorithm called Diffused Value Function (DVF)
It learns a joint multi-step model of the environment-robot interaction dynamics using a diffusion model.
We show how DVF can be used to efficiently capture the state visitation measure for multiple controllers.
arXiv Detail & Related papers (2023-06-09T18:40:55Z) - Cheaply Evaluating Inference Efficiency Metrics for Autoregressive
Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z) - Can Deep Learning be Applied to Model-Based Multi-Object Tracking? [25.464269324261636]
Multi-object tracking (MOT) is the problem of tracking the state of an unknown and time-varying number of objects using noisy measurements.
Deep learning (DL) has been increasingly used in MOT for improving tracking performance.
In this paper, we propose a Transformer-based DL tracker and evaluate its performance in the model-based setting.
arXiv Detail & Related papers (2022-02-16T07:43:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.