Related papers: Iterative Value Function Optimization for Guided Decoding

URL: http://arxiv.org/abs/2503.02368v2
Date: Wed, 05 Mar 2025 09:12:25 GMT
Title: Iterative Value Function Optimization for Guided Decoding
Authors: Zhenhua Liu, Lijun Li, Ruizhe Chen, Yuxian Jiang, Tong Zhu, Zhaochen Su, Wenliang Chen, Jing Shao
Abstract summary: Guided decoding, especially value-guided methods, offers a cost-effective alternative to Reinforcement Learning from Human Feedback. The accuracy of the value function is crucial for value-guided decoding, as inaccuracies can lead to suboptimal decision-making. Existing methods struggle with accurately estimating the optimal value function, leading to less effective control.
Score: 20.188412650073225
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While Reinforcement Learning from Human Feedback (RLHF) has become the predominant method for controlling language model outputs, it suffers from high computational costs and training instability. Guided decoding, especially value-guided methods, offers a cost-effective alternative by controlling outputs without re-training models. However, the accuracy of the value function is crucial for value-guided decoding, as inaccuracies can lead to suboptimal decision-making and degraded performance. Existing methods struggle with accurately estimating the optimal value function, leading to less effective control. We propose Iterative Value Function Optimization, a novel framework that addresses these limitations through two key components: Monte Carlo Value Estimation, which reduces estimation variance by exploring diverse trajectories, and Iterative On-Policy Optimization, which progressively improves value estimation through collecting trajectories from value-guided policies. Extensive experiments on text summarization, multi-turn dialogue, and instruction following demonstrate the effectiveness of value-guided decoding approaches in aligning language models. These approaches not only achieve alignment but also significantly reduce computational costs by leveraging principled value function optimization for efficient and effective control.
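
The two ideas named in the abstract, Monte Carlo value estimation and value-guided decoding, can be illustrated with a toy sketch. This is not the authors' implementation: the reweighting rule (base log-probability plus `beta` times the estimated value, then renormalized) is a common form of value-guided decoding assumed here for illustration, and the names `mc_value_estimate`, `value_guided_distribution`, `toy_sampler`, `toy_reward`, and `beta` are all hypothetical.

```python
import math
import random
from statistics import mean


def mc_value_estimate(prefix, sample_continuation, reward_fn, num_rollouts=8, rng=None):
    """Estimate the value of `prefix` by averaging rewards over sampled rollouts."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the toy example is reproducible
    returns = [
        reward_fn(prefix + sample_continuation(prefix, rng))
        for _ in range(num_rollouts)
    ]
    return mean(returns)


def value_guided_distribution(base_logprobs, values, beta=1.0):
    """Reweight base next-token log-probs by beta * value and renormalize.

    Assumed rule (not taken from the paper): p(tok) is proportional to
    p_base(tok) * exp(beta * V(tok)).
    """
    scores = {tok: lp + beta * values[tok] for tok, lp in base_logprobs.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    norm = sum(math.exp(s - top) for s in scores.values())
    return {tok: math.exp(s - top) / norm for tok, s in scores.items()}


# Toy stand-ins for a language model and a reward model (both hypothetical).
def toy_sampler(prefix, rng):
    return [rng.choice(["good", "bad"]) for _ in range(4)]


def toy_reward(tokens):
    return sum(t == "good" for t in tokens) / len(tokens)


if __name__ == "__main__":
    prefix = ["good"]
    base = {"good": math.log(0.5), "bad": math.log(0.5)}
    values = {
        tok: mc_value_estimate(prefix + [tok], toy_sampler, toy_reward)
        for tok in base
    }
    print(value_guided_distribution(base, values, beta=2.0))
```

Averaging over more rollouts (`num_rollouts`) lowers the variance of each value estimate, which is the role the abstract assigns to Monte Carlo Value Estimation. The iterative on-policy step described in the abstract would repeat this process with rollouts sampled from the value-guided policy rather than the base policy; that loop is omitted here.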

Related papers

Cost-aware Stopping for Bayesian Optimization [53.34052774820105]
We propose a cost-aware stopping rule for Bayesian optimization that adapts to varying evaluation costs and is free of tuning. We prove a theoretical guarantee bounding the expected cumulative evaluation cost incurred by our stopping rule when paired with state-of-the-art acquisition functions.
arXiv Detail & Related papers (2025-07-16T17:54:14Z)
A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning [61.403275660120606]
Reinforcement learning (RL)-based fine-tuning has emerged as a powerful approach for aligning diffusion models with black-box objectives. We propose leave-one-out PPO (LOOP), a novel RL method for diffusion fine-tuning. Our results demonstrate that LOOP effectively improves diffusion models on various black-box objectives.
arXiv Detail & Related papers (2025-03-02T13:43:53Z)
Direct Value Optimization: Improving Chain-of-Thought Reasoning in LLMs with Refined Values [31.415598465903884]
Direct Value Optimization (DVO) is an innovative reinforcement learning framework for enhancing large language models in complex reasoning tasks. DVO utilizes value signals at individual reasoning steps, optimizing models via a mean squared error loss. Our empirical analysis on both mathematical and commonsense reasoning tasks shows that DVO consistently outperforms existing offline preference optimization techniques.
arXiv Detail & Related papers (2025-02-19T13:51:05Z)
Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD delivers significant efficiency gains over decoding with the target model only, while achieving significantly better accuracy than the parallel decoding method on average.
arXiv Detail & Related papers (2025-01-31T17:19:57Z)
Efficient Estimation and Sequential Optimization of Cost Functions in Variational Quantum Algorithms [1.4981317129908267]
We introduce a novel optimization methodology that conceptualizes the parameterized quantum circuit as a weighted sum of distinct unitary operators. This representation facilitates the efficient evaluation of nonlocal characteristics of cost functions, as well as their arbitrary derivatives. Our findings reveal substantial enhancements in convergence speed and accuracy relative to traditional optimization methods.
arXiv Detail & Related papers (2024-12-30T14:24:53Z)
Direct Preference Optimization Using Sparse Feature-Level Constraints [47.15096507230884]
Feature-level constrained Preference Optimization is a novel method designed to simplify the alignment process while ensuring stability. Our approach enjoys efficiency by using sparse features activated in a well-trained sparse autoencoder and the quality of sequential KL divergence.
arXiv Detail & Related papers (2024-11-12T07:54:13Z)
Landscape-Sketch-Step: An AI/ML-Based Metaheuristic for Surrogate Optimization Problems [0.0]
We introduce a new metaheuristic for global optimization in scenarios where extensive evaluations of the cost function are expensive, inaccessible, or even prohibitive. The method, which we call Landscape-Sketch-and-Step (LSS), combines Machine Learning, Replica Optimization, and Reinforcement Learning techniques.
arXiv Detail & Related papers (2023-09-14T01:53:45Z)
Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde{O}(d\sqrt{H^3K})$. Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z)
Neural Solvers for Fast and Accurate Numerical Optimal Control [12.80824586913772]
This paper provides techniques to improve the quality of optimized control policies given a fixed computational budget. We achieve the above via a hypersolvers approach, which hybridizes a differential equation solver and a neural network.
arXiv Detail & Related papers (2022-03-13T10:46:50Z)
Implicit Rate-Constrained Optimization of Non-decomposable Objectives [37.43791617018009]
We consider a family of constrained optimization problems arising in machine learning. Our key idea is to formulate a rate-constrained optimization that expresses the threshold parameter as a function of the model parameters. We show how the resulting optimization problem can be solved using standard gradient based methods.
arXiv Detail & Related papers (2021-07-23T00:04:39Z)
Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation. We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs. The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.