Transfer Q Star: Principled Decoding for LLM Alignment
- URL: http://arxiv.org/abs/2405.20495v1
- Date: Thu, 30 May 2024 21:36:12 GMT
- Title: Transfer Q Star: Principled Decoding for LLM Alignment
- Authors: Souradip Chakraborty, Soumya Suvra Ghosal, Ming Yin, Dinesh Manocha, Mengdi Wang, Amrit Singh Bedi, Furong Huang
- Abstract summary: Transfer $Q^*$ estimates the optimal value function for a target reward $r$ through a baseline model.
Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods.
- Score: 105.89114186982972
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aligning foundation models is essential for their safe and trustworthy deployment. However, traditional fine-tuning methods are computationally intensive and require updating billions of model parameters. A promising alternative, alignment via decoding, adjusts the response distribution directly without model updates to maximize a target reward $r$, thus providing a lightweight and adaptable framework for alignment. However, principled decoding methods rely on oracle access to an optimal Q-function ($Q^*$), which is often unavailable in practice. Hence, prior SoTA methods either approximate this $Q^*$ using $Q^{\pi_{\texttt{sft}}}$ (derived from the reference $\texttt{SFT}$ model) or rely on short-term rewards, resulting in sub-optimal decoding performance. In this work, we propose Transfer $Q^*$, which implicitly estimates the optimal value function for a target reward $r$ through a baseline model $\rho_{\texttt{BL}}$ aligned with a baseline reward $r_{\texttt{BL}}$ (which can be different from the target reward $r$). Theoretical analyses of Transfer $Q^*$ provide a rigorous characterization of its optimality, deriving an upper bound on the sub-optimality gap and identifying a hyperparameter to control the deviation from the pre-trained reference $\texttt{SFT}$ model based on user needs. Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods and demonstrates superior empirical performance across key metrics such as coherence, diversity, and quality in extensive tests on several synthetic and real datasets.
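As a rough, assumption-laden sketch of what decoding against an estimated $Q^*$ can look like (not the paper's actual implementation): each candidate next token is scored by its SFT log-probability plus $\alpha$ times a value estimate obtained by letting a baseline-aligned model complete the response and scoring that completion with the target reward $r$. The callables `sft_logprobs`, `baseline_rollout`, and `target_reward` below are hypothetical placeholders.

```python
import math
from typing import Callable, Dict, List, Sequence


def transfer_q_decode_step(
    prefix: List[str],
    sft_logprobs: Callable[[Sequence[str]], Dict[str, float]],
    baseline_rollout: Callable[[Sequence[str]], List[str]],
    target_reward: Callable[[Sequence[str]], float],
    alpha: float = 1.0,  # trade-off between chasing the reward and staying near the SFT model
    top_k: int = 8,
) -> str:
    """Pick the next token by maximizing  log pi_sft(z | s) + alpha * Q_hat(s, z).

    Q_hat(s, z) is estimated implicitly: continue (prefix + z) with the
    baseline-aligned model rho_BL and score the completion with the target reward r.
    """
    logprobs = sft_logprobs(prefix)
    candidates = sorted(logprobs, key=logprobs.get, reverse=True)[:top_k]

    best_token, best_score = None, -math.inf
    for token in candidates:
        completion = baseline_rollout(list(prefix) + [token])  # rollout under rho_BL
        q_hat = target_reward(completion)                      # score with target reward r
        score = logprobs[token] + alpha * q_hat
        if score > best_score:
            best_token, best_score = token, score
    return best_token
```

Here `alpha` stands in for the deviation-control hyperparameter mentioned in the abstract: larger values push decoding harder toward the target reward, while smaller values keep the response distribution closer to the reference SFT model.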
Related papers
- Active Preference Optimization for Sample Efficient RLHF [27.772423917657626]
Reinforcement Learning from Human Feedback (RLHF) is pivotal in aligning Large Language Models with human preferences.
Current methods rely on uniformly sampling prompt-generation pairs from a fixed dataset.
We develop an active-learning algorithm, $\texttt{APO}$, which enhances model alignment by querying preference data.
arXiv Detail & Related papers (2024-02-16T08:19:34Z)
- Online non-parametric likelihood-ratio estimation by Pearson-divergence functional minimization [55.98760097296213]
We introduce a new framework for online non-parametric LRE (OLRE) for the setting where pairs of i.i.d. observations $(x_t \sim p, x'_t \sim q)$ are observed over time.
We provide theoretical guarantees for the performance of the OLRE method along with empirical validation in synthetic experiments.
arXiv Detail & Related papers (2023-11-03T13:20:11Z)
- Bayes beats Cross Validation: Efficient and Accurate Ridge Regression via Expectation Maximization [3.061662434597098]
We present a method for tuning the regularization hyperparameter, $\lambda$, that is faster to compute than leave-one-out cross-validation (LOOCV).
We show that the proposed method is guaranteed to find a unique optimal solution for large enough $n$, under relatively mild conditions.
arXiv Detail & Related papers (2023-10-29T01:13:55Z)
- Offline Primal-Dual Reinforcement Learning for Linear MDPs [16.782625445546273]
Offline Reinforcement Learning (RL) aims to learn a near-optimal policy from a fixed dataset of transitions collected by another policy.
This paper proposes a primal-dual optimization method based on the linear programming formulation of RL.
arXiv Detail & Related papers (2023-05-22T11:45:23Z)
- Bayesian Learning via Q-Exponential Process [10.551294837978363]
Regularization is one of the most fundamental topics in optimization, statistics and machine learning.
In this work, we generalize the $q$-exponential distribution (with density proportional to $\exp(-\frac{1}{2}|u|^q)$) to a process named the $Q$-exponential (Q-EP) process, which corresponds to the $L_q$ regularization of functions (a short gloss on this correspondence follows the entry).
arXiv Detail & Related papers (2022-10-14T17:37:14Z)
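A one-line gloss on the correspondence claimed in the entry above (my reading, not taken from the paper): a prior with density proportional to $\exp(-\tfrac{1}{2}|u|^q)$ turns MAP estimation into $L_q$-penalized fitting, since the normalizing constant does not depend on $u$.

```latex
% MAP estimation under a q-exponential prior is L_q-regularized fitting:
p(u) \propto \exp\!\Bigl(-\tfrac{1}{2}|u|^{q}\Bigr)
\quad\Longrightarrow\quad
\hat{u}_{\mathrm{MAP}}
  = \arg\max_{u}\bigl[\log p(\mathcal{D}\mid u) + \log p(u)\bigr]
  = \arg\min_{u}\Bigl[-\log p(\mathcal{D}\mid u) + \tfrac{1}{2}|u|^{q}\Bigr].
```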
- Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards (a minimal sketch of the redistribution step follows this entry).
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
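A minimal sketch of the least-squares redistribution step described in the PARTED entry above, under the assumption of linear per-step proxy rewards $\hat r(s_t, a_t) = \phi(s_t, a_t)^\top \theta$ whose sum matches the observed trajectory return; the function and variable names are hypothetical, not PARTED's actual implementation.

```python
import numpy as np


def redistribute_rewards(trajectories, returns, ridge=1e-6):
    """Least-squares reward redistribution (a sketch of the idea, not PARTED itself).

    trajectories : list of arrays, each of shape (T_i, d); row t is a feature
                   vector phi(s_t, a_t) for step t of trajectory i.
    returns      : array of shape (n,); the observed trajectory-wise returns R(tau_i).

    Fits theta so that sum_t phi(s_t, a_t)^T theta ~= R(tau_i), then returns the
    per-step proxy rewards phi(s_t, a_t)^T theta for every trajectory.
    """
    # Design matrix: one row per trajectory, the sum of that trajectory's step features.
    X = np.stack([traj.sum(axis=0) for traj in trajectories])  # shape (n, d)
    y = np.asarray(returns, dtype=float)                       # shape (n,)

    # Ridge-regularized least squares: theta = (X^T X + ridge * I)^{-1} X^T y.
    d = X.shape[1]
    theta = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)

    # Per-step proxy rewards, to be consumed by a pessimistic value-iteration step.
    return [traj @ theta for traj in trajectories]
```

The resulting proxy rewards then stand in for the missing step-level rewards when a pessimistic offline RL method is run on the dataset.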
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for preference-based RL (PbRL) with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- UVIP: Model-Free Approach to Evaluate Reinforcement Learning Algorithms [5.582861438320171]
Policy evaluation is an important instrument for the comparison of different algorithms in Reinforcement Learning (RL).
We present a novel model-free upper value iteration procedure ($\textsf{UVIP}$) that allows us to estimate the suboptimality gap $V^*(x) - V^{\pi}(x)$ from above and to construct confidence intervals for $V^*$.
We provide theoretical guarantees for $\textsf{UVIP}$ under general assumptions and illustrate its performance on a number of benchmark RL problems.
arXiv Detail & Related papers (2021-05-05T15:38:36Z)
- Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.