Q-Probe: A Lightweight Approach to Reward Maximization for Language Models
- URL: http://arxiv.org/abs/2402.14688v2
- Date: Sun, 2 Jun 2024 15:05:59 GMT
- Title: Q-Probe: A Lightweight Approach to Reward Maximization for Language Models
- Authors: Kenneth Li, Samy Jelassi, Hugh Zhang, Sham Kakade, Martin Wattenberg, David Brandfonbrener
- Abstract summary: We present an approach called Q-probing to adapt a pre-trained language model to maximize a task-specific reward function.
At a high level, Q-probing sits between heavier approaches such as finetuning and lighter approaches such as few-shot prompting.
- Score: 16.801981347658625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present an approach called Q-probing to adapt a pre-trained language model to maximize a task-specific reward function. At a high level, Q-probing sits between heavier approaches such as finetuning and lighter approaches such as few-shot prompting, but can also be combined with either. The idea is to learn a simple linear function on a model's embedding space that can be used to reweight candidate completions. We theoretically show that this sampling procedure is equivalent to a KL-constrained maximization of the Q-probe as the number of samples increases. To train the Q-probes we consider either reward modeling or a class of novel direct policy learning objectives based on importance weighted policy gradients. With this technique, we see gains in domains with ground-truth rewards (code generation) as well as implicit rewards defined by preference data, even outperforming finetuning in data-limited regimes. Moreover, a Q-probe can be trained on top of an API since it only assumes access to sampling and embeddings. Code: https://github.com/likenneth/q_probe
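To make the procedure concrete, here is a minimal, hedged sketch of the Q-probe inference step described in the abstract (not the authors' released implementation; see the linked repository for that). The names sample_fn, embed_fn, k, and beta are illustrative placeholders: candidates are drawn from the base model, scored by a linear probe on their embeddings, and one is selected by a softmax reweighting over the probe values. The ridge-regression helper corresponds to the simplest "reward modeling" training option mentioned above.

```python
import numpy as np

def q_probe_sample(prompt, sample_fn, embed_fn, w, k=48, beta=0.1, rng=None):
    """Illustrative Q-probe inference: draw k candidate completions,
    score each with a linear probe on its embedding, and pick one by
    softmax reweighting over the probe values (beta -> 0 approaches argmax).

    sample_fn(prompt) -> completion string   (e.g. an API sampling call)
    embed_fn(prompt, completion) -> 1-D feature vector
    w -> learned linear probe weights (same dimension as the embedding)
    """
    rng = rng or np.random.default_rng()
    candidates = [sample_fn(prompt) for _ in range(k)]
    feats = np.stack([embed_fn(prompt, c) for c in candidates])  # (k, d)
    q = feats @ w                                                # probe values
    logits = q / beta
    p = np.exp(logits - logits.max())                            # stable softmax
    p /= p.sum()
    return candidates[rng.choice(k, p=p)]

def fit_probe_reward_modeling(feats, rewards, l2=1e-3):
    """Simplest training option: ridge regression of observed rewards on
    embeddings, giving the linear probe weights w."""
    d = feats.shape[1]
    return np.linalg.solve(feats.T @ feats + l2 * np.eye(d), feats.T @ rewards)
```

In this sketch, beta plays the role of the KL-regularization strength: as beta shrinks and k grows, the procedure approaches selecting the highest-scoring candidate, in line with the abstract's claim that sampling with the probe approximates a KL-constrained maximization of the Q-probe.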
Related papers
- Scalable Submodular Policy Optimization via Pruned Submodularity Graph [2.8672152503836]
In Reinforcement Learning (RL), an agent interacts with the environment via a set of possible actions, and a reward is generated from some unknown distribution. The task is to find an optimal set of actions such that the reward after a certain time step is maximized.
arXiv Detail & Related papers (2025-07-18T11:42:07Z) - Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions [0.5416466085090772]
We introduce Quantile Reward Policy Optimization (QRPO), which learns from pointwise absolute rewards. QRPO uses quantile rewards to enable regression to the closed-form solution of the KL-regularized RL objective. It consistently achieves top performance on chat and coding evaluations.
arXiv Detail & Related papers (2025-07-10T17:56:24Z) - Provably Efficient and Agile Randomized Q-Learning [35.14581235983678]
We propose a novel variant of the Q-learning algorithm, referred to as RandomizedQ, which integrates sampling-based exploration with agile, step-wise policy updates. Empirically, RandomizedQ outperforms existing Q-learning variants with both bonus-based and Bayesian exploration on standard benchmarks.
arXiv Detail & Related papers (2025-06-30T16:08:29Z) - Discriminative Policy Optimization for Token-Level Reward Models [55.98642069903191]
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs). Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. Reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH.
arXiv Detail & Related papers (2025-05-29T11:40:34Z) - Sample, Don't Search: Rethinking Test-Time Alignment for Language Models [55.2480439325792]
We introduce QAlign, a new test-time alignment approach.
As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt.
By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access.
arXiv Detail & Related papers (2025-04-04T00:41:40Z) - Uncertainty Quantification in Retrieval Augmented Question Answering [57.05827081638329]
We propose to quantify the uncertainty of a QA model by estimating the utility of the passages it is provided with.
We train a lightweight neural model to predict passage utility for a target QA model and show that, while simple information-theoretic metrics can predict answer correctness to some extent, our approach efficiently approximates or outperforms more expensive sampling-based methods.
arXiv Detail & Related papers (2025-02-25T11:24:52Z) - Walking the Values in Bayesian Inverse Reinforcement Learning [66.68997022043075]
A key challenge in Bayesian IRL is bridging the computational gap between the hypothesis space of possible rewards and the likelihood.
We propose ValueWalk, a new Markov chain Monte Carlo method based on this insight.
arXiv Detail & Related papers (2024-07-15T17:59:52Z) - Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion [44.95386817008473]
We introduce Contrastive Policy Gradient, or CoPG, a simple and mathematically principled new RL algorithm that can estimate the optimal policy even from off-policy data.
We show that this approach generalizes the direct alignment method IPO (identity preference optimization) as well as the classic policy gradient.
We experiment with the proposed CoPG on a toy bandit problem to illustrate its properties, as well as for finetuning LLMs on a summarization task.
arXiv Detail & Related papers (2024-06-27T14:03:49Z) - From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function [50.812404038684505]
We show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation.
We discuss applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.
arXiv Detail & Related papers (2024-04-18T17:37:02Z) - Learning a Diffusion Model Policy from Rewards via Q-Score Matching [93.0191910132874]
We present a theoretical framework linking the structure of diffusion model policies to a learned Q-function.
We propose a new policy update method from this theory, which we denote Q-score matching.
arXiv Detail & Related papers (2023-12-18T23:31:01Z) - Q-Pensieve: Boosting Sample Efficiency of Multi-Objective RL Through Memory Sharing of Q-Snapshots [11.533449955841968]
We propose Q-Pensieve, a policy improvement scheme that stores a collection of Q-snapshots to jointly determine the policy update direction.
We show that Q-Pensieve can be naturally integrated with soft policy iteration with a convergence guarantee.
arXiv Detail & Related papers (2022-12-06T16:29:47Z) - Quantile Filtered Imitation Learning [49.11859771578969]
Quantile filtered imitation learning (QFIL) is a policy improvement operator designed for offline reinforcement learning.
We prove that QFIL gives us a safe policy improvement step with function approximation.
We see that QFIL performs well on the D4RL benchmark.
arXiv Detail & Related papers (2021-12-02T03:08:23Z) - Model-Augmented Q-learning [112.86795579978802]
We propose a model-free RL (MFRL) framework that is augmented with the components of model-based RL.
Specifically, we propose to estimate not only the $Q$-values but also both the transition and the reward with a shared network.
We show that the proposed scheme, called Model-augmented $Q$-learning (MQL), obtains a policy-invariant solution which is identical to the solution obtained by learning with the true reward.
arXiv Detail & Related papers (2021-02-07T17:56:50Z) - Less is More: Data-Efficient Complex Question Answering over Knowledge Bases [26.026065844896465]
We propose the Neural-Symbolic Complex Question Answering (NS-CQA) model, a data-efficient reinforcement learning framework for complex question answering.
Our framework consists of a neural generator and a symbolic executor that transforms a natural-language question into a sequence of primitive actions.
Our model is evaluated on two datasets: CQA, a recent large-scale complex question answering dataset, and WebQuestionsSP, a multi-hop question answering dataset.
arXiv Detail & Related papers (2020-10-29T18:42:44Z) - Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.