Reward Shaping for Inference-Time Alignment: A Stackelberg Game Perspective
- URL: http://arxiv.org/abs/2602.02572v1
- Date: Sat, 31 Jan 2026 05:45:51 GMT
- Title: Reward Shaping for Inference-Time Alignment: A Stackelberg Game Perspective
- Authors: Haichuan Wang, Tao Lin, Lingkai Kong, Ce Li, Hezi Jiang, Milind Tambe,
- Abstract summary: We show that a simple reward shaping scheme can effectively approximate the optimal reward model. Our method consistently improves average reward and achieves win-tie rates exceeding 66% against all baselines, averaged across evaluation settings.
- Score: 33.36936642383929
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing alignment methods directly use the reward model learned from user preference data to optimize an LLM policy, subject to KL regularization with respect to the base policy. This practice is suboptimal for maximizing the user's utility, because the KL regularization may cause the LLM to inherit biases in the base policy that conflict with user preferences. While amplifying rewards for preferred outputs can mitigate this bias, it also increases the risk of reward hacking. This tradeoff motivates the problem of optimally designing reward models under KL regularization. We formalize this reward model optimization problem as a Stackelberg game, and show that a simple reward shaping scheme can effectively approximate the optimal reward model. We empirically evaluate our method in inference-time alignment settings and demonstrate that it integrates seamlessly into existing alignment methods with minimal overhead. Our method consistently improves average reward and achieves win-tie rates exceeding 66% against all baselines, averaged across evaluation settings.
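To make the setting concrete: under KL regularization with coefficient beta, the optimal policy is proportional to pi_base(y|x) * exp(r(x,y)/beta), so amplifying the reward pushes the policy further away from the base policy's biases. The sketch below is a minimal, hypothetical illustration of inference-time alignment via reranking candidates sampled from the base policy with a linearly scaled reward; the names shaped_softmax_rerank, reward_fn, and scale are illustrative, and the linear scaling is only a stand-in for reward shaping, not the Stackelberg-derived scheme from the paper.

```python
import math
import random

def shaped_softmax_rerank(candidates, reward_fn, beta=1.0, scale=2.0, seed=0):
    """Inference-time alignment sketch: candidates are assumed to be sampled
    from the base policy elsewhere, then reranked with a *shaped* (here,
    linearly scaled) reward. Selection weights follow the KL-regularized
    optimal policy pi*(y|x) proportional to pi_base(y|x) * exp(r(x, y) / beta);
    scaling r by `scale` amplifies preferred outputs relative to the base
    policy's bias, at the cost of a higher reward-hacking risk.
    """
    rewards = [scale * reward_fn(y) for y in candidates]
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp((r - m) / beta) for r in rewards]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.Random(seed).choices(candidates, weights=probs, k=1)[0]

# Toy usage: a stand-in reward that prefers longer answers plays the role of a learned reward model.
if __name__ == "__main__":
    cands = ["short answer", "a somewhat longer answer", "the longest candidate answer here"]
    best = shaped_softmax_rerank(cands, reward_fn=lambda y: len(y.split()), beta=1.0, scale=2.0)
    print(best)
```

Setting scale to 1 recovers plain KL-regularized reranking toward the base policy; larger values trade fidelity to the base policy for higher expected reward.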
Related papers
- Real-Time Aligned Reward Model beyond Semantics [49.717236911878224]
We introduce R2M (Real-Time Aligned Reward Model), a novel lightweight RLHF framework. R2M goes beyond vanilla reward models that solely depend on the semantic representations of a pretrained LLM. This work points to a promising new direction for improving the performance of reward models through real-time utilization of feedback from policy models.
arXiv Detail & Related papers (2026-01-30T07:32:35Z)
- GDRO: Group-level Reward Post-training Suitable for Diffusion Models [55.948229011478304]
Group-level rewards successfully align the model with the targeted reward. Group-level Direct Reward Optimization (GDRO) is a new post-training paradigm for group-level reward alignment. GDRO supports fully offline training, which avoids the large time cost of image rollout sampling. It is diffusion-sampler-independent, which eliminates the need for the ODE-to-SDE approximation.
arXiv Detail & Related papers (2026-01-05T11:47:18Z)
- PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training [9.093854840532062]
PITA is a novel framework that integrates preference feedback directly into the LLM's token generation. PITA learns a small preference-based guidance policy to modify token probabilities at inference time without fine-tuning. We evaluate PITA across diverse tasks, including mathematical reasoning and sentiment classification.
arXiv Detail & Related papers (2025-07-26T21:46:32Z)
- Towards Reliable, Uncertainty-Aware Alignment [12.63619480522393]
We study the variability of reward model training on open-source benchmarks. We propose a variance-aware policy optimization framework for preference-based alignment.
arXiv Detail & Related papers (2025-07-21T12:51:29Z)
- CREAM: Consistency Regularized Self-Rewarding Language Models [34.325289477993586]
Self-rewarding large language models (LLMs) have successfully applied LLM-as-a-Judge to improve alignment performance without the need for human-annotated preference data. However, there is no guarantee that the rewarding and ranking are accurate, which is critical for obtaining reliable rewards and high-quality preference data. We propose a Consistency Regularized sElf-rewarding lAnguage Model (CREAM) that leverages the consistency of rewards across different iterations to regularize the self-rewarding training.
arXiv Detail & Related papers (2024-10-16T16:51:01Z)
- AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization [45.46582930202524]
$\alpha$-DPO is an adaptive preference optimization algorithm for large language models. It balances the policy model and the reference model to achieve personalized reward margins. It consistently outperforms DPO and SimPO across various model settings.
arXiv Detail & Related papers (2024-10-14T04:29:57Z)
- Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion [43.77763433288893]
We introduce Contrastive Policy Gradient, or CoPG, a simple and mathematically principled new RL algorithm that can estimate the optimal policy even from off-policy data. We show that this approach generalizes the direct alignment method IPO (identity preference optimization) and classic policy gradient. We experiment with the proposed CoPG on a toy bandit problem to illustrate its properties, as well as for fine-tuning LLMs on a summarization task.
arXiv Detail & Related papers (2024-06-27T14:03:49Z)
- WARP: On the Benefits of Weight Averaged Rewarded Policies [66.95013068137115]
We introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP).
WARP merges policies in the weight space at three distinct stages.
Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
arXiv Detail & Related papers (2024-06-24T16:24:34Z)
- Robust Preference Optimization through Reward Model Distillation [68.65844394615702]
Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data. We analyze this phenomenon and use distillation to get a better proxy for the true preference distribution over generation pairs. Our results show that distilling from such a family of reward models leads to improved robustness to distribution shift in preference annotations.
arXiv Detail & Related papers (2024-05-29T17:39:48Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss (a minimal sketch of such a combined objective is given after this list).
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Fine-Tuning Language Models with Reward Learning on Policy [68.70065254564642]
Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences.
Despite its popularity, a fixed reward model may become inaccurate off-distribution, since policy optimization continuously shifts the LLM's data distribution.
We propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution.
arXiv Detail & Related papers (2024-03-28T10:02:10Z)
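The entry "Provably Mitigating Overoptimization in RLHF" above describes an objective that adds a supervised learning loss to a preference optimization loss. The sketch below is only a generic illustration of that combination, assuming a standard DPO-style preference term plus an SFT (negative log-likelihood) term on the chosen responses; the names dpo_plus_sft_loss and sft_weight are hypothetical, and the paper's exact formulation and weighting may differ.

```python
import torch
import torch.nn.functional as F

def dpo_plus_sft_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      beta=0.1, sft_weight=1.0):
    """Sketch of a combined objective: a DPO-style preference loss plus an
    SFT regularizer on the chosen responses, intended to counteract reward
    overoptimization. Inputs are per-example sequence log-probabilities
    (summed over tokens) under the policy and the reference model.
    """
    # Implicit reward margins under the DPO parameterization (log-ratios to the reference).
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    preference_loss = -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

    # SFT regularizer: maximize likelihood of the preferred responses
    # (length normalization omitted for brevity in this sketch).
    sft_loss = -policy_chosen_logps.mean()

    return preference_loss + sft_weight * sft_loss

# Toy usage with random sequence log-probabilities for a batch of 4 preference pairs.
if __name__ == "__main__":
    lp = lambda: -torch.rand(4) * 10.0
    print(dpo_plus_sft_loss(lp(), lp(), lp(), lp()).item())
```

The SFT term anchors the policy to the preferred data, so the preference term cannot drive the policy toward degenerate high-reward outputs as easily; sft_weight controls that tradeoff.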
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.