T-REG: Preference Optimization with Token-Level Reward Regularization
- URL: http://arxiv.org/abs/2412.02685v1
- Date: Tue, 03 Dec 2024 18:56:07 GMT
- Title: T-REG: Preference Optimization with Token-Level Reward Regularization
- Authors: Wenxuan Zhou, Shujian Zhang, Lingxiao Zhao, Tao Meng
- Abstract summary: Reinforcement learning from human feedback (RLHF) has been crucial in aligning large language models with human values.
Recent methods have attempted to address the sparsity of sequence-level rewards by introducing token-level rewards.
We propose token-level reward regularization (T-REG), a novel approach that leverages both sequence-level and token-level rewards for preference optimization.
- Abstract: Reinforcement learning from human feedback (RLHF) has been crucial in aligning large language models (LLMs) with human values. Traditionally, RLHF involves generating responses to a query and using a reward model to assign a reward to the entire response. However, this approach relies on a single, sparse reward, which makes it difficult for the model to identify which parts of the sequence contribute most to the final reward. Recent methods have attempted to address this limitation by introducing token-level rewards. However, these methods often rely on either a trained credit-assignment model or AI annotators, raising concerns about the quality and reliability of the rewards. In this paper, we propose token-level reward regularization (T-REG), a novel approach that leverages both sequence-level and token-level rewards for preference optimization. Harnessing the self-refinement capabilities of LLMs, our method uses contrastive prompting to enable LLMs to self-generate token-level rewards. These self-generated rewards then act as reward regularization, guiding the model to distribute sequence-level rewards across tokens more effectively. This facilitates better token-level credit assignment and enhances alignment performance. Experiments on instruction-following benchmarks, including AlpacaEval 2 and Arena-Hard, show that our method consistently outperforms baseline methods by up to 3.8% and 4.4%, respectively. We will release the code and models at https://github.com/wzhouad/T-REG.
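The abstract describes combining a sequence-level preference objective with self-generated token-level rewards used as a regularizer. Below is a minimal sketch of that general idea, assuming a DPO-style sequence objective and token rewards already obtained (e.g., via contrastive prompting); the function name, the squared-error regularizer, and the coefficients `beta` and `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dpo_with_token_regularization(
    policy_logps_w, policy_logps_l,     # [T_w], [T_l] per-token log-probs under the policy (chosen / rejected)
    ref_logps_w, ref_logps_l,           # [T_w], [T_l] per-token log-probs under the frozen reference model
    token_rewards_w, token_rewards_l,   # [T_w], [T_l] self-generated token-level rewards (e.g. in [-1, 1])
    beta: float = 0.1,                  # scale of the implicit rewards (illustrative)
    alpha: float = 0.1,                 # weight of the token-level regularizer (illustrative)
):
    # Per-token implicit rewards: beta * log(pi_theta / pi_ref)
    imp_w = beta * (policy_logps_w - ref_logps_w)
    imp_l = beta * (policy_logps_l - ref_logps_l)

    # Sequence-level DPO loss on the summed implicit rewards
    dpo_loss = -F.logsigmoid(imp_w.sum() - imp_l.sum())

    # Token-level regularization: pull per-token implicit rewards toward the
    # self-generated token rewards (squared error is an illustrative choice)
    reg = F.mse_loss(imp_w, token_rewards_w) + F.mse_loss(imp_l, token_rewards_l)

    return dpo_loss + alpha * reg
```

In this sketch, the per-token implicit rewards are the scaled log-ratios between the policy and the reference model, the DPO loss acts on their sequence-level sums, and the regularizer nudges their token-level pattern toward the self-generated rewards.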
Related papers
- Process Reinforcement through Implicit Rewards
Dense process rewards have proven a more effective alternative to sparse outcome-level rewards in the inference-time scaling of large language models (LLMs).
Dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs, since their fine-grained feedback has the potential to address some inherent issues of outcome rewards.
Their limited adoption in RL can be primarily attributed to the challenge of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive.
We propose PRIME, which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards (a sketch of this idea appears after the list below).
arXiv Detail & Related papers (2025-02-03T15:43:48Z) - Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model [96.20350225621813]
Reinforcement learning from human feedback (RLHF) has been widely adopted to align language models (LMs) with human preferences.
In this paper, we seek to get the best of both sequence-level and token-level rewards by training and utilizing a segment-level reward model.
arXiv Detail & Related papers (2025-01-06T06:17:56Z) - R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback [25.27230140274847]
Reinforcement learning from human feedback (RLHF) provides a paradigm for aligning large language models (LLMs) with human preferences.
This paper proposes a novel reward redistribution method called R3HF, which facilitates a more fine-grained, token-level reward allocation.
arXiv Detail & Related papers (2024-11-13T02:45:21Z) - Optimal Design for Reward Modeling in RLHF [83.3614658277817]
We formalize the reward training model in Reinforcement Learning from Human Feedback.
We frame the selection of an effective dataset as a simple regret minimization task.
We derive bounds on the simple regret under appropriate assumptions.
arXiv Detail & Related papers (2024-10-22T14:36:44Z) - A Critical Look At Tokenwise Reward-Guided Text Generation [23.908449840589284]
We show that reward models trained on full sequences are not compatible with scoring partial sequences.
We propose to train a Bradley-Terry reward model on partial sequences explicitly, and autoregressively sample from the implied tokenwise policy during decoding time.
arXiv Detail & Related papers (2024-06-12T00:19:40Z) - Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning [49.87923965553233]
Reinforcement Learning can lead to reward over-optimization in large language models.
We introduce Reward Calibration from Demonstration (RCfD) to recalibrate the reward objective.
We show that RCfD achieves performance comparable to carefully tuned baselines while mitigating reward over-optimization.
arXiv Detail & Related papers (2024-04-30T09:57:21Z) - Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use the attention weights that the reward model places on the completion tokens to redistribute the reward along the whole completion (a sketch of this idea appears after the list below).
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
arXiv Detail & Related papers (2024-02-01T17:10:35Z)