Related papers: Policy Filtration in RLHF to Fine-Tune LLM for Code Generation

Policy Filtration in RLHF to Fine-Tune LLM for Code Generation

URL: http://arxiv.org/abs/2409.06957v1
Date: Wed, 11 Sep 2024 02:40:38 GMT
Title: Policy Filtration in RLHF to Fine-Tune LLM for Code Generation
Authors: Wei Shen, Chuheng Zhang,
Abstract summary: Reinforcement learning from human feedback (RLHF) is one of the key techniques that helps large language models (LLMs) to follow instructions and provide harmless responses. While direct policy optimization methods exist, state-of-the-art LLMs adopt RL-based methods (usually PPO) in RLHF to train the policy to generate good responses guided by a reward model learned from preference data. We find that the reliability of the reward model varies across responses assigned with different rewards. This motivates us to filter the samples whose rewards may be unreliable to improve signal-to-noise ratio during policy learning
Score: 13.2216273705657
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning from human feedback (RLHF) is one of the key techniques that helps large language models (LLMs) to follow instructions and provide helpful and harmless responses. While direct policy optimization methods exist, state-of-the-art LLMs adopt RL-based methods (usually PPO) in RLHF to train the policy to generate good responses guided by a reward model learned from preference data. The main challenge of these methods is the inaccuracy of the intermediate reward model, especially in code generation tasks that require long and complex reasoning to score a response. We find that the reliability of the reward model varies across responses assigned with different rewards. This motivates us to filter the samples whose rewards may be unreliable to improve signal-to-noise ratio during policy learning, resulting in Policy Filtration for Proximal Policy Optimization (PF-PPO). To choose a proper policy filtration strategy for a given reward model, the coefficient of determination ($R^2$) between rewards and actual scores on filtered samples serves as a good metrics and helps us find several promising strategies. We provide extensive experiments to validate the effectiveness of PF-PPO in code generation tasks, and find that some variants of PF-PPO are highly effective and achieve new state-of-the-art performance across 7-billion-parameter models on HumanEval, MBPP, and a new and more challenging LeetCode Contest benchmark.

Related papers

SPARK: Synergistic Policy And Reward Co-Evolving Framework [84.22494672256894]
We introduce the Synergistic Policy And Reward Co-Evolving Framework (SPARK), an efficient, on-policy, and stable method that builds on RLVR.<n>Instead of discarding rollouts and correctness data, SPARK recycles this valuable information to simultaneously train the model itself as a generative reward model.<n>We show that SPARK achieves significant performance gains on multiple LLM and LVLM models and multiple reasoning, reward models, and general benchmarks.
arXiv Detail & Related papers (2025-09-26T17:50:12Z)
FlowRL: Matching Reward Distributions for LLM Reasoning [69.88820066093798]
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL)<n>We transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution.
arXiv Detail & Related papers (2025-09-18T17:56:36Z)
Pre-Trained Policy Discriminators are General Reward Models [81.3974586561645]
We propose a scalable pre-training method named Policy Discriminative Learning (POLAR)<n>POLAR trains a reward model (RM) to discern identical policies and discriminate different ones.<n> Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods.
arXiv Detail & Related papers (2025-07-07T16:56:31Z)
Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function.<n>$A$*-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks.<n>It reduces training time by up to 2$times$ and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z)
Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning [11.708197376569016]
Group Relative Policy Optimization ( GRPO) is proposed to compute the advantage for each output by subtracting the mean reward, as the baseline, for all outputs in the group.<n>It can lead to inaccurate advantage estimates in environments with highly noisy rewards, potentially introducing bias.<n>We propose a model, called Kalman Filter Enhanced Group Relative Policy Optimization (KRPO), by using lightweight Kalman filtering to dynamically estimate the latent reward mean and variance.
arXiv Detail & Related papers (2025-05-12T13:09:49Z)
A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a reinforce-like algorithm perspective and analyze its core components. We find that a simple rejection sampling baseline, RAFT, yields competitive performance than GRPO and PPO. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples.
arXiv Detail & Related papers (2025-04-15T16:15:02Z)
Soft Policy Optimization: Online Off-Policy RL for Sequence Models [42.95110169230739]
Post-training of language models is almost exclusively done using on-policy methods such as PPO. SPO is a simple, scalable and principled Soft RL method for sequence model policies that can learn from arbitrary online and offline trajectories.
arXiv Detail & Related papers (2025-03-07T14:23:40Z)
Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective [31.956232187102465]
This paper studies how to transfer knowledge from those imperfect reward models in online RLHF. We propose novel transfer learning principles and a theoretical algorithm with provable benefits compared to standard online learning.
arXiv Detail & Related papers (2025-02-26T16:03:06Z)
Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance [52.65461207786633]
Policy-based Reinforcement Learning from Human Feedback is essential for aligning large language models with human preferences.<n>It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance.<n>We propose textbfDecoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained emphglobal value model (GVM)
arXiv Detail & Related papers (2025-02-24T08:11:33Z)
MPPO: Multi Pair-wise Preference Optimization for LLMs with Arbitrary Negative Samples [22.521746860874305]
This study introduces the MPPO algorithm, which leverages the average likelihood of model responses to fit the reward function.<n>Through a comparison of Point-wise, Pair-wise, and List-wise implementations, we found that the Pair-wise approach achieves the best performance.<n> Experimental results demonstrate MPPO's outstanding performance across various benchmarks.
arXiv Detail & Related papers (2024-12-13T14:18:58Z)
VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment [66.80143024475635]
We propose VinePPO, a straightforward approach to compute unbiased Monte Carlo-based estimates. We show that VinePPO consistently outperforms PPO and other RL-free baselines across MATH and GSM8K datasets.
arXiv Detail & Related papers (2024-10-02T15:49:30Z)
The Perfect Blend: Redefining RLHF with Mixture of Judges [68.58426626501883]
Reinforcement learning from human feedback (RLHF) has become the leading approach for fine-tuning large language models (LLM) Applying RLHF for MTL currently requires careful tuning of the weights for reward model and data combinations. We introduce a novel post-training paradigm which we called Constrained Generative Policy Optimization (CGPO)
arXiv Detail & Related papers (2024-09-30T15:06:53Z)
Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference [17.76565371753346]
This paper develops two RLHF algorithms without reward inference. The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator. Our results show there exist provably efficient methods to solve general RLHF problems without reward inference.
arXiv Detail & Related papers (2024-09-25T22:20:11Z)
Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning [55.65738319966385]
We propose a novel online algorithm, iterative Nash policy optimization (INPO) Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses. With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard.
arXiv Detail & Related papers (2024-06-30T08:00:34Z)
Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion [44.95386817008473]
We introduce Contrastive Policy Gradient, or CoPG, a simple and mathematically principled new RL algorithm that can estimate the optimal policy even from off-policy data. We show this approach to generalize the direct alignment method IPO (identity preference optimization) and classic policy gradient. We experiment with the proposed CoPG on a toy bandit problem to illustrate its properties, as well as for finetuning LLMs on a summarization task.
arXiv Detail & Related papers (2024-06-27T14:03:49Z)
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study [16.99550556866219]
Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences. In academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO) We show that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions.
arXiv Detail & Related papers (2024-04-16T16:51:53Z)
Fine-Tuning Language Models with Reward Learning on Policy [68.70065254564642]
Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences. Despite its popularity, (fixed) reward models may suffer from inaccurate off-distribution. We propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution.
arXiv Detail & Related papers (2024-03-28T10:02:10Z)
RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models [7.676477609461592]
Reinforcement learning from human feedback (RLHF) has been extensively employed to align large language models with user intent. DPO relies on contrastive responses generated from human annotator and alternative LLM, instead of the policy model. In this paper, we address both challenges by systematically combining sampling rejection (RS) and DPO. Our proposed method effectively fine-tunes LLMs with limited resource environments, leading to improved alignment with user intent.
arXiv Detail & Related papers (2024-02-15T16:00:58Z)
Information Directed Reward Learning for Reinforcement Learning [64.33774245655401]
We learn a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible. In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types. We support our findings with extensive evaluations in multiple environments and with different types of queries.
arXiv Detail & Related papers (2021-02-24T18:46:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.