Aligning Large Language Models by On-Policy Self-Judgment
- URL: http://arxiv.org/abs/2402.11253v3
- Date: Tue, 25 Jun 2024 13:39:52 GMT
- Title: Aligning Large Language Models by On-Policy Self-Judgment
- Authors: Sangkyu Lee, Sungdong Kim, Ashkan Yousefpour, Minjoon Seo, Kang Min Yoo, Youngjae Yu
- Abstract summary: Existing approaches for aligning large language models with human preferences face a trade-off: on-policy learning requires a separate reward model (RM).
We present a novel alignment framework, SELF-JUDGE, that does on-policy learning and is parameter efficient.
We show that rejection sampling by itself can further improve performance without an additional evaluator.
- Score: 49.31895979525054
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing approaches for aligning large language models with human preferences face a trade-off: on-policy learning requires a separate reward model (RM). In this paper, we present a novel alignment framework, SELF-JUDGE, that (1) does on-policy learning and (2) is parameter efficient, as it does not require an additional RM for evaluating the samples used in on-policy learning. To this end, we propose Judge-augmented Supervised Fine-Tuning (JSFT) to train a single model to act as both a policy and a judge. Specifically, we view the pairwise judgment task, choosing the better response from a response pair, as a special case of the instruction-following task. The resulting model can judge preferences over responses generated on the fly by the current policy, which is initialized from the model itself. Experimental results show the efficacy of SELF-JUDGE, which outperforms baselines on preference benchmarks. We also show that rejection sampling by itself can further improve performance without an additional evaluator.
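The abstract's two key ideas, casting pairwise judgment as an instruction-following task and using rejection (best-of-n) sampling with the model as its own evaluator, can be illustrated with a minimal sketch. The prompt template and the stub `judge` below are assumptions for illustration, not the paper's exact implementation; a real system would query the policy model with the judgment prompt and parse its answer.

```python
# Hypothetical sketch of SELF-JUDGE-style rejection sampling: the same model
# that generates responses also picks the better of each pair.

JUDGE_TEMPLATE = (
    "Which response to the instruction below is better?\n"
    "Instruction: {instruction}\n"
    "(A) {response_a}\n"
    "(B) {response_b}\n"
    "Answer with A or B."
)

def judge(instruction, response_a, response_b):
    """Stub pairwise judge. A real system would send JUDGE_TEMPLATE to the
    policy model and parse its 'A'/'B' answer; here we simulate a preference
    for the longer response, purely for illustration."""
    prompt = JUDGE_TEMPLATE.format(
        instruction=instruction, response_a=response_a, response_b=response_b
    )
    assert "Instruction:" in prompt  # the prompt is built but not sent anywhere
    return response_a if len(response_a) >= len(response_b) else response_b

def best_of_n(instruction, candidates):
    """Single-elimination tournament: keep the judged winner at each step,
    so selecting from n candidates needs only n - 1 pairwise judgments."""
    best = candidates[0]
    for candidate in candidates[1:]:
        best = judge(instruction, best, candidate)
    return best

picked = best_of_n(
    "Explain overfitting.",
    ["Short.", "A longer, more detailed answer."],
)
```

Because the judge is the policy itself, no separate reward model is needed at inference time, which is the parameter-efficiency claim above.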
Related papers
- Step-level Value Preference Optimization for Mathematical Reasoning [6.318873143509028]
We introduce a novel algorithm called Step-level Value Preference Optimization (SVPO).
Our approach employs Monte Carlo Tree Search (MCTS) to automatically annotate step-level preferences for multi-step reasoning.
From the perspective of learning-to-rank, we train an explicit value model to replicate the behavior of the implicit reward model.
arXiv Detail & Related papers (2024-06-16T09:06:17Z)
- Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment [65.15914284008973]
State-of-the-art techniques such as Reinforcement Learning from Human Feedback (RLHF) often consist of two stages.
1) supervised fine-tuning (SFT), where the model is fine-tuned by learning from human demonstration data.
2) Preference learning, where preference data is used to learn a reward model, which is in turn used by a reinforcement learning step to fine-tune the model.
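The preference-learning stage described above typically fits the reward model with the Bradley-Terry pairwise loss, -log σ(r_chosen - r_rejected). A minimal scalar sketch (my illustration; the cited paper's exact formulation may differ):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss commonly used to train reward models:
    -log(sigmoid(r_chosen - r_rejected)). Scalar rewards for illustration;
    in practice these are batched model outputs."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss is log(2) when the two rewards tie and shrinks toward zero as the reward model scores the chosen response increasingly above the rejected one.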
arXiv Detail & Related papers (2024-05-28T07:11:05Z)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
- Adversarial Batch Inverse Reinforcement Learning: Learn to Reward from Imperfect Demonstration for Interactive Recommendation [23.048841953423846]
We focus on the problem of learning to reward, which is fundamental to reinforcement learning.
Previous approaches introduce additional procedures for learning to reward, thereby increasing the complexity of optimization.
We propose a novel batch inverse reinforcement learning paradigm that achieves the desired properties.
arXiv Detail & Related papers (2023-10-30T13:43:20Z)
- Evaluating the Fairness of Discriminative Foundation Models in Computer Vision [51.176061115977774]
We propose a novel taxonomy for bias evaluation of discriminative foundation models, such as Contrastive Language-Image Pretraining (CLIP).
We then systematically evaluate existing methods for mitigating bias in these models with respect to our taxonomy.
Specifically, we evaluate OpenAI's CLIP and OpenCLIP models for key applications, such as zero-shot classification, image retrieval and image captioning.
arXiv Detail & Related papers (2023-10-18T10:32:39Z)
- SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z)
- Model-Based Simulation for Optimising Smart Reply [3.615981646205045]
Smart Reply (SR) systems present a user with a set of replies, of which one can be selected in place of having to type out a response.
Previous work has focused largely on post-hoc diversification, rather than explicitly learning to predict sets of responses.
We present a novel method, SimSR, that employs model-based simulation to discover high-value response sets.
arXiv Detail & Related papers (2023-05-26T12:04:33Z)
- Small Changes Make Big Differences: Improving Multi-turn Response Selection in Dialogue Systems via Fine-Grained Contrastive Learning [27.914380392295815]
Retrieval-based dialogue response selection aims to find a proper response from a candidate set given a multi-turn context.
We propose a novel Fine-Grained Contrastive (FGC) learning method for the response selection task based on PLMs.
arXiv Detail & Related papers (2021-11-19T11:07:07Z)
- Self-Supervised Reinforcement Learning for Recommender Systems [77.38665506495553]
We propose self-supervised reinforcement learning for sequential recommendation tasks.
Our approach augments standard recommendation models with two output layers: one for self-supervised learning and the other for RL.
Based on such an approach, we propose two frameworks, namely Self-Supervised Q-learning (SQN) and Self-Supervised Actor-Critic (SAC).
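The two-output-layer design described above can be sketched as a shared representation feeding two heads, one producing self-supervised (next-item) logits and one producing Q-values. The pure-Python linear layers below are my illustrative assumption, not the SQN/SAC implementation:

```python
# Minimal sketch (assumed architecture) of a shared encoder output feeding
# two heads, as in SQN: a self-supervised prediction head and an RL Q-value head.

def linear(x, weights):
    """y = W x, with W given as a list of rows. Stands in for a learned
    linear output layer."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights]

def dual_head(features, w_self_supervised, w_q):
    """Apply both output layers to the same shared feature vector, returning
    (self-supervised logits, Q-values)."""
    return linear(features, w_self_supervised), linear(features, w_q)

logits, q_values = dual_head(
    [1.0, 2.0],                    # shared features from the base recommender
    [[1.0, 0.0], [0.0, 1.0]],      # self-supervised head (here: identity)
    [[2.0, 0.0]],                  # Q-value head (one action, for illustration)
)
```

The design choice is that both heads are trained jointly, so the self-supervised signal regularizes the representation that the RL head consumes.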
arXiv Detail & Related papers (2020-06-10T11:18:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.