SLiC-HF: Sequence Likelihood Calibration with Human Feedback
- URL: http://arxiv.org/abs/2305.10425v1
- Date: Wed, 17 May 2023 17:57:10 GMT
- Title: SLiC-HF: Sequence Likelihood Calibration with Human Feedback
- Authors: Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh,
Peter J. Liu
- Abstract summary: We show how the recently introduced Sequence Likelihood Calibration (SLiC) can also be used to effectively learn from human preferences.
Experiments on the TL;DR summarization task show that SLiC-HF significantly improves supervised fine-tuning baselines.
- Score: 35.74135968442311
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning from human feedback has been shown to be effective at aligning
language models with human preferences. Past work has often relied on
Reinforcement Learning from Human Feedback (RLHF), which optimizes the language
model using reward scores assigned from a reward model trained on human
preference data. In this work we show how the recently introduced Sequence
Likelihood Calibration (SLiC) can also be used to effectively learn from human
preferences (SLiC-HF). Furthermore, we demonstrate this can be done with human
feedback data collected for a different model, similar to off-policy, offline
RL data. Automatic and human evaluation experiments on the TL;DR summarization
task show that SLiC-HF significantly improves supervised fine-tuning baselines.
Furthermore, SLiC-HF presents a competitive alternative to the PPO RLHF
implementation used in past work while being much simpler to implement, easier
to tune and more computationally efficient in practice.
Related papers
- Self-Evolved Reward Learning for LLMs [45.6910747154447]
Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences.
We propose Self-Evolved Reward Learning (SER), a novel approach where the RM generates additional training data to iteratively improve itself.
Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance.
arXiv Detail & Related papers (2024-11-01T07:29:03Z)
- How to Evaluate Reward Models for RLHF [51.31240621943791]
We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback).
We build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks.
We launch an end-to-end RLHF experiment on a large-scale crowdsourced human preference platform to view real reward model downstream performance as ground truth.
arXiv Detail & Related papers (2024-10-18T21:38:21Z)
- Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that allows the reward model to make more accurate predictions.
arXiv Detail & Related papers (2024-01-30T00:17:37Z)
- Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF [79.98542868281471]
Reinforcement Learning from Human Feedback (RLHF) is a technique that aligns language models closely with human-centric values.
It is observed that the performance of the reward model degrades after one epoch of training, and optimizing too much against the learned reward model eventually hinders the true objective.
This paper delves into these issues, leveraging theoretical insights to design an improved reward learning algorithm termed 'Iterative Data Smoothing' (IDS).
arXiv Detail & Related papers (2024-01-29T17:43:42Z)
- Sample Efficient Reinforcement Learning from Human Feedback via Active Exploration [29.935758027209292]
Preference-based feedback is important for many applications in reinforcement learning.
In this work, we take advantage of the fact that one can often choose contexts to obtain human feedback.
We show that our method is able to reach better performance with fewer samples of human preferences than multiple baselines.
arXiv Detail & Related papers (2023-12-01T00:54:02Z)
- Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions.
CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
arXiv Detail & Related papers (2023-10-20T16:37:56Z)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears [69.68672043223249]
InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO).
We propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via a logarithm of conditional probabilities.
We evaluate RRHF on the Helpful and Harmless dataset, demonstrating comparable alignment performance with PPO by reward model score and human labeling.
arXiv Detail & Related papers (2023-04-11T15:53:40Z)
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback [8.409764908043396]
We apply preference modeling and reinforcement learning from human feedback to finetune language models to act as helpful assistants.
We find this alignment training improves performance on almost all NLP evaluations.
We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data.
arXiv Detail & Related papers (2022-04-12T15:02:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.