Measuring memorization in RLHF for code completion
- URL: http://arxiv.org/abs/2406.11715v2
- Date: Fri, 25 Oct 2024 10:40:25 GMT
- Title: Measuring memorization in RLHF for code completion
- Authors: Aneesh Pappu, Billy Porter, Ilia Shumailov, Jamie Hayes,
- Abstract summary: Reinforcement learning with human feedback (RLHF) has become the dominant method to align large models to user preferences.
We analyze how training data memorization can surface and propagate through each phase of RLHF and direct preference learning.
Our work suggests that RLHF, as opposed to direct preference learning, is a safer way to mitigate the risk of regurgitating sensitive preference data when aligning large language models.
- Score: 18.3607188787591
- License:
- Abstract: Reinforcement learning with human feedback (RLHF) has become the dominant method to align large models to user preferences. Unlike fine-tuning, for which there are many studies regarding training data memorization, it is not clear how memorization is affected by or introduced in the RLHF alignment process. Understanding this relationship is important as real user data may be collected and used to align large models; if user data is memorized during RLHF and later regurgitated, this could raise privacy concerns. In addition to RLHF, other methods such as Direct Preference Optimization (DPO) and $\Psi$PO have gained popularity for learning directly from human preferences, removing the need for optimizing intermediary reward models with reinforcement learning. In this work, we analyze how training data memorization can surface and propagate through each phase of RLHF and direct preference learning. We focus our study on code completion models, as code completion is one of the most popular use cases for large language models. We find that RLHF significantly decreases the chance that data used for reward modeling and reinforcement learning is memorized in comparison to directly fine-tuning on this data, but that examples already memorized during the fine-tuning stage of RLHF, will, in the majority of cases, remain memorized after RLHF. In contrast, we find that aligning by learning directly from human preference data via a special case of $\Psi$PO, Identity Preference Optimization (IPO), increases the likelihood that training data is regurgitated compared to RLHF. Our work suggests that RLHF, as opposed to direct preference learning, is a safer way to mitigate the risk of regurgitating sensitive preference data when aligning large language models. We find our conclusions are robust across multiple code completion datasets, tasks, and model scales.
Related papers
- RLHF Workflow: From Reward Modeling to Online RLHF [79.83927049253924]
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report.
RLHF is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature.
We show that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets.
arXiv Detail & Related papers (2024-05-13T15:50:39Z) - Disentangling Length from Quality in Direct Preference Optimization [93.74831404396174]
Reinforcement Learning from Human Feedback (RLHF) has been a crucial component in the recent success of Large Language Models.
RLHF is know to exploit biases in human preferences, such as verbosity.
We develop a principled but simple regularization strategy that prevents length exploitation, while still maintaining improvements in model quality.
arXiv Detail & Related papers (2024-03-28T06:03:47Z) - Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that allows the reward model to make more accurate predictions.
arXiv Detail & Related papers (2024-01-30T00:17:37Z) - Iterative Data Smoothing: Mitigating Reward Overfitting and
Overoptimization in RLHF [79.98542868281471]
Reinforcement Learning from Human Feedback (RLHF) is a technique that aligns language models closely with human-centric values.
It is observed that the performance of the reward model degrades after one epoch of training, and optimizing too much against the learned reward model eventually hinders the true objective.
This paper delves into these issues, leveraging the theoretical insights to design improved reward learning algorithm termed 'Iterative Data Smoothing' (IDS)
arXiv Detail & Related papers (2024-01-29T17:43:42Z) - The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from
Human Feedback [5.037876196534672]
Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) more capable in complex settings.
In this paper, we illustrate the causes of this issue, reviewing relevant literature from model-based reinforcement learning, and argue for solutions.
arXiv Detail & Related papers (2023-10-31T21:52:41Z) - A Long Way to Go: Investigating Length Correlations in RLHF [59.49656695716066]
This paper demonstrates, on three diverse settings, that optimizing for response length is a significant factor behind RLHF.
We find improvements in reward to largely be driven by increasing response length, instead of other features.
Even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models.
arXiv Detail & Related papers (2023-10-05T17:38:28Z) - SLiC-HF: Sequence Likelihood Calibration with Human Feedback [35.74135968442311]
We show how the recently introduced Sequence Likelihood (SLiC) can also be used to effectively learn from human preferences.
Experiments on the TL;DR summarization task show that SLiC-HF significantly improves supervised fine-tuning baselines.
arXiv Detail & Related papers (2023-05-17T17:57:10Z) - RRHF: Rank Responses to Align Language Models with Human Feedback
without tears [69.68672043223249]
InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO)
We propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via a logarithm of conditional probabilities.
We evaluate RRHF on the Helpful and Harmless dataset, demonstrating comparable alignment performance with PPO by reward model score and human labeling.
arXiv Detail & Related papers (2023-04-11T15:53:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.