Beyond Training Objectives: Interpreting Reward Model Divergence in
Large Language Models
- URL: http://arxiv.org/abs/2310.08164v4
- Date: Wed, 7 Feb 2024 11:13:15 GMT
- Title: Beyond Training Objectives: Interpreting Reward Model Divergence in
Large Language Models
- Authors: Luke Marks, Amir Abdullah, Clement Neo, Rauno Arike, Philip Torr, Fazl
Barez
- Abstract summary: Large language models (LLMs) fine-tuned by reinforcement learning from human feedback are becoming more widely deployed.
We coin the term $\textit{Implicit Reward Model}$ (IRM) to refer to the changes that occur to an LLM that result in high-reward generations.
- Score: 8.15890412446096
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) fine-tuned by reinforcement learning from human
feedback (RLHF) are becoming more widely deployed. We coin the term
$\textit{Implicit Reward Model}$ (IRM) to refer to the changes that occur to an
LLM during RLHF that result in high-reward generations. We interpret IRMs, and
measure their divergence from the RLHF reward model used in the fine-tuning
process that induced them. By fitting a linear function to an LLM's IRM, a
reward model with the same type signature as the RLHF reward model is
constructed, allowing for direct comparison. Additionally, we validate our
construction of the IRM through cross-comparison with classifications of
features generated by an LLM based on their relevance to the RLHF reward model.
Better comprehending IRMs can help minimize discrepancies between LLM behavior
and training objectives, which we believe to be an essential component of the
$\textit{safety}$ and $\textit{alignment}$ of LLMs.
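To make the construction concrete, the sketch below fits a linear proxy reward model to LLM features scored by an RLHF reward model. It is a minimal illustration, not the authors' exact procedure: the feature extraction, synthetic data, and ridge regularization are placeholder assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: `features` would be pooled hidden states of the
# fine-tuned LLM for sampled generations, and `rlhf_rewards` the scores the
# RLHF reward model assigned to those same generations.
n_samples, d_model = 512, 64
features = rng.normal(size=(n_samples, d_model))
rlhf_rewards = rng.normal(size=n_samples)

# Fit the linear proxy ("implicit") reward model by ridge-regularized least squares.
lam = 1e-2
w = np.linalg.solve(features.T @ features + lam * np.eye(d_model),
                    features.T @ rlhf_rewards)

def implicit_reward(x: np.ndarray) -> np.ndarray:
    """Proxy reward: a linear function of the LLM's features."""
    return x @ w

# Divergence from the RLHF reward model can then be summarized, e.g. by the
# correlation of the two models' scores on held-out generations.
print(np.corrcoef(implicit_reward(features), rlhf_rewards)[0, 1])
```
Because the fitted map has the same type signature as the RLHF reward model (text features in, scalar reward out), the two can be compared score-for-score on the same generations.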
Related papers
- Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts [23.27203570485055]
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models with human preferences.
We propose a two-stage approach to train a reward model (RM) with multi-dimensional absolute-rating data.
We efficiently trained an ArmoRM with Llama-3 8B and a gating network consisting of a shallow MLP on top of the ArmoRM.
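As a rough illustration of the general idea (not the released ArmoRM implementation), the sketch below mixes per-objective reward scores into one scalar with a shallow gating MLP; all dimensions and objective counts are placeholders.
```python
import torch
import torch.nn as nn

class MultiObjectiveRewardHead(nn.Module):
    """Per-objective reward scores mixed into one scalar by a shallow gating MLP."""

    def __init__(self, d_model: int, n_objectives: int):
        super().__init__()
        # One absolute-rating regression output per objective (e.g. helpfulness, safety).
        self.objective_heads = nn.Linear(d_model, n_objectives)
        # Shallow gating network producing mixing weights that sum to 1.
        self.gate = nn.Sequential(
            nn.Linear(d_model, 128),
            nn.ReLU(),
            nn.Linear(128, n_objectives),
            nn.Softmax(dim=-1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: final hidden state for a prompt+response, shape (batch, d_model).
        scores = self.objective_heads(h)          # (batch, n_objectives)
        weights = self.gate(h)                    # (batch, n_objectives)
        return (weights * scores).sum(dim=-1)     # scalar reward per example

head = MultiObjectiveRewardHead(d_model=64, n_objectives=5)
print(head(torch.randn(2, 64)).shape)  # torch.Size([2])
```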
arXiv Detail & Related papers (2024-06-18T17:58:28Z)
- Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning [49.87923965553233]
Reinforcement Learning can lead to reward over-optimization in large language models.
We introduce Reward Calibration from Demonstration (RCfD) to recalibrate the reward objective.
We show that RCfD achieves comparable performance to carefully tuned baselines while mitigating reward over-optimization (ROO).
arXiv Detail & Related papers (2024-04-30T09:57:21Z)
- Reinforcement Learning from Reflective Feedback (RLRF): Aligning and Improving LLMs via Fine-Grained Self-Reflection [24.435121488662897]
We propose a novel framework: Reinforcement Learning from Reflective Feedback (RLRF).
RLRF employs a self-reflection mechanism to systematically explore and refine LLM responses, then fine-tunes the models via an RL algorithm using the promising responses.
Our experiments across Just-Eval, Factuality, and Mathematical Reasoning demonstrate the efficacy and transformative potential of RLRF.
arXiv Detail & Related papers (2024-03-21T08:57:27Z)
- Proxy-RLHF: Decoupling Generation and Alignment in Large Language Model
with Proxy [47.327200425168314]
Reinforcement Learning from Human Feedback (RLHF) is the prevailing approach to ensure Large Language Models (LLMs) align with human values.
We introduce Proxy-RLHF, which decouples the generation and alignment processes of LLMs.
Our method achieves a comparable level of alignment with only 1% of the training parameters of other methods.
arXiv Detail & Related papers (2024-03-07T07:31:00Z)
- Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that allows the reward model to make more accurate predictions.
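A generic sketch of the ensembling idea (not this paper's specific efficient-ensemble recipe): score a response with several reward models and combine the predictions, discounting disagreement so the estimate stays conservative. The penalty coefficient is an illustrative assumption.
```python
import numpy as np

def ensemble_reward(scores: np.ndarray, disagreement_penalty: float = 1.0) -> float:
    """scores: rewards from each model for one (prompt, response), shape (n_models,)."""
    # Average the models, discounted by how much they disagree.
    return float(scores.mean() - disagreement_penalty * scores.std())

print(ensemble_reward(np.array([0.8, 0.7, 0.9])))  # models agree -> reward kept
print(ensemble_reward(np.array([1.6, 0.1, 0.7])))  # models disagree -> reward discounted
```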
arXiv Detail & Related papers (2024-01-30T00:17:37Z)
- The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from
Human Feedback [5.037876196534672]
Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) more capable in complex settings.
In this paper, we illustrate the causes of this objective mismatch, reviewing relevant literature from model-based reinforcement learning, and argue for solutions.
arXiv Detail & Related papers (2023-10-31T21:52:41Z)
- Language Reward Modulation for Pretraining Reinforcement Learning [61.76572261146311]
We propose leveraging the capabilities of learned reward functions (LRFs) as a pretraining signal for reinforcement learning.
Our VLM pretraining approach, which is a departure from previous attempts to use LRFs, can warmstart sample-efficient learning on robot manipulation tasks.
arXiv Detail & Related papers (2023-08-23T17:37:51Z)
- Direct Preference Optimization: Your Language Model is Secretly a Reward
Model [126.78737228677025]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
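The objective itself is compact enough to sketch: the snippet below computes the standard DPO loss from precomputed sequence log-probabilities, with batching, masking, and reference-model handling omitted.
```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """All inputs are summed log-probs of whole responses, shape (batch,)."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The implicit reward margin between chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # smaller when the policy prefers the chosen response more than the reference does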
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
- Principled Reinforcement Learning with Human Feedback from Pairwise or
$K$-wise Comparisons [79.98542868281473]
We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF).
We show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions.
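For reference, the pairwise-comparison maximum-likelihood objective under the Bradley-Terry-Luce model takes the standard form below; the notation is illustrative rather than copied from the paper.
```latex
% Bradley-Terry-Luce preference model and the pairwise maximum-likelihood
% estimate of the reward parameters:
P(y^w \succ y^l \mid x) = \sigma\big(r_\theta(x, y^w) - r_\theta(x, y^l)\big),
\qquad
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \sum_{i=1}^{n}
    \log \sigma\big(r_\theta(x_i, y_i^w) - r_\theta(x_i, y_i^l)\big)
```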
arXiv Detail & Related papers (2023-01-26T18:07:21Z)
- Improving Rare Word Recognition with LM-aware MWER Training [50.241159623691885]
We introduce LMs in the learning of hybrid autoregressive transducer (HAT) models in the discriminative training framework.
For the shallow fusion setup, we use LMs during both hypotheses generation and loss computation, and the LM-aware MWER-trained model achieves 10% relative improvement.
For the rescoring setup, we learn a small neural module to generate per-token fusion weights in a data-dependent manner.
arXiv Detail & Related papers (2022-04-15T17:19:41Z)