Understanding Impact of Human Feedback via Influence Functions
- URL: http://arxiv.org/abs/2501.05790v1
- Date: Fri, 10 Jan 2025 08:50:38 GMT
- Title: Understanding Impact of Human Feedback via Influence Functions
- Authors: Taywon Min, Haeone Lee, Hanho Ryu, Yongchan Kwon, Kimin Lee
- Abstract summary: In Reinforcement Learning from Human Feedback (RLHF), it is crucial to learn suitable reward models from human feedback.
Human feedback can often be noisy, inconsistent, or biased, especially when evaluating complex responses.
We propose a compute-efficient approximation method to measure the impact of human feedback on the performance of reward models.
- Score: 25.467337374024197
- Abstract: In Reinforcement Learning from Human Feedback (RLHF), it is crucial to learn suitable reward models from human feedback to align large language models (LLMs) with human intentions. However, human feedback can often be noisy, inconsistent, or biased, especially when evaluating complex responses. Such feedback can lead to misaligned reward signals, potentially causing unintended side effects during the RLHF process. To address these challenges, we explore the use of influence functions to measure the impact of human feedback on the performance of reward models. We propose a compute-efficient approximation method that enables the application of influence functions to LLM-based reward models and large-scale preference datasets. In our experiments, we demonstrate two key applications of influence functions: (1) detecting common forms of labeler bias in human feedback datasets and (2) guiding labelers to refine their strategies to align more closely with expert feedback. By quantifying the impact of human feedback on reward models, we believe that influence functions can enhance feedback interpretability and contribute to scalable oversight in RLHF, helping labelers provide more accurate and consistent feedback. Source code is available at https://github.com/mintaywon/IF_RLHF
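To make the abstract's core idea concrete, below is a minimal sketch of classical influence functions applied to a Bradley-Terry reward model on synthetic preference pairs. This is not the authors' compute-efficient approximation: the inverse-Hessian term is replaced here by an identity (gradient-dot-product) proxy, and all names (make_pairs, bt_loss, the linear reward model) are illustrative assumptions rather than code from the paper's repository.

```python
# Sketch (assumed setup, not the paper's method): first-order influence scores
# for a linear Bradley-Terry reward model on synthetic preference pairs.
import torch

torch.manual_seed(0)
d, n_train, n_val = 8, 64, 16

# Synthetic features for (chosen, rejected) response pairs, labeled by a hidden reward.
w_true = torch.randn(d)
def make_pairs(n):
    a, b = torch.randn(n, d), torch.randn(n, d)
    chosen_is_a = (a @ w_true) > (b @ w_true)
    chosen = torch.where(chosen_is_a[:, None], a, b)
    rejected = torch.where(chosen_is_a[:, None], b, a)
    return chosen, rejected

train, val = make_pairs(n_train), make_pairs(n_val)

# Linear reward model r(x) = w^T x, trained with the Bradley-Terry loss
# -log sigmoid(r(chosen) - r(rejected)).
w = torch.zeros(d, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.1)
def bt_loss(params, pairs):
    chosen, rejected = pairs
    margin = chosen @ params - rejected @ params
    return -torch.nn.functional.logsigmoid(margin).mean()

for _ in range(200):
    opt.zero_grad()
    bt_loss(w, train).backward()
    opt.step()

# Classical influence of a training pair z on the validation loss:
#   I(z) = -grad_val(w)^T H^{-1} grad_z(w).
# Here H^{-1} is crudely approximated by the identity, so the score reduces to
# a gradient dot product; the paper uses a more careful, scalable approximation.
grad_val = torch.autograd.grad(bt_loss(w, val), w)[0].detach()
scores = []
for i in range(n_train):
    pair_i = (train[0][i:i + 1], train[1][i:i + 1])
    g_i = torch.autograd.grad(bt_loss(w, pair_i), w)[0].detach()
    scores.append(-(grad_val @ g_i).item())

# Most positive scores mark pairs whose labels most hurt validation performance,
# i.e. candidates for noisy or biased feedback.
worst = sorted(range(n_train), key=lambda i: scores[i], reverse=True)[:5]
print("training pairs with the most harmful estimated influence:", worst)
```

In the paper's setting, analogous scores over an LLM-based reward model and a large preference dataset are what flag potentially biased labels and guide labelers; the identity-Hessian shortcut above is only a stand-in for their approximation.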
Related papers
- Mapping out the Space of Human Feedback for Reinforcement Learning: A Conceptual Framework [13.949126295663328]
We bridge the gap between machine learning and human-computer interaction efforts by developing a shared understanding of human feedback in interactive learning scenarios.
We introduce a taxonomy of feedback types for reward-based learning from human feedback based on nine key dimensions.
We identify seven quality metrics of human feedback influencing both the human ability to express feedback and the agent's ability to learn from the feedback.
arXiv Detail & Related papers (2024-11-18T17:40:42Z) - Navigating Noisy Feedback: Enhancing Reinforcement Learning with Error-Prone Language Models [8.025808955214957]
This paper studies the advantages and limitations of reinforcement learning from large language model feedback.
We propose a simple yet effective method for soliciting and applying feedback as a potential-based shaping function.
arXiv Detail & Related papers (2024-10-22T19:52:08Z) - Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation [67.88747330066049]
Fine-grained feedback captures nuanced distinctions in image quality and prompt-alignment.
We show that its superiority over coarse-grained feedback is not automatic.
We identify key challenges in eliciting and utilizing fine-grained feedback.
arXiv Detail & Related papers (2024-06-24T17:19:34Z) - When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback [16.540715313676994]
We show that when human feedback is based only on partial observations, it can result in deceptive inflation and overjustification.
We show that sometimes, the human's feedback determines the return function uniquely up to an additive constant, but in other realistic cases, there is irreducible ambiguity.
arXiv Detail & Related papers (2024-02-27T18:32:11Z) - Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z) - UltraFeedback: Boosting Language Models with Scaled AI Feedback [99.4633351133207]
We present UltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset.
Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models.
arXiv Detail & Related papers (2023-10-02T17:40:01Z) - RLHF-Blender: A Configurable Interactive Interface for Learning from
Diverse Human Feedback [9.407901608317895]
We propose RLHF-Blender, an interactive interface for learning from human feedback.
RLHF-Blender provides a modular experimentation framework that enables researchers to investigate the properties and qualities of human feedback.
We discuss a set of concrete research opportunities enabled by RLHF-Blender.
arXiv Detail & Related papers (2023-08-08T15:21:30Z) - Fine-Grained Human Feedback Gives Better Rewards for Language Model
Training [108.25635150124539]
Language models (LMs) often exhibit undesirable text generation behaviors, including generating false, toxic, or irrelevant outputs.
We introduce Fine-Grained RLHF, a framework that enables training and learning from reward functions that are fine-grained in two respects.
arXiv Detail & Related papers (2023-06-02T17:11:37Z) - SLiC-HF: Sequence Likelihood Calibration with Human Feedback [35.74135968442311]
We show how the recently introduced Sequence Likelihood Calibration (SLiC) can also be used to effectively learn from human preferences.
Experiments on the TL;DR summarization task show that SLiC-HF significantly improves supervised fine-tuning baselines.
arXiv Detail & Related papers (2023-05-17T17:57:10Z) - PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via
Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning.
We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
arXiv Detail & Related papers (2021-06-09T14:10:50Z) - FastIF: Scalable Influence Functions for Efficient Model Interpretation
and Debugging [112.19994766375231]
Influence functions approximate the 'influences' of training data points on test predictions.
We present FastIF, a set of simple modifications to influence functions that significantly improves their run-time.
Our experiments demonstrate the potential of influence functions in model interpretation and correcting model errors.
arXiv Detail & Related papers (2020-12-31T18:02:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.