Curiosity-Driven Reinforcement Learning from Human Feedback
- URL: http://arxiv.org/abs/2501.11463v1
- Date: Mon, 20 Jan 2025 12:51:40 GMT
- Title: Curiosity-Driven Reinforcement Learning from Human Feedback
- Authors: Haoran Sun, Yekun Chai, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang,
- Abstract summary: Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models with human preferences, but often at the cost of reduced output diversity.
We introduce curiosity-driven RLHF (CD-RLHF), a framework that incorporates intrinsic rewards for novel states, alongside traditional sparse extrinsic rewards.
We demonstrate the effectiveness of CD-RLHF through extensive experiments on a range of tasks, including text summarization and instruction following.
- Score: 56.45486828254951
- License:
- Abstract: Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but often at the cost of reduced output diversity. This trade-off between diversity and alignment quality remains a significant challenge. Drawing inspiration from curiosity-driven exploration in reinforcement learning, we introduce curiosity-driven RLHF (CD-RLHF), a framework that incorporates intrinsic rewards for novel states, alongside traditional sparse extrinsic rewards, to optimize both output diversity and alignment quality. We demonstrate the effectiveness of CD-RLHF through extensive experiments on a range of tasks, including text summarization and instruction following. Our approach achieves significant gains in diversity on multiple diversity-oriented metrics while maintaining alignment with human preferences comparable to standard RLHF. We make our code publicly available at https://github.com/ernie-research/CD-RLHF.
Related papers
- Provably Efficient RLHF Pipeline: A Unified View from Contextual Bandits [59.30310692855397]
We propose a unified framework for the RLHF pipeline from the view of contextual bandits.
We decompose the RLHF process into two distinct stages: (post-)training and deployment.
We then develop novel algorithms for each stage, demonstrating significant improvements in both statistical and computational efficiency.
arXiv Detail & Related papers (2025-02-11T02:36:01Z) - Low-Rank Contextual Reinforcement Learning from Heterogeneous Human Feedback [12.373566593905792]
Reinforcement learning from human feedback (RLHF) has become a cornerstone for aligning large language models with human preferences.
We propose a Low-rank Contextual RLHF framework that integrates contextual information to better model heterogeneous feedback.
Our approach builds on a contextual preference model, leveraging the intrinsic low-rank structure of the interaction between user contexts and query-answer pairs.
arXiv Detail & Related papers (2024-12-27T04:02:46Z) - Multi-turn Reinforcement Learning from Preference Human Feedback [41.327438095745315]
Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models with human preferences.
Existing methods work by emulating the preferences at the single decision (turn) level.
We develop novel methods for Reinforcement Learning from preference feedback between two full multi-turn conversations.
arXiv Detail & Related papers (2024-05-23T14:53:54Z) - Provable Multi-Party Reinforcement Learning with Diverse Human Feedback [63.830731470186855]
Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences.
We show how traditional RLHF approaches can fail since learning a single reward function cannot capture and balance the preferences of multiple individuals.
We incorporate meta-learning to learn multiple preferences and adopt different social welfare functions to aggregate the preferences across multiple parties.
arXiv Detail & Related papers (2024-03-08T03:05:11Z) - MaxMin-RLHF: Alignment with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data.
We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences.
Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z) - Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback [43.51441849047147]
We introduce Uni-RLHF, a comprehensive system implementation tailored for RLHF.
Uni-RLHF contains three packages: 1) a universal multi-feedback annotation platform, 2) large-scale crowdsourced feedback datasets, and 3) modular offline RLHF baseline implementations.
arXiv Detail & Related papers (2024-02-04T09:40:22Z) - Quality Diversity through Human Feedback: Towards Open-Ended Diversity-Driven Optimization [13.436983663467938]
This paper introduces Quality Diversity through Human Feedback (QDHF), a novel approach that progressively infers diversity metrics from human judgments of similarity among solutions.
Empirical studies show that QDHF significantly outperforms state-of-the-art methods in automatic diversity discovery.
In open-ended generative tasks, QDHF substantially enhances the diversity of text-to-image generation from a diffusion model.
arXiv Detail & Related papers (2023-10-18T16:46:16Z) - A Long Way to Go: Investigating Length Correlations in RLHF [59.49656695716066]
This paper demonstrates, on three diverse settings, that optimizing for response length is a significant factor behind RLHF.
We find improvements in reward to largely be driven by increasing response length, instead of other features.
Even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models.
arXiv Detail & Related papers (2023-10-05T17:38:28Z) - Reinforcement Learning from Diverse Human Preferences [68.4294547285359]
This paper develops a method for crowd-sourcing preference labels and learning from diverse human preferences.
The proposed method is tested on a variety of tasks in DMcontrol and Meta-world.
It has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback.
arXiv Detail & Related papers (2023-01-27T15:18:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.