Safe RLHF: Safe Reinforcement Learning from Human Feedback
- URL: http://arxiv.org/abs/2310.12773v1
- Date: Thu, 19 Oct 2023 14:22:03 GMT
- Title: Safe RLHF: Safe Reinforcement Learning from Human Feedback
- Authors: Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu,
Yizhou Wang, Yaodong Yang
- Abstract summary: We propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment.
Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, avoiding crowdworkers' confusion about the tension between the two objectives.
We demonstrate a superior ability to mitigate harmful responses while enhancing model performance.
- Score: 16.69413517494355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the development of large language models (LLMs), striking a balance
between the performance and safety of AI systems has never been more critical.
However, the inherent tension between the objectives of helpfulness and
harmlessness presents a significant challenge during LLM training. To address
this issue, we propose Safe Reinforcement Learning from Human Feedback (Safe
RLHF), a novel algorithm for human value alignment. Safe RLHF explicitly
decouples human preferences regarding helpfulness and harmlessness, avoiding
crowdworkers' confusion about this tension and allowing us to train
separate reward and cost models. We formalize the safety concern of LLMs as an
optimization task of maximizing the reward function while satisfying specified
cost constraints. Leveraging the Lagrangian method to solve this constrained
problem, Safe RLHF dynamically adjusts the balance between the two objectives
during fine-tuning. Through three rounds of fine-tuning with Safe RLHF, we
demonstrate a superior ability to mitigate harmful responses while enhancing
model performance compared to existing value-aligned algorithms.
Experimentally, we fine-tuned the Alpaca-7B using Safe RLHF and aligned it with
collected human preferences, significantly improving its helpfulness and
harmlessness according to human evaluations.
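The constrained formulation in the abstract can be read as: maximize the expected reward-model score subject to the expected cost-model score staying below a threshold, with a Lagrange multiplier that is updated during fine-tuning to trade the two off. The Python sketch below illustrates only that balancing step under stated assumptions; it is not the authors' implementation, and names such as lagrangian_update, cost_limit, and lam_lr, as well as the toy reward/cost numbers, are illustrative.

```python
# Minimal sketch (not the paper's code) of the Lagrangian balance described above:
#   maximize E[R(x, y)]  subject to  E[C(x, y)] <= d,
# relaxed to L(theta, lambda) = E[R] - lambda * (E[C] - d), with dual ascent on lambda.
import numpy as np

def lagrangian_update(rewards, costs, lam, cost_limit=0.0, lam_lr=0.05):
    """Scalarize reward/cost scores for the RL step and update the multiplier."""
    combined = rewards - lam * costs          # signal handed to the policy update
    violation = costs.mean() - cost_limit     # positive when the cost constraint is violated
    lam = max(0.0, lam + lam_lr * violation)  # dual ascent, kept non-negative
    return combined, lam

# Toy loop with random scores standing in for the learned reward and cost models.
rng = np.random.default_rng(0)
lam = 1.0
for step in range(5):
    rewards = rng.normal(1.0, 0.5, size=64)   # reward-model scores (helpfulness)
    costs = rng.normal(0.2, 0.5, size=64)     # cost-model scores (harmfulness)
    combined, lam = lagrangian_update(rewards, costs, lam)
    print(f"step {step}: lambda={lam:.3f}, mean combined signal={combined.mean():.3f}")
```

The design point this sketch captures is that the multiplier grows whenever the cost model reports constraint violations, shifting the combined signal toward harmlessness, and decays back toward zero once the constraint is satisfied.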
Related papers
- Trustworthy Human-AI Collaboration: Reinforcement Learning with Human Feedback and Physics Knowledge for Safe Autonomous Driving [1.5361702135159845]
Reinforcement Learning with Human Feedback (RLHF) has attracted substantial attention due to its potential to enhance training safety and sampling efficiency.
Inspired by the human learning process, we propose Physics-enhanced Reinforcement Learning with Human Feedback (PE-RLHF).
PE-RLHF guarantees the learned policy will perform at least as well as the given physics-based policy, even when human feedback quality deteriorates.
arXiv Detail & Related papers (2024-09-01T22:20:32Z)
- Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models [94.39278422567955]
Fine-tuning large language models (LLMs) on human preferences has proven successful in enhancing their capabilities.
However, ensuring the safety of LLMs during fine-tuning remains a critical concern.
We propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO) to address this issue.
arXiv Detail & Related papers (2024-08-27T17:31:21Z)
- One-Shot Safety Alignment for Large Language Models via Optimal Dualization [64.52223677468861]
This paper presents a perspective of dualization that reduces constrained alignment to an equivalent unconstrained alignment problem.
We do so by pre-optimizing a smooth and convex dual function that has a closed form.
Our strategy leads to two practical algorithms in model-based and preference-based settings.
arXiv Detail & Related papers (2024-05-29T22:12:52Z)
- Enhancing LLM Safety via Constrained Direct Preference Optimization [8.22888921018027]
We introduce Constrained DPO (C-DPO), a novel extension of the recently proposed Direct Preference Optimization (DPO) approach for fine-tuning AI systems.
By integrating dual gradient descent and DPO, our method identifies a nearly optimal trade-off between helpfulness and harmlessness without using reinforcement learning.
Empirically, our approach provides a safety guarantee to LLMs that is missing in DPO while achieving significantly higher rewards under the same safety constraint.
arXiv Detail & Related papers (2024-03-04T20:39:24Z)
- Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles [26.955375398765085]
Reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning large language models (LLMs).
In this paper, we observe the weakness of the KL regularization commonly employed in existing RLHF methods to address overoptimization.
We propose uncertainty-penalized RLHF (UP-RLHF), which incorporates uncertainty regularization during RL-finetuning.
arXiv Detail & Related papers (2023-12-30T14:14:14Z)
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models [62.72318564072706]
Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align Large Language Models (LLMs) with human preferences.
Despite its advantages, RLHF relies on human annotators to rank the text.
We propose RankPoison, a poisoning attack method that selects candidates for preference-rank flipping in order to elicit certain malicious behaviors.
arXiv Detail & Related papers (2023-11-16T07:48:45Z)
- Provably Efficient Iterated CVaR Reinforcement Learning with Function Approximation and Human Feedback [57.6775169085215]
Risk-sensitive reinforcement learning aims to optimize policies that balance the expected reward and risk.
We present a novel framework that employs an Iterated Conditional Value-at-Risk (CVaR) objective under both linear and general function approximations.
We propose provably sample-efficient algorithms for this Iterated CVaR RL and provide rigorous theoretical analysis.
arXiv Detail & Related papers (2023-07-06T08:14:54Z)
- A Multiplicative Value Function for Safe and Efficient Reinforcement Learning [131.96501469927733]
We propose a safe model-free RL algorithm with a novel multiplicative value function consisting of a safety critic and a reward critic.
The safety critic predicts the probability of constraint violation and discounts the reward critic, which estimates only constraint-free returns.
We evaluate our method in four safety-focused environments, including classical RL benchmarks augmented with safety constraints and robot navigation tasks with images and raw Lidar scans as observations.
arXiv Detail & Related papers (2023-03-07T18:29:15Z)
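As a reading aid for the entry above, here is a minimal sketch of a multiplicative value function, assuming the safety critic outputs a violation probability and the reward critic outputs a constraint-free return estimate; the function name and the numbers are illustrative and not taken from the paper.

```python
# Hedged sketch: the safety critic's violation probability scales the reward critic's estimate.
def multiplicative_value(p_violation: float, v_reward: float) -> float:
    """Value is high only when the expected return is high AND the state is likely safe."""
    p_safe = 1.0 - p_violation   # probability of not violating a constraint
    return p_safe * v_reward     # multiplicative (not additive) combination of the two critics

# Toy comparison: a risky high-return state vs. a safer moderate-return state.
risky = multiplicative_value(p_violation=0.60, v_reward=10.0)
safer = multiplicative_value(p_violation=0.05, v_reward=6.0)
print(f"risky high-return state:     {risky:.2f}")  # ~4.00
print(f"safer moderate-return state: {safer:.2f}")  # ~5.70, the safer state scores higher
```

The multiplicative combination means a large expected return cannot compensate for a high predicted chance of violating a constraint, which is the behavior the summary describes.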