Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
- URL: http://arxiv.org/abs/2408.15313v1
- Date: Tue, 27 Aug 2024 17:31:21 GMT
- Title: Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
- Authors: Wenxuan Zhang, Philip H. S. Torr, Mohamed Elhoseiny, Adel Bibi,
- Abstract summary: Fine-tuning large language models (LLMs) on human preferences has proven successful in enhancing their capabilities.
However, ensuring the safety of LLMs during the fine-tuning remains a critical concern.
We propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO) to address this issue.
- Score: 94.39278422567955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during the fine-tuning remains a critical concern, and mitigating the potential conflicts in safety and helpfulness is costly in RLHF. To address this issue, we propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO), which re-parameterizes a joint RLHF objective of both safety and helpfulness into a single supervised learning objective. In the supervised optimization, a labeling function is used to capture global preferences ranking to balance both safety and helpfulness. To evaluate BFPO, we develop a benchmark including comprehensive discriminative and generative tasks for helpfulness and harmlessness. The results indicate that our method significantly outperforms existing approaches in both safety and helpfulness. Moreover, BFPO eliminates the need for human prompting and annotation in LLM fine-tuning while achieving the same level of safety as methods that heavily rely on human labor, with less than 10% of the computational resources. The training recipes and models will be released.
Related papers
- Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models [24.168387024091082]
Fine-tuning large language models (LLMs) based on human preferences has been effective in improving their performance.
Maintaining safety throughout the fine-tuning process remains a significant challenge.
We propose an Equilibrate RLHF framework that achieves better safety alignment even with fewer training data.
arXiv Detail & Related papers (2025-02-17T08:40:30Z) - Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness [0.0]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning and text generation.
LLMs can inadvertently generate unsafe or biased responses when prompted with problematic inputs.
This research addresses the critical challenge of developing language models that generate both helpful and harmless content.
arXiv Detail & Related papers (2024-11-26T06:52:22Z) - One-Shot Safety Alignment for Large Language Models via Optimal Dualization [64.52223677468861]
This paper presents a perspective of dualization that reduces constrained alignment to an equivalent unconstrained alignment problem.
We do so by pre-optimizing a smooth and convex dual function that has a closed form.
Our strategy leads to two practical algorithms in model-based and preference-based settings.
arXiv Detail & Related papers (2024-05-29T22:12:52Z) - Enhancing LLM Safety via Constrained Direct Preference Optimization [8.22888921018027]
We introduce Constrained DPO (C-DPO), a novel extension of the recently proposed Direct Preference Optimization (DPO) approach for fine-tuning AI systems.
By integrating dual gradient descent and DPO, our method identifies a nearly optimal trade-off between helpfulness and harmlessness without using reinforcement learning.
Empirically, our approach provides a safety guarantee to LLMs that is missing in DPO while achieving significantly higher rewards under the same safety constraint.
arXiv Detail & Related papers (2024-03-04T20:39:24Z) - Safe RLHF: Safe Reinforcement Learning from Human Feedback [16.69413517494355]
We propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment.
Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, effectively avoiding the crowdworkers' confusion about the tension.
We demonstrate a superior ability to mitigate harmful responses while enhancing model performance.
arXiv Detail & Related papers (2023-10-19T14:22:03Z) - Secrets of RLHF in Large Language Models Part I: PPO [81.01936993929127]
Large language models (LLMs) have formulated a blueprint for the advancement of artificial general intelligence.
reinforcement learning with human feedback (RLHF) emerges as the pivotal technological paradigm underpinning this pursuit.
In this report, we dissect the framework of RLHF, re-evaluate the inner workings of PPO, and explore how the parts comprising PPO algorithms impact policy agent training.
arXiv Detail & Related papers (2023-07-11T01:55:24Z) - Provably Efficient Iterated CVaR Reinforcement Learning with Function
Approximation and Human Feedback [57.6775169085215]
Risk-sensitive reinforcement learning aims to optimize policies that balance the expected reward and risk.
We present a novel framework that employs an Iterated Conditional Value-at-Risk (CVaR) objective under both linear and general function approximations.
We propose provably sample-efficient algorithms for this Iterated CVaR RL and provide rigorous theoretical analysis.
arXiv Detail & Related papers (2023-07-06T08:14:54Z) - A Multiplicative Value Function for Safe and Efficient Reinforcement
Learning [131.96501469927733]
We propose a safe model-free RL algorithm with a novel multiplicative value function consisting of a safety critic and a reward critic.
The safety critic predicts the probability of constraint violation and discounts the reward critic that only estimates constraint-free returns.
We evaluate our method in four safety-focused environments, including classical RL benchmarks augmented with safety constraints and robot navigation tasks with images and raw Lidar scans as observations.
arXiv Detail & Related papers (2023-03-07T18:29:15Z) - Evaluating Model-free Reinforcement Learning toward Safety-critical
Tasks [70.76757529955577]
This paper revisits prior work in this scope from the perspective of state-wise safe RL.
We propose Unrolling Safety Layer (USL), a joint method that combines safety optimization and safety projection.
To facilitate further research in this area, we reproduce related algorithms in a unified pipeline and incorporate them into SafeRL-Kit.
arXiv Detail & Related papers (2022-12-12T06:30:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.