Enhancing LLM Safety via Constrained Direct Preference Optimization
- URL: http://arxiv.org/abs/2403.02475v1
- Date: Mon, 4 Mar 2024 20:39:24 GMT
- Title: Enhancing LLM Safety via Constrained Direct Preference Optimization
- Authors: Zixuan Liu, Xiaolin Sun, Zizhan Zheng
- Abstract summary: We introduce Constrained DPO (C-DPO), a novel extension of the recently proposed Direct Preference Optimization (DPO) approach for fine-tuning LLMs.
By integrating dual gradient descent and DPO, our method identifies a nearly optimal trade-off between helpfulness and harmlessness without using reinforcement learning.
Empirically, our approach provides a safety guarantee to LLMs that is missing in DPO while achieving significantly higher rewards under the same safety constraint.
- Score: 8.22888921018027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapidly increasing capabilities of large language models (LLMs) raise an
urgent need to align AI systems with diverse human preferences to
simultaneously enhance their usefulness and safety, despite the often
conflicting nature of these goals. To address this important problem, a
promising approach is to enforce a safety constraint at the fine-tuning stage
through a constrained Reinforcement Learning from Human Feedback (RLHF)
framework. This approach, however, is computationally expensive and often
unstable. In this work, we introduce Constrained DPO (C-DPO), a novel extension
of the recently proposed Direct Preference Optimization (DPO) approach for
fine-tuning LLMs that is both efficient and lightweight. By integrating dual
gradient descent and DPO, our method identifies a nearly optimal trade-off
between helpfulness and harmlessness without using reinforcement learning.
Empirically, our approach provides a safety guarantee to LLMs that is missing
in DPO while achieving significantly higher rewards under the same safety
constraint compared to a recently proposed safe RLHF approach.
Warning: This paper contains example data that may be offensive or harmful.
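The abstract's core recipe can be pictured as an outer dual-gradient loop wrapped around an inner DPO-style update on a combined reward r(x, y) = r_help(x, y) - lambda * c_harm(x, y). The snippet below is a minimal, self-contained sketch of that idea under toy assumptions, not the authors' implementation: the rewards, costs, "policy" (a table of log-probabilities), and the hyperparameters BETA, C_LIMIT, and ETA_DUAL are all synthetic placeholders.

```python
# Minimal sketch (NOT the authors' code) of the C-DPO idea: alternate between
# (1) a DPO-style preference update under the combined reward
#     r(x, y) = r_help(x, y) - lambda * c_harm(x, y), and
# (2) dual gradient ascent on lambda to enforce E[c_harm] <= C_LIMIT.
import torch

torch.manual_seed(0)

BETA = 0.1        # DPO temperature (assumed)
C_LIMIT = 0.0     # safety budget on expected harmfulness cost (assumed)
ETA_DUAL = 0.05   # dual step size (assumed)

# Synthetic batch: each prompt has two candidate responses with precomputed
# helpfulness rewards and harmlessness costs (placeholders).
n = 256
r_help = torch.randn(n, 2)
c_harm = torch.rand(n, 2)

# Toy "policy": one learnable log-probability per (prompt, response) pair,
# with a frozen all-zeros copy standing in for the reference model.
logp = torch.nn.Parameter(torch.zeros(n, 2))
logp_ref = torch.zeros(n, 2)
opt = torch.optim.Adam([logp], lr=1e-2)

lam = torch.tensor(1.0)  # dual variable (Lagrange multiplier)

for step in range(200):
    # (1) Preferences induced by the lambda-weighted combined reward.
    combined = r_help - lam * c_harm
    chosen = combined.argmax(dim=1)
    rejected = 1 - chosen
    idx = torch.arange(n)

    # DPO-style logistic loss on the log-ratio margin between chosen/rejected.
    margin = BETA * ((logp[idx, chosen] - logp_ref[idx, chosen])
                     - (logp[idx, rejected] - logp_ref[idx, rejected]))
    loss = -torch.nn.functional.logsigmoid(margin).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

    # (2) Dual gradient ascent: raise lambda when the estimated expected cost
    # of the responses the policy currently prefers exceeds the budget.
    with torch.no_grad():
        policy_pref = logp.argmax(dim=1)
        avg_cost = c_harm[idx, policy_pref].mean()
        lam = torch.clamp(lam + ETA_DUAL * (avg_cost - C_LIMIT), min=0.0)

print(f"final lambda = {lam.item():.3f}")
```

If the measured cost exceeds the budget, lambda grows and the induced preferences tilt toward harmlessness; if the constraint is slack, lambda shrinks toward zero and the update reduces to standard DPO on helpfulness.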
Related papers
- Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization [16.35399722653875]
We propose Rectified Policy Optimization (RePO), which replaces the average safety constraint with stricter (per-prompt) safety constraints.
At the core of RePO is a policy update mechanism driven by rectified policy gradients, which penalizes the strict safety violation of every prompt.
Our experiments on Alpaca-7B demonstrate that RePO improves safety alignment and reduces safety interference compared to baseline methods (a toy contrast between average and per-prompt constraints is sketched after this list).
arXiv Detail & Related papers (2024-10-25T19:08:23Z)
- Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models [94.39278422567955]
Fine-tuning large language models (LLMs) on human preferences has proven successful in enhancing their capabilities.
However, ensuring the safety of LLMs during fine-tuning remains a critical concern.
We propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO) to address this issue.
arXiv Detail & Related papers (2024-08-27T17:31:21Z)
- Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy.
The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.
We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z)
- One-Shot Safety Alignment for Large Language Models via Optimal Dualization [64.52223677468861]
This paper presents a dualization perspective that reduces constrained alignment to an equivalent unconstrained alignment problem.
We do so by pre-optimizing a smooth and convex dual function that has a closed form.
Our strategy leads to two practical algorithms in model-based and preference-based scenarios.
arXiv Detail & Related papers (2024-05-29T22:12:52Z)
- Towards Comprehensive and Efficient Post Safety Alignment of Large Language Models via Safety Patching [77.36097118561057]
SafePatching is a novel framework for comprehensive and efficient post safety alignment (PSA).
SafePatching achieves more comprehensive and efficient PSA than baseline methods.
arXiv Detail & Related papers (2024-05-22T16:51:07Z)
- ICDPO: Effectively Borrowing Alignment Capability of Others via In-context Direct Preference Optimization [24.55845271377532]
Large Language Models rely on Human Preference Alignment to ensure the generation of safe content.
We propose a novel approach called In-Context Direct Preference Optimization (ICDPO).
ICDPO generates well-aligned responses, as estimated by its instant scorer, thereby enhancing the final performance.
arXiv Detail & Related papers (2024-02-14T17:14:34Z)
- Safe RLHF: Safe Reinforcement Learning from Human Feedback [16.69413517494355]
We propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment.
Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, effectively avoiding the crowdworkers' confusion about the tension between the two objectives.
We demonstrate a superior ability to mitigate harmful responses while enhancing model performance.
arXiv Detail & Related papers (2023-10-19T14:22:03Z)
- Enforcing Hard Constraints with Soft Barriers: Safe Reinforcement Learning in Unknown Stochastic Environments [84.3830478851369]
We propose a safe reinforcement learning approach that can jointly learn the environment and optimize the control policy.
Our approach can effectively enforce hard safety constraints and significantly outperform CMDP-based baseline methods in system safety rate, as measured in simulations.
arXiv Detail & Related papers (2022-09-29T20:49:25Z)
- Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement Learning [72.97229770329214]
We introduce a general approach for solving high-dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach, called LBSGD, is based on applying a logarithmic barrier approximation with a carefully chosen step size.
We demonstrate the effectiveness of our approach on minimizing constraint violations in policy tasks in safe reinforcement learning.
arXiv Detail & Related papers (2022-07-21T11:14:47Z)
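The log-barrier idea in the last entry above can be illustrated on a toy constrained problem: descend the barrier objective f(x) - eta * log(-g(x)) while backtracking the step size so the iterate never leaves the strictly feasible region. The objective, constraint, step-size rule, and eta below are illustrative assumptions, not the paper's LBSGD algorithm.

```python
# Toy sketch of a log-barrier gradient method (illustrative only; not LBSGD).
# Minimize f(x) = ||x - target||^2 subject to g(x) = ||x||^2 - 1 <= 0 by
# descending f(x) - eta * log(-g(x)), shrinking the step whenever it would
# leave the strictly feasible region.
import numpy as np

eta = 0.1                      # barrier weight (assumed)
target = np.array([2.0, 0.0])
x = np.array([0.0, 0.0])       # strictly feasible start: g(x) = -1 < 0

def g(x):
    return np.dot(x, x) - 1.0

def barrier_grad(x):
    grad_f = 2.0 * (x - target)
    grad_g = 2.0 * x
    return grad_f + eta * grad_g / (-g(x))   # gradient of -eta * log(-g(x))

for it in range(500):
    d = -barrier_grad(x)
    step = 0.02
    # Backtrack the step until the next iterate stays strictly feasible.
    while g(x + step * d) >= -1e-9:
        step *= 0.5
    x = x + step * d

print("solution:", x, "constraint value:", g(x))
```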
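Similarly, for the RePO entry earlier in this list, the gap between an average safety constraint and a per-prompt rectified penalty can be seen in a few lines. The budget, costs, and use of max(0, .) below are illustrative assumptions, not RePO's actual objective.

```python
# Illustrative contrast (assumed, not RePO's implementation) between an
# average safety constraint and a stricter per-prompt "rectified" penalty:
# the rectifier max(0, .) charges every prompt whose cost exceeds the budget,
# even when the batch average is within budget.
import torch

budget = 0.5                                 # per-prompt safety budget (assumed)
costs = torch.tensor([0.1, 0.2, 0.9, 1.2])   # per-prompt safety costs (toy)

avg_penalty = torch.relu(costs.mean() - budget)          # average constraint
per_prompt_penalty = torch.relu(costs - budget).mean()   # rectified, per prompt

print(f"average-constraint penalty:   {avg_penalty.item():.3f}")        # 0.100
print(f"per-prompt rectified penalty: {per_prompt_penalty.item():.3f}")  # 0.275
```

The batch-average formulation charges only the amount by which the mean cost exceeds the budget, while the rectified per-prompt version penalizes every violating prompt, so safe prompts cannot offset unsafe ones.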