Alignment with Preference Optimization Is All You Need for LLM Safety
- URL: http://arxiv.org/abs/2409.07772v1
- Date: Thu, 12 Sep 2024 06:10:15 GMT
- Title: Alignment with Preference Optimization Is All You Need for LLM Safety
- Authors: Reda Alami, Ali Khalifa Almansoori, Ahmed Alzubaidi, Mohamed El Amine Seddik, Mugariya Farooq, Hakim Hacid
- Abstract summary: We apply various alignment techniques to the Falcon 11B model using safety datasets.
We achieve a significant boost in global safety score as measured by LlamaGuard 3 8B, competing with state-of-the-art models.
However, this safety improvement comes at the cost of reduced general capabilities, particularly in math.
- Score: 5.063347837245749
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We demonstrate that preference optimization methods can effectively enhance LLM safety. Applying various alignment techniques to the Falcon 11B model using safety datasets, we achieve a significant boost in global safety score (from $57.64\%$ to $99.90\%$) as measured by LlamaGuard 3 8B, competing with state-of-the-art models. On toxicity benchmarks, average scores in adversarial settings dropped from over $0.6$ to less than $0.07$. However, this safety improvement comes at the cost of reduced general capabilities, particularly in math, suggesting a trade-off. We identify noise contrastive alignment (Safe-NCA) as an optimal method for balancing safety and performance. Our study ultimately shows that alignment techniques can be sufficient for building safe and robust models.
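The abstract ships no code, but the general shape of the technique can be pictured with a short sketch. Below is a minimal DPO-style preference loss over (safe, unsafe) response pairs in PyTorch; the function name, argument layout, and the value of beta are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def safety_dpo_loss(pi_safe_logps, pi_unsafe_logps,
                    ref_safe_logps, ref_unsafe_logps, beta=0.1):
    """DPO-style preference loss over (safe, unsafe) response pairs.

    Each argument is a tensor of summed token log-probabilities for a
    batch of responses under the trained policy (pi_*) or a frozen
    reference model (ref_*); beta scales the implicit KL penalty.
    """
    safe_margin = beta * (pi_safe_logps - ref_safe_logps)
    unsafe_margin = beta * (pi_unsafe_logps - ref_unsafe_logps)
    # Push the policy to prefer the safe response over the unsafe one.
    return -F.logsigmoid(safe_margin - unsafe_margin).mean()
```

The Safe-NCA objective the authors single out replaces this pairwise sigmoid with a noise-contrastive formulation; the sketch only conveys the general shape of preference-based safety training.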
Related papers
- SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection [92.38300626647342]
SEAL learns a data ranker via bilevel optimization to up-rank safe, high-quality fine-tuning data and down-rank unsafe or low-quality examples.
Models trained with SEAL demonstrate superior quality over multiple baselines, with win-rate increases of 8.5% and 9.7% over random selection.
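As a rough illustration of the selection step only (not SEAL's bilevel training, which learns the ranker jointly with the model), the hypothetical snippet below ranks candidate fine-tuning examples with a learned scorer and keeps the top slice:

```python
def select_finetuning_data(examples, ranker, keep_fraction=0.5):
    """Toy version of score-based data selection: rank candidate
    fine-tuning examples with a learned safety/quality scorer and
    keep the top slice. In SEAL the ranker is itself trained via
    bilevel optimization, which this sketch omits entirely.
    """
    ranked = sorted(examples, key=ranker, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]
```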
arXiv Detail & Related papers (2024-10-09T22:24:22Z)
- Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models [94.39278422567955]
Fine-tuning large language models (LLMs) on human preferences has proven successful in enhancing their capabilities.
However, ensuring the safety of LLMs during fine-tuning remains a critical concern.
We propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO) to address this issue.
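The abstract does not spell out BFPO's objective; as a loose illustration of a bi-factorial preference signal, one can blend helpfulness and safety ratings into a single score used to order candidate responses. The linear blend and the weight below are assumptions, not the paper's labeling function:

```python
def bifactorial_score(helpfulness, safety, w=0.5):
    """Blend helpfulness and safety ratings into one preference score.
    The linear form and the weight are illustrative assumptions."""
    return (1 - w) * helpfulness + w * safety

# Rank two candidate responses by the blended score; the safer
# response B wins under this weighting despite lower helpfulness.
responses = [{"text": "A", "helpfulness": 0.90, "safety": 0.20},
             {"text": "B", "helpfulness": 0.70, "safety": 0.95}]
preferred = max(responses,
                key=lambda r: bifactorial_score(r["helpfulness"], r["safety"]))
print(preferred["text"])  # -> B
```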
arXiv Detail & Related papers (2024-08-27T17:31:21Z)
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- Safe Reinforcement Learning with Learned Non-Markovian Safety Constraints [15.904640266226023]
We design a safety model that performs credit assignment to assess contributions of partial state-action trajectories on safety.
We derive an effective algorithm for optimizing a safe policy using the learned safety model.
We devise a method to dynamically adapt the tradeoff coefficient between safety reward and safety compliance.
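The last point can be pictured with a generic dual-ascent-style update, sketched below; the rule and learning rate are assumptions rather than the paper's exact scheme:

```python
def update_tradeoff_coef(coef, avg_episode_cost, cost_limit, lr=0.01):
    """Generic dual-ascent-style adaptation of the safety tradeoff
    coefficient: increase the penalty when observed cost exceeds the
    limit, relax it (never below zero) when the policy is safely
    under budget.
    """
    return max(0.0, coef + lr * (avg_episode_cost - cost_limit))
```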
arXiv Detail & Related papers (2024-05-05T17:27:22Z)
- Safety Optimized Reinforcement Learning via Multi-Objective Policy Optimization [3.425378723819911]
Safe reinforcement learning (Safe RL) refers to a class of techniques that aim to prevent RL algorithms from violating constraints.
In this paper, a novel model-free Safe RL algorithm, formulated within the multi-objective policy optimization framework, is introduced.
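A minimal sketch of the multi-objective idea, under the (assumed) simplification that the objectives are collapsed by a fixed weighted scalarization:

```python
import numpy as np

def scalarized_return(rewards, costs, w_reward=1.0, w_cost=2.0):
    """Collapse a multi-objective RL signal (task reward vs. safety
    cost) into one scalar per episode. The weights are illustrative;
    the paper's actual multi-objective formulation is richer than a
    fixed weighted sum.
    """
    return w_reward * np.sum(rewards) - w_cost * np.sum(costs)
```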
arXiv Detail & Related papers (2024-02-23T08:58:38Z)
- Iterative Reachability Estimation for Safe Reinforcement Learning [23.942701020636882]
We propose a new framework, Reachability Estimation for Safe Policy Optimization (RESPO), for safety-constrained reinforcement learning (RL) environments.
In the feasible set where there exist violation-free policies, we optimize for rewards while maintaining persistent safety.
We evaluate the proposed methods on a diverse suite of safe RL environments from Safety Gym, PyBullet, and MuJoCo.
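The feasibility-gated behavior described above can be caricatured in a few lines; RESPO's actual reachability estimator and update are substantially richer:

```python
def gated_step_direction(feasible, grad_reward, grad_violation):
    """Caricature of reachability-gated optimization: inside the
    estimated feasible set, follow the reward gradient; outside it,
    descend on constraint violation instead.
    """
    return grad_reward if feasible else grad_violation
```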
arXiv Detail & Related papers (2023-09-24T02:36:42Z)
- Efficient Exploration Using Extra Safety Budget in Constrained Policy Optimization [15.483557012655927]
We propose an algorithm named Constrained Policy Optimization with Extra Safety Budget (ESB-CPO) to strike a balance between exploration efficiency and constraint satisfaction.
Our method gains remarkable performance improvement under the same cost limit compared with baselines.
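One way to picture an "extra safety budget" is a cost limit that starts relaxed and anneals to the strict value; the linear schedule below is an illustrative assumption, not necessarily ESB-CPO's:

```python
def effective_cost_limit(base_limit, extra_budget, step, decay_steps):
    """Relaxed cost limit with an extra safety budget that is annealed
    away over training: early on the agent may spend more safety
    budget to explore, while the limit tightens to the strict value
    by `decay_steps`.
    """
    remaining = max(0.0, 1.0 - step / decay_steps)
    return base_limit + extra_budget * remaining
```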
arXiv Detail & Related papers (2023-02-28T06:16:34Z)
- Meta-Learning Priors for Safe Bayesian Optimization [72.8349503901712]
We build on a meta-learning algorithm, F-PACOH, capable of providing reliable uncertainty quantification in settings of data scarcity.
As our core contribution, we develop a novel framework for choosing safety-compliant priors in a data-driven manner.
On benchmark functions and a high-precision motion system, we demonstrate that our meta-learned priors accelerate the convergence of safe BO approaches.
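A hypothetical sketch of the safety test such priors feed into, in the SafeOpt style: a candidate is certified safe only when the GP's lower confidence bound clears the threshold. Tighter meta-learned priors shrink the predictive std, so more candidates are certified sooner:

```python
def certified_safe(mean, std, safety_threshold, beta=2.0):
    """SafeOpt-style safety test: a candidate input is certified safe
    only when the GP's lower confidence bound clears the threshold.
    beta widens the bound; its value here is an assumption.
    """
    return mean - beta * std >= safety_threshold
```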
arXiv Detail & Related papers (2022-10-03T08:38:38Z)
- Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement Learning [72.97229770329214]
We introduce a general approach for solving high-dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach, called LBSGD, is based on applying a logarithmic barrier approximation with a carefully chosen step size.
We demonstrate the effectiveness of our approach on minimizing constraint violations in policy optimization tasks for safe reinforcement learning.
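A minimal sketch of the log-barrier surrogate behind this idea (the careful step-size machinery that makes LBSGD work is omitted; eta is an illustrative smoothing parameter):

```python
import numpy as np

def log_barrier_objective(f_val, g_vals, eta=0.1):
    """Log-barrier surrogate for min f(x) s.t. g_i(x) <= 0:
    infeasible points are penalized infinitely, and feasible points
    pay -eta * sum(log(-g_i(x))), which grows as the constraint
    boundary is approached.
    """
    g = np.asarray(g_vals, dtype=float)
    if np.any(g >= 0):  # outside (or on) the constraint boundary
        return np.inf
    return f_val - eta * np.sum(np.log(-g))
```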
arXiv Detail & Related papers (2022-07-21T11:14:47Z)
- Chance-Constrained Trajectory Optimization for Safe Exploration and Learning of Nonlinear Systems [81.7983463275447]
Learning-based control algorithms require data collection with abundant supervision for training.
We present a new approach for optimal motion planning with safe exploration that integrates chance-constrained optimal control with dynamics learning and feedback control.
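The standard Gaussian reformulation of a chance constraint, which this line of work builds on, can be sketched as follows; the variable names and the Gaussian model of the constraint function are illustrative, and the paper couples such constraints with dynamics learning and feedback control:

```python
from statistics import NormalDist

def chance_constraint_satisfied(mean, std, epsilon=0.05):
    """Deterministic surrogate for the chance constraint
    P(g(x) <= 0) >= 1 - epsilon under a Gaussian model of g(x):
    require mean + z_{1-epsilon} * std <= 0.
    """
    z = NormalDist().inv_cdf(1.0 - epsilon)  # ~1.645 for epsilon=0.05
    return mean + z * std <= 0.0
```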
arXiv Detail & Related papers (2020-05-09T05:57:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.