Efficient Safety Retrofitting Against Jailbreaking for LLMs
- URL: http://arxiv.org/abs/2502.13603v1
- Date: Wed, 19 Feb 2025 10:33:18 GMT
- Title: Efficient Safety Retrofitting Against Jailbreaking for LLMs
- Authors: Dario Garcia-Gasulla, Anna Arias-Duart, Adrian Tormos, Daniel Hinjos, Oscar Molina-Sedano, Ashwin Kumar Gururajan, Maria Eugenia Cardello,
- Abstract summary: Direct Preference Optimization (DPO) is an efficient alignment technique that steers LLMs towards preferable outputs by training on preference data.
This paper examines DPO's effectiveness in model safety against jailbreaking attacks while minimizing data requirements and training costs.
- Score: 0.4711628883579317
- License:
- Abstract: Direct Preference Optimization (DPO) is an efficient alignment technique that steers LLMs towards preferable outputs by training on preference data, bypassing the need for explicit reward models. Its simplicity enables easy adaptation to various domains and safety requirements. This paper examines DPO's effectiveness in model safety against jailbreaking attacks while minimizing data requirements and training costs. We introduce Egida, a dataset expanded from multiple sources, which includes 27 different safety topics and 18 different attack styles, complemented with synthetic and human labels. This data is used to boost the safety of state-of-the-art LLMs (Llama-3.1-8B/70B-Instruct, Qwen-2.5-7B/72B-Instruct) across topics and attack styles. In addition to safety evaluations, we assess their post-alignment performance degradation in general purpose tasks, and their tendency to over refusal. Following the proposed methodology, trained models reduce their Attack Success Rate by 10%-30%, using small training efforts (2,000 samples) with low computational cost (3\$ for 8B models, 20\$ for 72B models). Safety aligned models generalize to unseen topics and attack styles, with the most successful attack style reaching a success rate around 5%. Size and family are found to strongly influence model malleability towards safety, pointing at the importance of pre-training choices. To validate our findings, a large independent assessment of human preference agreement with Llama-Guard-3-8B is conducted by the authors and the associated dataset Egida-HSafe is released. Overall, this study illustrates how affordable and accessible it is to enhance LLM safety using DPO while outlining its current limitations. All datasets and models are released to enable reproducibility and further research.
Related papers
- Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models [24.168387024091082]
Fine-tuning large language models (LLMs) based on human preferences has been effective in improving their performance.
Maintaining safety throughout the fine-tuning process remains a significant challenge.
We propose an Equilibrate RLHF framework that achieves better safety alignment even with fewer training data.
arXiv Detail & Related papers (2025-02-17T08:40:30Z) - Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection [92.38300626647342]
SEAL learns a data ranker based on the bilevel optimization to up rank the safe and high-quality fine-tuning data and down rank the unsafe or low-quality ones.
Models trained with SEAL demonstrate superior quality over multiple baselines, with 8.5% and 9.7% win rate increase compared to random selection.
arXiv Detail & Related papers (2024-10-09T22:24:22Z) - Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models [8.024771725860127]
Large Language Models (LLMs) remain vulnerable to jailbreak attacks that bypass their safety mechanisms.
We introduce a novel scalable jailbreak attack that preempts the activation of an LLM's safety policies by occupying its computational resources.
arXiv Detail & Related papers (2024-10-05T15:10:01Z) - Alignment with Preference Optimization Is All You Need for LLM Safety [5.063347837245749]
We apply various alignment techniques to the Falcon 11B model using safety datasets.
We achieve a significant boost in global safety score as measured by LlamaGuard 3 8B, competing with state-of-the-art models.
However, this safety improvement comes at the cost of reduced general capabilities, particularly in math.
arXiv Detail & Related papers (2024-09-12T06:10:15Z) - SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models [19.486685336959482]
Large language models (LLMs) continue to advance in capability and influence, ensuring their security and preventing harmful outputs has become crucial.
A promising approach to address these concerns involves training models to automatically generate adversarial prompts for red teaming.
We introduce the $mathbfStextelf-mathbfEtextvolving mathbfAtextdversarial mathbfStextafetyety mathbf(SEAS)$ optimization framework, which enhances security by leveraging data generated by the model itself.
arXiv Detail & Related papers (2024-08-05T16:55:06Z) - What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs)
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models [51.85781332922943]
Federated learning (FL) enables multiple parties to collaboratively fine-tune an large language model (LLM) without the need of direct data sharing.
We for the first time reveal the vulnerability of safety alignment in FedIT by proposing a simple, stealthy, yet effective safety attack method.
arXiv Detail & Related papers (2024-06-15T13:24:22Z) - Do Gradient Inversion Attacks Make Federated Learning Unsafe? [70.0231254112197]
Federated learning (FL) allows the collaborative training of AI models without needing to share raw data.
Recent works on the inversion of deep neural networks from model gradients raised concerns about the security of FL in preventing the leakage of training data.
In this work, we show that these attacks presented in the literature are impractical in real FL use-cases and provide a new baseline attack.
arXiv Detail & Related papers (2022-02-14T18:33:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.