SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection
- URL: http://arxiv.org/abs/2410.07471v2
- Date: Fri, 11 Oct 2024 01:05:22 GMT
- Title: SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection
- Authors: Han Shen, Pin-Yu Chen, Payel Das, Tianyi Chen
- Abstract summary: SEAL learns a data ranker based on bilevel optimization to up-rank safe, high-quality fine-tuning data and down-rank unsafe or low-quality data.
Models trained with SEAL demonstrate superior quality over multiple baselines, with 8.5% and 9.7% win-rate increases over random selection.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning on task-specific data to boost downstream performance is a crucial step for leveraging Large Language Models (LLMs). However, previous studies have demonstrated that fine-tuning a model on a few adversarial samples, or even on benign data, can greatly compromise the model's pre-equipped alignment and safety capabilities. In this work, we propose SEAL, a novel framework to enhance safety in LLM fine-tuning. SEAL learns a data ranker based on bilevel optimization to up-rank safe, high-quality fine-tuning data and down-rank unsafe or low-quality data. Models trained with SEAL demonstrate superior quality over multiple baselines, with 8.5% and 9.7% win-rate increases over random selection on Llama-3-8b-Instruct and Merlinite-7b, respectively. Our code is available on GitHub at https://github.com/hanshen95/SEAL.
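As a rough, self-contained illustration of the bilevel selection idea, the toy sketch below learns one ranker logit per training sample: the inner step trains a model on the ranker-weighted pool, and the outer step adjusts the logits so that a one-step-unrolled model does well on a small safe reference set. The linear toy model, the one-step unrolling, the synthetic "safe" reference set, and all variable names are illustrative assumptions, not SEAL's actual algorithm, which operates on LLM fine-tuning losses.

```python
# Minimal sketch of bilevel data selection (illustrative; not SEAL's exact method).
import torch

torch.manual_seed(0)
n_train, n_ref, d = 64, 16, 8
w_true = torch.randn(d)
X = torch.randn(n_train, d)
y = X @ w_true
y[:16] += 5 * torch.randn(16)            # corrupt a quarter of the pool ("unsafe" data)
X_ref = torch.randn(n_ref, d)
y_ref = X_ref @ w_true                   # small clean reference set ("safe" data)

scores = torch.zeros(n_train, requires_grad=True)  # ranker: one logit per pool sample
w = torch.zeros(d, requires_grad=True)             # toy inner model (linear regression)
lr_w, lr_s = 0.05, 1.0

for step in range(300):
    weights = torch.sigmoid(scores)
    # Inner objective: ranker-weighted training loss.
    inner = (weights * (X @ w - y) ** 2).mean()
    (g_w,) = torch.autograd.grad(inner, w, create_graph=True)
    w_unrolled = w - lr_w * g_w          # one-step unroll, kept differentiable in `scores`
    # Outer objective: loss of the unrolled model on the safe reference set.
    outer = ((X_ref @ w_unrolled - y_ref) ** 2).mean()
    (g_s,) = torch.autograd.grad(outer, scores)
    with torch.no_grad():
        scores -= lr_s * g_s             # outer update: re-rank the data pool
        w -= lr_w * g_w                  # inner update: train on the re-weighted pool

print("mean weight on corrupted data:", torch.sigmoid(scores[:16]).mean().item())
print("mean weight on clean data:   ", torch.sigmoid(scores[16:]).mean().item())
```

On this toy pool, the outer gradient should push the corrupted samples' weights toward zero while keeping the clean samples' weights high, mirroring the up-rank/down-rank behavior the abstract describes.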
Related papers
- Fantastic LLMs for Preference Data Annotation and How to (not) Find Them [15.776175440446414]
Preference tuning of large language models (LLMs) relies on high-quality human preference data.
Existing methods use trained reward models or proprietary models as judges for preference annotation.
We introduce customized density ratio (CDR), which leverages open-source LLMs for data annotation.
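The summary gives only the high-level idea, but one common way to turn a density ratio into an annotation signal is to score each response by the gap in log-likelihood under a stronger versus a weaker open-source model. The sketch below is a hedged guess at that recipe; the function names, the shared tokenizer, and the exact form of the ratio are assumptions, not necessarily the paper's CDR formulation.

```python
# Hedged sketch of a density-ratio annotator (assumed recipe, not the paper's exact CDR).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def seq_logprob(model, tok, prompt: str, response: str) -> float:
    """Sum of token log-probabilities of `response` given `prompt`.
    (Simplification: assumes tokenizing the prompt alone yields a prefix of
    tokenizing prompt + response.)"""
    ids = tok(prompt + response, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[:, :-1], dim=-1)   # position t predicts token t+1
    tgt = ids[:, 1:]
    tok_lp = logp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1)
    return tok_lp[:, n_prompt - 1:].sum().item()        # response tokens only

def density_ratio_reward(prompt, response, strong_model, weak_model, tok):
    """Reward = log p_strong(response | prompt) - log p_weak(response | prompt).
    Assumes both models share the tokenizer `tok`."""
    return (seq_logprob(strong_model, tok, prompt, response)
            - seq_logprob(weak_model, tok, prompt, response))
```

A pairwise annotation then just labels the response with the higher reward as "chosen".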
arXiv Detail & Related papers (2024-11-04T18:54:39Z)
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion: the label-smoothing value of each training sample is set adaptively according to that sample's uncertainty.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
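Per-sample adaptive label smoothing is easy to sketch. In the hedged example below, a smoothing coefficient in [0, max_smooth] is derived from an uncertainty estimate in [0, 1]; the linear mapping and the uncertainty source are illustrative assumptions rather than UAL's exact formulation.

```python
# Hedged sketch of uncertainty-adaptive label smoothing (illustrative assumptions).
import torch
import torch.nn.functional as F

def ual_loss(logits: torch.Tensor, targets: torch.Tensor,
             uncertainty: torch.Tensor, max_smooth: float = 0.2) -> torch.Tensor:
    """Cross-entropy with per-sample label smoothing.

    logits:      [batch, num_classes]
    targets:     [batch] integer class indices
    uncertainty: [batch] in [0, 1]; more uncertain -> more smoothing
    """
    eps = max_smooth * uncertainty                              # per-sample smoothing
    logp = F.log_softmax(logits, dim=-1)
    nll = -logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # standard CE term
    uniform = -logp.mean(dim=-1)                                # smoothing (uniform) term
    return ((1 - eps) * nll + eps * uniform).mean()

# Usage: the uncertainty could come from, e.g., a reference model's predictive entropy.
logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
uncertainty = torch.rand(4)
print(ual_loss(logits, targets, uncertainty))
```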
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models [39.56233272612982]
Current vision large language models (VLLMs) exhibit remarkable capabilities yet are prone to generate harmful content and are vulnerable to jailbreaking attacks.
Our initial analysis finds that this is due to the presence of harmful data during vision-language instruction fine-tuning.
To address this issue, we first curate VLGuard, a vision-language safe instruction-following dataset covering various harmful categories.
arXiv Detail & Related papers (2024-02-03T16:43:42Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
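The self-play mechanism can be sketched as a DPO-style logistic loss in which the "opponent" is the previous iteration of the model itself: the current model is pushed to assign relatively higher likelihood to the human response than to a response sampled from its predecessor. The helper names and the pseudocode-level training loop below are assumptions for illustration.

```python
# Hedged sketch of a SPIN-style self-play objective (names are illustrative).
import torch.nn.functional as F

def spin_loss(logp_human, logp_self, ref_logp_human, ref_logp_self, beta=0.1):
    """All inputs: [batch] summed response log-probs under the current model
    (logp_*) and the frozen previous-iteration model (ref_logp_*)."""
    margin = beta * ((logp_human - ref_logp_human) - (logp_self - ref_logp_self))
    return -F.logsigmoid(margin).mean()

# One self-play round, at the pseudocode level:
#   1. self_responses = previous_model.generate(prompts)
#   2. compute the four log-prob terms for (human_response, self_response)
#   3. minimize spin_loss; the trained model becomes the next round's previous_model
```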
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
- Safer-Instruct: Aligning Language Models with Automated Preference Data [20.177660013450176]
Reinforcement learning from human feedback is a vital strategy for enhancing the capabilities of language models.
We present Safer-Instruct, a novel pipeline for automatically constructing large-scale preference data.
Our approach leverages reversed instruction tuning, instruction induction, and expert model evaluation to efficiently generate high-quality preference data.
arXiv Detail & Related papers (2023-11-15T04:22:22Z)
- LoBaSS: Gauging Learnability in Supervised Fine-tuning Data [64.27898739929734]
Supervised Fine-Tuning (SFT) serves as a crucial phase in aligning Large Language Models (LLMs) to specific task prerequisites.
We introduce a new dimension in SFT data selection: learnability.
We present the Loss Based SFT Data Selection (LoBaSS) method, which uses data learnability as the principal criterion for the selection of SFT data.
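A loss-based learnability criterion can be sketched as follows, assuming the score is the gap between a sample's loss under the base model and under a fine-tuned reference model (a high gap means the sample is hard for the base model yet demonstrably learnable). Both the exact scoring rule and the helper names are assumptions, not necessarily LoBaSS's.

```python
# Hedged sketch of loss-based SFT data selection (assumed scoring rule).
import torch

def learnability_scores(base_losses: torch.Tensor, ref_losses: torch.Tensor) -> torch.Tensor:
    """Per-sample losses (e.g., mean token NLL) under the base and the
    fine-tuned reference model; higher score = more learnable signal."""
    return base_losses - ref_losses

def select_top_fraction(dataset, base_losses, ref_losses, frac=0.5):
    """Keep the top `frac` of the dataset by learnability score."""
    scores = learnability_scores(base_losses, ref_losses)
    k = max(1, int(frac * len(dataset)))
    idx = torch.topk(scores, k).indices
    return [dataset[i] for i in idx.tolist()]
```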
arXiv Detail & Related papers (2023-10-16T07:26:24Z)
- A Comparative Survey of Deep Active Learning [76.04825433362709]
Active Learning (AL) is a set of techniques for reducing labeling cost by sequentially selecting data samples from a large unlabeled data pool for labeling.
Deep Learning (DL) is data-hungry, and the performance of DL models scales monotonically with more training data.
In recent years, Deep Active Learning (DAL) has risen as a feasible solution for maximizing model performance while minimizing the expensive cost of labeling.
arXiv Detail & Related papers (2022-03-25T05:17:24Z)
- Bayesian Active Learning with Pretrained Language Models [9.161353418331245]
Active Learning (AL) is a method to iteratively select data for annotation from a pool of unlabeled data.
Previous AL approaches have been limited to task-specific models that are trained from scratch at each iteration.
We introduce BALM: Bayesian Active Learning with pretrained language models.
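A generic Bayesian active-learning acquisition step with a pretrained classifier might look like the sketch below: approximate the posterior with Monte Carlo dropout and send the highest-entropy pool examples for labeling. Both the dropout approximation and the entropy acquisition are common choices assumed here for illustration; BALM's specific design may differ.

```python
# Hedged sketch of one Bayesian active-learning acquisition step (assumed choices).
import torch

def mc_dropout_entropy(model, x: torch.Tensor, n_samples: int = 10) -> torch.Tensor:
    """Predictive entropy under Monte Carlo dropout; `model(x)` is assumed to
    return logits of shape [pool, classes]."""
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    mean_p = probs.mean(dim=0)                                # [pool, classes]
    return -(mean_p * mean_p.clamp_min(1e-12).log()).sum(-1)  # predictive entropy

def acquire(model, pool_x: torch.Tensor, budget: int) -> torch.Tensor:
    """Indices of the `budget` most uncertain pool examples to send for labeling."""
    return torch.topk(mc_dropout_entropy(model, pool_x), budget).indices
```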
arXiv Detail & Related papers (2021-04-16T19:07:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.