SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection
- URL: http://arxiv.org/abs/2410.07471v2
- Date: Fri, 11 Oct 2024 01:05:22 GMT
- Title: SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection
- Authors: Han Shen, Pin-Yu Chen, Payel Das, Tianyi Chen
- Abstract summary: SEAL learns a data ranker via bilevel optimization to up-rank safe, high-quality fine-tuning data and down-rank unsafe or low-quality data.
Models trained with SEAL demonstrate superior quality over multiple baselines, with 8.5% and 9.7% win-rate increases over random selection on Llama-3-8b-Instruct and Merlinite-7b, respectively.
- Score: 92.38300626647342
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning on task-specific data to boost downstream performance is a crucial step for leveraging Large Language Models (LLMs). However, previous studies have demonstrated that fine-tuning the model on a handful of adversarial samples, or even on benign data, can greatly compromise its pre-equipped alignment and safety capabilities. In this work, we propose SEAL, a novel framework to enhance safety in LLM fine-tuning. SEAL learns a data ranker via bilevel optimization to up-rank safe, high-quality fine-tuning data and down-rank unsafe or low-quality data. Models trained with SEAL demonstrate superior quality over multiple baselines, with 8.5% and 9.7% win-rate increases over random selection on Llama-3-8b-Instruct and Merlinite-7b, respectively. Our code is available on GitHub at https://github.com/hanshen95/SEAL.
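The abstract describes SEAL's core mechanism at a high level: a learned data ranker scores candidate fine-tuning examples so that safe, high-quality data is kept and unsafe or low-quality data is filtered out. As a rough illustration only, the sketch below shows how such a ranker might be applied at selection time; the HuggingFace-style `ranker`/`tokenizer` interface and the `keep_ratio` parameter are assumptions for illustration, not the authors' implementation (see the linked repository for the actual code).

```python
import torch

def select_finetuning_data(examples, ranker, tokenizer, keep_ratio=0.5):
    """Rank candidate fine-tuning examples with a learned data ranker and
    keep the top fraction. Sketch only: `ranker` is assumed to be a
    sequence-classification model emitting one safety/quality score per
    example, which need not match SEAL's actual interface."""
    scores = []
    with torch.no_grad():
        for ex in examples:
            inputs = tokenizer(ex["prompt"] + ex["response"],
                               return_tensors="pt", truncation=True)
            # Assumed convention: higher score = safer / higher quality.
            scores.append(ranker(**inputs).logits.squeeze().item())
    order = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
    keep = order[: max(1, int(keep_ratio * len(examples)))]
    return [examples[i] for i in keep]
```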
Related papers
- Efficient Safety Retrofitting Against Jailbreaking for LLMs [0.4711628883579317]
Direct Preference Optimization (DPO) is an efficient alignment technique that steers LLMs towards preferable outputs by training on preference data.
This paper examines DPO's effectiveness for improving model safety against jailbreaking attacks while minimizing data requirements and training costs.
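For context, the DPO objective this entry builds on trains the policy to prefer chosen over rejected responses relative to a frozen reference model. A minimal sketch of the standard DPO loss (not this paper's specific training setup) is:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss from per-sequence log-probabilities.
    Each argument is a tensor of summed token log-probs, shape (batch,)."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```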
arXiv Detail & Related papers (2025-02-19T10:33:18Z)
- MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [59.536850459059856]
We introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs.
We propose several key innovations to improve the quality of reward models and the efficiency of alignment algorithms.
Our approach is rigorously evaluated across 10 distinct dimensions and 27 benchmarks.
arXiv Detail & Related papers (2025-02-14T18:59:51Z)
- BARE: Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation [71.46236155101032]
We propose Base-Refine (BARE), a synthetic data generation method that combines the diversity of base models with the quality of instruct-tuned models.
We show that fine-tuning with BARE-generated data achieves a 101% improvement over instruct-only data on GSM8K and an 18.4% improvement over SOTA methods on RAFT.
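As a rough sketch of the base-then-refine idea described in this entry, the snippet below draws diverse drafts from a base model and has an instruction-tuned model rewrite them; the `generate` helper and the refinement prompt are hypothetical stand-ins, not the paper's pipeline.

```python
def bare_generate(base_model, instruct_model, seed_prompt, n_candidates=8):
    """Sketch of Base-Refine: sample diverse drafts from a base model, then
    have an instruction-tuned model rewrite each draft for quality.
    `generate` is a hypothetical text-in/text-out helper."""
    drafts = [base_model.generate(seed_prompt, temperature=1.0)
              for _ in range(n_candidates)]
    refined = [instruct_model.generate(
                   "Improve the following example so it is clear and correct:\n" + d)
               for d in drafts]
    return refined
```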
arXiv Detail & Related papers (2025-02-03T00:12:40Z)
- Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging [43.44112117935541]
Fine-tuning large language models (LLMs) for downstream tasks often leads to safety degradation in safety-aligned LLMs.
We propose a method that maintains the inherent safety of LLMs while enhancing their downstream task performance.
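One common way to realize pre-/post-tuning merging is a simple linear interpolation between the safety-aligned weights and the task fine-tuned weights; the sketch below assumes that interpolation scheme, which may differ from the paper's actual merging method.

```python
import torch

@torch.no_grad()
def merge_models(pre_tuned, post_tuned, alpha=0.5):
    """Interpolate parameters of the safety-aligned model (`pre_tuned`)
    and the task fine-tuned model (`post_tuned`), in place on `post_tuned`.
    alpha=1.0 keeps only the fine-tuned weights; alpha=0.0 keeps only the
    original safety-aligned weights. (Assumed scheme, not the paper's.)"""
    pre_state = pre_tuned.state_dict()
    merged = {name: (1 - alpha) * pre_state[name] + alpha * param
              for name, param in post_tuned.state_dict().items()}
    post_tuned.load_state_dict(merged)
    return post_tuned
```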
arXiv Detail & Related papers (2024-12-27T08:03:22Z)
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
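A minimal sketch of the dual-model filtering step described above, assuming each judge exposes a hypothetical `score` method returning a value in [0, 1]; the thresholds and interface are illustrative, not the framework's actual design.

```python
def filter_with_dual_judges(candidates, quality_judge, difficulty_judge,
                            quality_threshold=0.7, difficulty_threshold=0.5):
    """Keep instruction samples that two separate judge models rate as both
    high quality and sufficiently difficult. Sketch only; `score` is a
    hypothetical helper, not part of the Star-Agents framework's API."""
    return [sample for sample in candidates
            if quality_judge.score(sample) >= quality_threshold
            and difficulty_judge.score(sample) >= difficulty_threshold]
```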
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion: adaptively setting the label-smoothing value during training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
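A minimal sketch of per-sample, uncertainty-scaled label smoothing in a cross-entropy loss, as this entry describes; the mapping from uncertainty to smoothing strength (`max_smoothing * uncertainty`) is an assumption for illustration, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def uncertainty_aware_loss(logits, targets, uncertainties, max_smoothing=0.2):
    """Cross-entropy with per-sample label smoothing scaled by an
    uncertainty estimate in [0, 1].
    logits: (batch, vocab), targets: (batch,), uncertainties: (batch,)."""
    num_classes = logits.size(-1)
    smoothing = max_smoothing * uncertainties                     # (batch,)
    log_probs = F.log_softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, num_classes).float()
    # Spread `smoothing` probability mass uniformly over all classes.
    soft_targets = (1.0 - smoothing).unsqueeze(-1) * one_hot \
                   + (smoothing / num_classes).unsqueeze(-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```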
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models [39.56233272612982]
Current vision large language models (VLLMs) exhibit remarkable capabilities yet are prone to generate harmful content and are vulnerable to jailbreaking attacks.
Our initial analysis finds that this is due to the presence of harmful data during vision-language instruction fine-tuning.
To address this issue, we first curate a vision-language safe instruction-following dataset VLGuard covering various harmful categories.
arXiv Detail & Related papers (2024-02-03T16:43:42Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
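A rough sketch of one self-play round as described in this entry: a frozen copy of the model from the previous iteration plays the opponent by generating responses to prompts from the SFT data, and the current model is updated to prefer the human-written responses over its own generations. The `generate` and `preference_update` helpers are hypothetical stand-ins, and the paper's exact objective may differ.

```python
def spin_iteration(model, opponent_model, sft_dataset, preference_update):
    """One self-play fine-tuning round (sketch). `opponent_model` is a frozen
    copy of the previous-iteration model; `preference_update` is a hypothetical
    DPO-style training step that prefers `chosen` over `rejected`."""
    for example in sft_dataset:
        prompt, human_response = example["prompt"], example["response"]
        self_play_response = opponent_model.generate(prompt)  # opponent's move
        preference_update(model, prompt,
                          chosen=human_response, rejected=self_play_response)
    return model
```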
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
- Safer-Instruct: Aligning Language Models with Automated Preference Data [20.177660013450176]
Reinforcement learning from human feedback is a vital strategy for enhancing the capabilities of language models.
We present Safer-Instruct, a novel pipeline for automatically constructing large-scale preference data.
Our approach leverages reversed instruction tuning, instruction induction, and expert model evaluation to efficiently generate high-quality preference data.
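A heavily simplified sketch of the reversed-instruction idea described above: infer an instruction from an existing response, then pair an expert model's answer (preferred) with a weaker model's answer (dispreferred) to form preference data. Every helper here (`generate`, the prompt wording, the model roles) is a hypothetical stand-in, not the Safer-Instruct pipeline's actual components.

```python
def build_preference_example(response, instruction_model, expert_model, weak_model):
    """Sketch: reverse a response into an instruction, then collect a
    preferred/dispreferred answer pair for that instruction.
    All models are hypothetical text-in/text-out stand-ins."""
    instruction = instruction_model.generate(
        "Write an instruction that the following text would answer:\n" + response)
    return {
        "instruction": instruction,
        "chosen": expert_model.generate(instruction),    # assumed preferred
        "rejected": weak_model.generate(instruction),    # assumed dispreferred
    }
```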
arXiv Detail & Related papers (2023-11-15T04:22:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.