Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack
- URL: http://arxiv.org/abs/2405.18641v5
- Date: Tue, 29 Oct 2024 05:46:55 GMT
- Title: Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack
- Authors: Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
- Abstract summary: Large Language Models (LLMs) with safety alignment can be jail-broken by fine-tuning on a dataset mixed with harmful data.
We show that the jail-broken effect can be mitigated by separating states in the finetuning stage to optimize the alignment and user datasets.
We propose Lazy(i) safety alignment (Lisa), which introduces a proximal term to constrain the drift of each state.
- Score: 7.945893812374361
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies show that Large Language Models (LLMs) with safety alignment can be jail-broken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature, we show that the jail-broken effect can be mitigated by separating states in the fine-tuning stage to optimize the alignment and user datasets. Unfortunately, our subsequent study shows that this simple Bi-State Optimization (BSO) solution experiences convergence instability when the number of steps invested in its alignment state is too small, leading to downgraded alignment performance. By statistical analysis, we show that the excess drift towards consensus could be a probable reason for the instability. To remedy this issue, we propose Lazy(i) safety alignment (Lisa), which introduces a proximal term to constrain the drift of each state. Theoretically, the benefit of the proximal term is supported by convergence analysis, wherein we show that a sufficiently large proximal factor is necessary to guarantee Lisa's convergence. Empirically, our results on four downstream fine-tuning tasks show that Lisa with a proximal term can significantly increase alignment performance while maintaining the LLM's accuracy on the user tasks. Code is available at https://github.com/git-disl/Lisa.
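The bi-state alternation with a proximal term can be sketched on a toy problem. This is a minimal illustration under assumptions, not the authors' implementation: the function names, the plain gradient-descent inner loop, and the toy scalar losses are all illustrative choices; only the structure (alternating states, each anchored by a proximal pull toward its switching point) follows the abstract.

```python
import numpy as np

def proximal_grad_step(theta, grad_fn, anchor, rho, lr=0.1):
    # One inner step of a state's optimization: follow the task gradient
    # while a proximal term rho * (theta - anchor) pulls the iterate back
    # toward the state's anchor, limiting "excess drift" between states.
    g = grad_fn(theta) + rho * (theta - anchor)
    return theta - lr * g

def lisa_sketch(theta0, grad_align, grad_user, rho, rounds=50, steps=5):
    # Alternate between an alignment state and a user fine-tuning state.
    # Each state anchors its proximal term at the parameters it received
    # when the state switch happened.
    theta = theta0
    for _ in range(rounds):
        anchor = theta.copy()
        for _ in range(steps):                 # alignment state
            theta = proximal_grad_step(theta, grad_align, anchor, rho)
        anchor = theta.copy()
        for _ in range(steps):                 # user state
            theta = proximal_grad_step(theta, grad_user, anchor, rho)
    return theta
```

With two quadratic losses whose minima disagree (e.g. gradients `t - 1` and `t + 1`), a larger `rho` keeps the final iterate closer to the switching anchors and damps the oscillation between the two states.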
Related papers
- TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs [7.125400292079228]
Large Language Models (LLMs) fine-tuned to align with human values often exhibit alignment drift. While prior work has behaviorally characterized alignment failure, little is known about the training-time belief sources underlying these failures. We introduce TraceAlign, a unified framework for tracing unsafe completions back to their root causes in the model's training corpus.
arXiv Detail & Related papers (2025-08-04T05:03:35Z) - Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets [64.96967819446553]
This paper investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. High similarity between these datasets significantly weakens safety guardrails, making models more susceptible to jailbreaks. Low similarity between these two types of datasets yields substantially more robust models and thus reduces the harmfulness score by up to 10.33%.
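One simple way to instantiate the representation-similarity measurement the abstract describes is cosine similarity between mean dataset embeddings. This is an illustrative assumption, not the paper's exact metric:

```python
import numpy as np

def dataset_similarity(emb_a, emb_b):
    # Cosine similarity between the mean embeddings of two datasets:
    # a simple proxy for how much an alignment dataset and a downstream
    # fine-tuning dataset overlap in representation space.
    a = emb_a.mean(axis=0)
    b = emb_b.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A score near 1 would flag a fine-tuning set that sits close to the alignment data, the regime the paper associates with weakened guardrails.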
arXiv Detail & Related papers (2025-06-05T17:59:55Z) - Towards safe Bayesian optimization with Wiener kernel regression [0.6554326244334868]
We present a novel error bound based on the recently proposed Wiener kernel regression. We prove that under rather mild assumptions, the proposed error bound is tighter than bounds previously documented in the literature. We draw upon a numerical example to demonstrate the efficacy of the proposed error bound in safe BO.
arXiv Detail & Related papers (2024-11-04T16:43:16Z) - Bayesian scaling laws for in-context learning [72.17734205418502]
In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates.
We show that ICL approximates a Bayesian learner and develop a family of novel Bayesian scaling laws for ICL.
arXiv Detail & Related papers (2024-10-21T21:45:22Z) - Large Continual Instruction Assistant [59.585544987096974]
Continual Instruction Tuning (CIT) is adopted to instruct Large Models to follow human intent data by data. Gradient updates in existing methods heavily degrade performance on previous datasets during the CIT process. We propose a general continual instruction tuning framework to address this challenge.
arXiv Detail & Related papers (2024-10-08T11:24:59Z) - Robust Barycenter Estimation using Semi-Unbalanced Neural Optimal Transport [84.51977664336056]
We propose a novel, scalable approach for estimating the robust continuous barycenter.
Our method is framed as a min-max optimization problem and is adaptable to general cost functions.
arXiv Detail & Related papers (2024-10-04T23:27:33Z) - TS-HTFA: Advancing Time Series Forecasting via Hierarchical Text-Free Alignment with Large Language Models [14.411646409316624]
We introduce Hierarchical Text-Free Alignment (TS-HTFA), a novel method for time-series forecasting.
We replace paired text data with adaptive virtual text based on QR-decomposition word embeddings and learnable prompts.
Experiments on multiple time-series benchmarks demonstrate that TS-HTFA achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-09-23T12:57:24Z) - Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! [65.06450319194454]
Large language models (LLMs) undergo safety alignment to ensure safe conversations with humans.
This paper introduces a training-free attack method capable of reversing safety alignment.
We name this method emulated disalignment (ED) because sampling from this contrastive distribution provably emulates the result of fine-tuning to minimize a safety reward.
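A sketch of the contrastive sampling idea behind emulated disalignment. The exact combination rule below (amplifying what the base model prefers relative to the safety-aligned model) is our reading of the abstract, not the paper's published formula:

```python
import numpy as np

def ed_next_token_dist(logp_base, logp_aligned, alpha):
    # Contrastive distribution (sketch): push sampling toward tokens the
    # pre-trained (base) model prefers relative to the safety-aligned
    # model, logits = logp_base + alpha * (logp_base - logp_aligned),
    # emulating fine-tuning against the safety direction, training-free.
    logits = logp_base + alpha * (logp_base - logp_aligned)
    z = np.exp(logits - logits.max())
    return z / z.sum()
```

At `alpha = 0` this recovers the base model's distribution; increasing `alpha` shifts mass onto tokens the aligned model suppresses.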
arXiv Detail & Related papers (2024-02-19T18:16:51Z) - Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions.
Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z) - The Poison of Alignment [0.0]
We introduce a novel insight into how an instruction-tuned model's performance is affected by the presence of alignment.
We demonstrate that aligned answers significantly worsen the performance of the resulting fine-tuned model on various reasoning benchmarks.
arXiv Detail & Related papers (2023-08-25T15:51:15Z) - WSPAlign: Word Alignment Pre-training via Large-Scale Weakly Supervised Span Prediction [31.96433679860807]
Most existing word alignment methods rely on manual alignment datasets or parallel corpora.
We relax the requirement for correct, fully-aligned, and parallel sentences.
We then use such a large-scale weakly-supervised dataset for word alignment pre-training via span prediction.
arXiv Detail & Related papers (2023-06-09T03:11:42Z) - MaxMatch: Semi-Supervised Learning with Worst-Case Consistency [149.03760479533855]
We propose a worst-case consistency regularization technique for semi-supervised learning (SSL).
We present a generalization bound for SSL consisting of the empirical loss terms observed on labeled and unlabeled training data separately.
Motivated by this bound, we derive an SSL objective that minimizes the largest inconsistency between an original unlabeled sample and its multiple augmented variants.
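The worst-case objective described above can be sketched as follows; the KL divergence and the function names are illustrative assumptions consistent with, but not copied from, the paper:

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    # KL(p || q) between two discrete distributions, clipped for stability.
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def worst_case_consistency(p_orig, p_augmented):
    # MaxMatch-style objective (sketch): among the predictions on the
    # augmented variants, take the one that diverges most from the
    # prediction on the original sample; training minimizes this
    # largest inconsistency rather than an average one.
    return max(kl_div(p_orig, q) for q in p_augmented)
```

Minimizing the maximum (instead of the mean) inconsistency gives the worst-case guarantee the generalization bound motivates.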
arXiv Detail & Related papers (2022-09-26T12:04:49Z) - The Devil is in the Margin: Margin-based Label Smoothing for Network Calibration [21.63888208442176]
Despite the dominant performance of deep neural networks, recent works have shown that they are poorly calibrated.
We provide a unifying constrained-optimization perspective of current state-of-the-art calibration losses.
We propose a simple and flexible generalization based on inequality constraints, which imposes a controllable margin on logit distances.
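One way to read the inequality-constrained formulation is as a penalty on logit distances that exceed a margin. The penalty form below is a hedged sketch of that idea, not the paper's exact loss:

```python
import numpy as np

def margin_logit_penalty(logits, margin):
    # Penalize logit distances to the maximum logit beyond the margin:
    # sum_j max(0, (max_k z_k - z_j) - margin). Distances within the
    # margin are left unconstrained, unlike uniform label smoothing,
    # which pulls all logits together regardless of how close they are.
    d = np.max(logits) - logits
    return float(np.sum(np.maximum(0.0, d - margin)))
```

Added to a cross-entropy term, this imposes a controllable margin on logit distances for calibration.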
arXiv Detail & Related papers (2021-11-30T14:21:47Z) - Feature Space Targeted Attacks by Statistic Alignment [74.40447383387574]
Feature space targeted attacks perturb images by modulating their intermediate feature maps.
The current choice of pixel-wise Euclidean Distance to measure the discrepancy is questionable because it unreasonably imposes a spatial-consistency constraint on the source and target features.
We propose two novel approaches called Pair-wise Alignment Attack and Global-wise Alignment Attack, which attempt to measure similarities between feature maps by high-order statistics.
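A minimal sketch of the statistic-matching idea: compare feature maps through their moments rather than a pixel-wise Euclidean distance. Matching only mean and standard deviation is a simplification for illustration; the paper's attacks use higher-order statistics:

```python
import numpy as np

def global_statistic_alignment(f_src, f_tgt):
    # Global-wise alignment (sketch): measure the discrepancy between
    # feature maps via their first- and second-order moments, which
    # drops the spatial-consistency constraint a pixel-wise Euclidean
    # distance would impose on source and target features.
    return float((f_src.mean() - f_tgt.mean()) ** 2
                 + (f_src.std() - f_tgt.std()) ** 2)
```

Note that spatially rearranged features score zero under this measure while remaining far apart in Euclidean distance, which is exactly the constraint the paper argues is unreasonable.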
arXiv Detail & Related papers (2021-05-25T03:46:39Z) - Boosting Continuous Sign Language Recognition via Cross Modality Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pairs.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.