Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack
- URL: http://arxiv.org/abs/2405.18641v5
- Date: Tue, 29 Oct 2024 05:46:55 GMT
- Title: Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack
- Authors: Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
- Abstract summary: Large Language Models (LLMs) with safety alignment can be jail-broken by fine-tuning on a dataset mixed with harmful data.
We show that the jail-broken effect can be mitigated by separating states in the finetuning stage to optimize the alignment and user datasets.
We propose Lazy(i) safety alignment (Lisa), which introduces a proximal term to constrain the drift of each state.
- Score: 7.945893812374361
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies show that Large Language Models (LLMs) with safety alignment can be jail-broken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature, we show that the jail-broken effect can be mitigated by separating states in the fine-tuning stage to optimize the alignment and user datasets. Unfortunately, our subsequent study shows that this simple Bi-State Optimization (BSO) solution experiences convergence instability when the number of steps invested in its alignment state is too small, leading to downgraded alignment performance. By statistical analysis, we show that the excess drift towards consensus could be a probable reason for the instability. To remedy this issue, we propose Lazy(i) safety alignment (Lisa), which introduces a proximal term to constrain the drift of each state. Theoretically, the benefit of the proximal term is supported by convergence analysis, wherein we show that a sufficiently large proximal factor is necessary to guarantee Lisa's convergence. Empirically, our results on four downstream fine-tuning tasks show that Lisa with a proximal term can significantly increase alignment performance while maintaining the LLM's accuracy on the user tasks. Code is available at https://github.com/git-disl/Lisa.
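The bi-state alternation with a proximal term can be sketched on a toy problem. This is a minimal illustration under assumptions, not the authors' implementation: the function names, the plain gradient-descent inner loop, and the toy scalar losses are all illustrative choices; only the structure (alternating states, each anchored by a proximal pull toward its switching point) follows the abstract.

```python
import numpy as np

def proximal_grad_step(theta, grad_fn, anchor, rho, lr=0.1):
    # One inner step of a state's optimization: follow the task gradient
    # while a proximal term rho * (theta - anchor) pulls the iterate back
    # toward the state's anchor, limiting "excess drift" between states.
    g = grad_fn(theta) + rho * (theta - anchor)
    return theta - lr * g

def lisa_sketch(theta0, grad_align, grad_user, rho, rounds=50, steps=5):
    # Alternate between an alignment state and a user fine-tuning state.
    # Each state anchors its proximal term at the parameters it received
    # when the state switch happened.
    theta = theta0
    for _ in range(rounds):
        anchor = theta.copy()
        for _ in range(steps):                 # alignment state
            theta = proximal_grad_step(theta, grad_align, anchor, rho)
        anchor = theta.copy()
        for _ in range(steps):                 # user state
            theta = proximal_grad_step(theta, grad_user, anchor, rho)
    return theta
```

With two quadratic losses whose minima disagree (e.g. gradients `t - 1` and `t + 1`), a larger `rho` keeps the final iterate closer to the switching anchors and damps the oscillation between the two states.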
Related papers
- TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs [7.125400292079228]
Large Language Models (LLMs) fine-tuned to align with human values often exhibit alignment drift. While prior work has behaviorally characterized alignment failure, little is known about the training-time belief sources underlying these failures. We introduce TraceAlign, a unified framework for tracing unsafe completions back to their root causes in the model's training corpus.
arXiv Detail & Related papers (2025-08-04T05:03:35Z) - Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets [64.96967819446553]
This paper investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. High similarity between these datasets significantly weakens safety guardrails, making models more susceptible to jailbreaks. Low similarity between these two types of datasets yields substantially more robust models and thus reduces the harmfulness score by up to 10.33%.
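One simple way to instantiate the representation-similarity measurement the abstract describes is cosine similarity between mean dataset embeddings. This is an illustrative assumption, not the paper's exact metric:

```python
import numpy as np

def dataset_similarity(emb_a, emb_b):
    # Cosine similarity between the mean embeddings of two datasets:
    # a simple proxy for how much an alignment dataset and a downstream
    # fine-tuning dataset overlap in representation space.
    a = emb_a.mean(axis=0)
    b = emb_b.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A score near 1 would flag a fine-tuning set that sits close to the alignment data, the regime the paper associates with weakened guardrails.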
arXiv Detail & Related papers (2025-06-05T17:59:55Z) - Towards safe Bayesian optimization with Wiener kernel regression [0.6554326244334868]
We present a novel error bound based on the recently proposed Wiener kernel regression. We prove that under rather mild assumptions, the proposed error bound is tighter than bounds previously documented in the literature. We draw upon a numerical example to demonstrate the efficacy of the proposed error bound in safe BO.
arXiv Detail & Related papers (2024-11-04T16:43:16Z) - Bayesian scaling laws for in-context learning [72.17734205418502]
In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates.
We show that ICL approximates a Bayesian learner and develop a family of novel Bayesian scaling laws for ICL.
arXiv Detail & Related papers (2024-10-21T21:45:22Z) - Large Continual Instruction Assistant [59.585544987096974]
Continual Instruction Tuning (CIT) is adopted to instruct Large Models to follow human intent data by data. Gradient updates in existing methods heavily degrade performance on previous datasets during the CIT process. We propose a general continual instruction tuning framework to address this challenge.
arXiv Detail & Related papers (2024-10-08T11:24:59Z) - Robust Barycenter Estimation using Semi-Unbalanced Neural Optimal Transport [84.51977664336056]
We propose a novel, scalable approach for estimating the robust continuous barycenter.
Our method is framed as a min-max optimization problem and is adaptable to general cost functions.
arXiv Detail & Related papers (2024-10-04T23:27:33Z) - TS-HTFA: Advancing Time Series Forecasting via Hierarchical Text-Free Alignment with Large Language Models [14.411646409316624]
We introduce Hierarchical Text-Free Alignment (TS-HTFA), a novel method for time-series forecasting.
We replace paired text data with adaptive virtual text based on QR-decomposition word embeddings and learnable prompts.
Experiments on multiple time-series benchmarks demonstrate that TS-HTFA achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-09-23T12:57:24Z) - Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! [65.06450319194454]
Large language models (LLMs) undergo safety alignment to ensure safe conversations with humans.
This paper introduces a training-free attack method capable of reversing safety alignment.
We name this method emulated disalignment (ED) because sampling from this contrastive distribution provably emulates the result of fine-tuning to minimize a safety reward.
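A sketch of the contrastive sampling idea behind emulated disalignment. The exact combination rule below (amplifying what the base model prefers relative to the safety-aligned model) is our reading of the abstract, not the paper's published formula:

```python
import numpy as np

def ed_next_token_dist(logp_base, logp_aligned, alpha):
    # Contrastive distribution (sketch): push sampling toward tokens the
    # pre-trained (base) model prefers relative to the safety-aligned
    # model, logits = logp_base + alpha * (logp_base - logp_aligned),
    # emulating fine-tuning against the safety direction, training-free.
    logits = logp_base + alpha * (logp_base - logp_aligned)
    z = np.exp(logits - logits.max())
    return z / z.sum()
```

At `alpha = 0` this recovers the base model's distribution; increasing `alpha` shifts mass onto tokens the aligned model suppresses.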
arXiv Detail & Related papers (2024-02-19T18:16:51Z) - Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions.
Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z) - The Poison of Alignment [0.0]
We introduce a novel insight into how an instruction-tuned model's performance is affected by the presence of alignment.
We demonstrate that aligned answers significantly worsen the performance of the resulting fine-tuned model on various reasoning benchmarks.
arXiv Detail & Related papers (2023-08-25T15:51:15Z) - WSPAlign: Word Alignment Pre-training via Large-Scale Weakly Supervised Span Prediction [31.96433679860807]
Most existing word alignment methods rely on manual alignment datasets or parallel corpora.
We relax the requirement for correct, fully-aligned, and parallel sentences.
We then use such a large-scale weakly-supervised dataset for word alignment pre-training via span prediction.
arXiv Detail & Related papers (2023-06-09T03:11:42Z) - MaxMatch: Semi-Supervised Learning with Worst-Case Consistency [149.03760479533855]
We propose a worst-case consistency regularization technique for semi-supervised learning (SSL).
We present a generalization bound for SSL consisting of the empirical loss terms observed on labeled and unlabeled training data separately.
Motivated by this bound, we derive an SSL objective that minimizes the largest inconsistency between an original unlabeled sample and its multiple augmented variants.
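The worst-case objective described above can be sketched as follows; the KL divergence and the function names are illustrative assumptions consistent with, but not copied from, the paper:

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    # KL(p || q) between two discrete distributions, clipped for stability.
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def worst_case_consistency(p_orig, p_augmented):
    # MaxMatch-style objective (sketch): among the predictions on the
    # augmented variants, take the one that diverges most from the
    # prediction on the original sample; training minimizes this
    # largest inconsistency rather than an average one.
    return max(kl_div(p_orig, q) for q in p_augmented)
```

Minimizing the maximum (instead of the mean) inconsistency gives the worst-case guarantee the generalization bound motivates.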
arXiv Detail & Related papers (2022-09-26T12:04:49Z) - The Devil is in the Margin: Margin-based Label Smoothing for Network Calibration [21.63888208442176]
Despite the dominant performance of deep neural networks, recent works have shown that they are poorly calibrated.
We provide a unifying constrained-optimization perspective of current state-of-the-art calibration losses.
We propose a simple and flexible generalization based on inequality constraints, which imposes a controllable margin on logit distances.
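One way to read the inequality-constrained formulation is as a penalty on logit distances that exceed a margin. The penalty form below is a hedged sketch of that idea, not the paper's exact loss:

```python
import numpy as np

def margin_logit_penalty(logits, margin):
    # Penalize logit distances to the maximum logit beyond the margin:
    # sum_j max(0, (max_k z_k - z_j) - margin). Distances within the
    # margin are left unconstrained, unlike uniform label smoothing,
    # which pulls all logits together regardless of how close they are.
    d = np.max(logits) - logits
    return float(np.sum(np.maximum(0.0, d - margin)))
```

Added to a cross-entropy term, this imposes a controllable margin on logit distances for calibration.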
arXiv Detail & Related papers (2021-11-30T14:21:47Z) - Feature Space Targeted Attacks by Statistic Alignment [74.40447383387574]
Feature space targeted attacks perturb images by modulating their intermediate feature maps.
The current choice of pixel-wise Euclidean Distance to measure the discrepancy is questionable because it unreasonably imposes a spatial-consistency constraint on the source and target features.
We propose two novel approaches called Pair-wise Alignment Attack and Global-wise Alignment Attack, which attempt to measure similarities between feature maps by high-order statistics.
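A minimal sketch of the statistic-matching idea: compare feature maps through their moments rather than a pixel-wise Euclidean distance. Matching only mean and standard deviation is a simplification for illustration; the paper's attacks use higher-order statistics:

```python
import numpy as np

def global_statistic_alignment(f_src, f_tgt):
    # Global-wise alignment (sketch): measure the discrepancy between
    # feature maps via their first- and second-order moments, which
    # drops the spatial-consistency constraint a pixel-wise Euclidean
    # distance would impose on source and target features.
    return float((f_src.mean() - f_tgt.mean()) ** 2
                 + (f_src.std() - f_tgt.std()) ** 2)
```

Note that spatially rearranged features score zero under this measure while remaining far apart in Euclidean distance, which is exactly the constraint the paper argues is unreasonable.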
arXiv Detail & Related papers (2021-05-25T03:46:39Z) - Boosting Continuous Sign Language Recognition via Cross Modality Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pairs.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.