Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation
- URL: http://arxiv.org/abs/2409.01586v2
- Date: Wed, 4 Sep 2024 19:30:59 GMT
- Title: Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation
- Authors: Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
- Abstract summary: The harmful fine-tuning issue \citep{qi2023fine} poses serious safety concerns for Large Language Models' fine-tuning-as-a-service.
We propose an alignment-stage solution, dubbed Booster, to mitigate the issue.
- Score: 7.945893812374361
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The harmful fine-tuning issue \citep{qi2023fine} poses serious safety concerns for Large Language Models' fine-tuning-as-a-service. While existing defenses \citep{huang2024vaccine,rosati2024representation} have been proposed to mitigate the issue, their performance is still far from satisfactory, and the root cause of the problem has not been fully uncovered. For the first time in the literature, we show in this paper that \textit{harmful perturbation} over the model weights is the root cause of the alignment breakage induced by harmful fine-tuning. To attenuate the negative impact of harmful perturbation, we propose an alignment-stage solution, dubbed Booster. Technically, along with the original alignment loss, we append a loss regularizer to the alignment stage's optimization. The regularizer ensures that the reduction in the model's harmful loss before and after a simulated harmful perturbation is attenuated, thereby mitigating the subsequent fine-tuning risk. Empirical results show that Booster can effectively reduce the harmful score of the fine-tuned models while maintaining the performance of downstream tasks. Our code is available at \url{https://github.com/git-disl/Booster}.
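The abstract describes the regularizer only in words; the minimal PyTorch sketch below shows one way such an objective could be realized: alignment loss plus lam times the harmful-loss reduction produced by one simulated, normalized harmful gradient step. The model interface (a HuggingFace-style model returning .loss), the first-order approximation, and the hyper-parameters lam and alpha are assumptions for illustration, not the authors' implementation; see the linked repository for the official code.

```python
# Hedged sketch (not the authors' code) of one Booster-style alignment update:
# alignment loss + lam * (harmful loss drop after a simulated harmful step).
import torch


def booster_step(model, align_batch, harmful_batch, optimizer, lam=5.0, alpha=0.1):
    params = [p for p in model.parameters() if p.requires_grad]

    # Harmful loss and its gradient at the current weights.
    harmful_loss = model(**harmful_batch).loss
    grads = torch.autograd.grad(harmful_loss, params)
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)).item() + 1e-12
    step = alpha / grad_norm

    # Simulate one normalized harmful gradient step, measure the harmful loss, undo it.
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.sub_(g, alpha=step)
        harmful_after = model(**harmful_batch).loss
        for p, g in zip(params, grads):
            p.add_(g, alpha=step)

    # Alignment loss plus the attenuation regularizer (first-order approximation:
    # the post-perturbation harmful loss is treated as a constant).
    align_loss = model(**align_batch).loss
    harmful_now = model(**harmful_batch).loss
    loss = align_loss + lam * (harmful_now - harmful_after)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```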
Related papers
- Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning [7.9447287301860445]
Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks \citep{qi2023fine} -- a few harmful data points mixed into the fine-tuning dataset can break the LLMs' safety alignment.
Existing mitigation strategies include alignment-stage solutions \citep{huang2024vaccine,rosati2024representation} and fine-tuning-stage solutions \citep{huang2024lazy,mukhoti2023fine}.
We propose Antidote, a post-fine-tuning stage solution, which remains agnostic to ...
arXiv Detail & Related papers (2024-08-18T21:45:03Z)
- Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack [7.945893812374361]
Large Language Models (LLMs) with safety alignment can be jail-broken by fine-tuning on a dataset mixed with harmful data.
We show that the jail-broken effect can be mitigated by separating states in the fine-tuning stage to optimize the alignment and user datasets.
We propose Lazy safety alignment (Lisa), which introduces a proximal term to constrain the drift of each state.
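As a rough illustration of such a proximal term, the sketch below adds a quadratic penalty on the drift of the weights from an anchor copy; the function name, anchor choice, and rho are assumptions for illustration, not Lisa's exact formulation.

```python
# Hedged sketch of a proximal regularizer: task loss + (rho/2) * ||w - w_anchor||^2,
# keeping each state's update close to an anchor (e.g., the other state's weights).
import torch


def proximal_loss(model, batch, anchor_params, rho=1.0):
    task_loss = model(**batch).loss  # assumes a HuggingFace-style model returning .loss
    drift = sum(((p - a.detach()) ** 2).sum()
                for p, a in zip(model.parameters(), anchor_params))
    return task_loss + 0.5 * rho * drift
```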
arXiv Detail & Related papers (2024-05-28T22:53:43Z)
- Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack [7.653580388741887]
A few harmful data points uploaded by users can easily trick fine-tuning into producing an alignment-broken model.
We propose Vaccine, a perturbation-aware alignment technique that mitigates the security risk of user fine-tuning.
arXiv Detail & Related papers (2024-02-02T02:56:50Z)
- Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors for overkill by exploring how models handle and determine the safety of queries.
Our findings reveal the presence of shortcuts within models, leading to over-attention to harmful words like 'kill', and show that prompts emphasizing safety exacerbate overkill.
We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
arXiv Detail & Related papers (2024-01-31T07:26:47Z)
- The Poison of Alignment [0.0]
We introduce a novel insight into how an instruction-tuned model's performance is affected by the presence of alignment.
We demonstrate that aligned answers significantly worsen the performance of the resulting fine-tuned model on various reasoning benchmarks.
arXiv Detail & Related papers (2023-08-25T15:51:15Z)
- Label Noise: Correcting the Forward-Correction [0.0]
Training neural network classifiers on datasets with label noise poses a risk of overfitting them to the noisy labels.
We propose an approach to tackling the overfitting caused by label noise.
Specifically, we propose imposing a lower bound on the training loss to mitigate overfitting.
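One simple way to impose such a lower bound on the training loss is a flooding-style objective; whether this matches the paper's exact construction is an assumption, and the flood level b below is illustrative.

```python
# Hedged sketch: bound the effective training loss from below ("flooding"-style).
# When the raw loss dips under the bound b, the gradient direction flips, which
# discourages memorizing noisy labels; b is an illustrative hyper-parameter.
import torch


def flooded_loss(raw_loss: torch.Tensor, b: float = 0.1) -> torch.Tensor:
    return (raw_loss - b).abs() + b
```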
arXiv Detail & Related papers (2023-07-24T19:41:19Z)
- STAR Loss: Reducing Semantic Ambiguity in Facial Landmark Detection [80.04000067312428]
We propose a Self-adapTive Ambiguity Reduction (STAR) loss by exploiting the properties of semantic ambiguity.
We find that semantic ambiguity results in an anisotropic predicted distribution, which inspires us to use the predicted distribution to represent semantic ambiguity.
We also propose two kinds of eigenvalue restriction methods that avoid both abnormal changes in the distribution and premature convergence of the model.
arXiv Detail & Related papers (2023-06-05T10:33:25Z)
- PTP: Boosting Stability and Performance of Prompt Tuning with Perturbation-Based Regularizer [94.23904400441957]
We introduce perturbation-based regularizers, which can smooth the loss landscape, into prompt tuning.
We design two kinds of perturbation-based regularizers: random-noise-based and adversarial-based.
Our new algorithms improve the state-of-the-art prompt tuning methods by 1.94% and 2.34% on SuperGLUE and FewGLUE benchmarks, respectively.
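As a rough illustration of the random-noise-based variant, the sketch below perturbs trainable soft-prompt embeddings with Gaussian noise and penalizes the resulting change in task loss; the prompt_embeds keyword and the hyper-parameters sigma and beta are assumptions, not PTP's actual interface or values.

```python
# Hedged sketch of a random-noise perturbation regularizer for prompt tuning.
# `prompt_embeds` is an assumed keyword for passing trainable soft-prompt embeddings.
import torch


def perturbed_prompt_loss(model, batch, prompt_embeds, sigma=1e-3, beta=1.0):
    clean_loss = model(prompt_embeds=prompt_embeds, **batch).loss
    noise = sigma * torch.randn_like(prompt_embeds)
    noisy_loss = model(prompt_embeds=prompt_embeds + noise, **batch).loss
    # Smoothness term: penalize how much a small random perturbation of the
    # soft prompt changes the task loss.
    return clean_loss + beta * (noisy_loss - clean_loss).abs()
```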
arXiv Detail & Related papers (2023-05-03T20:30:51Z)
- Towards the Semantic Weak Generalization Problem in Generative Zero-Shot Learning: Ante-hoc and Post-hoc [89.68803484284408]
We present a simple and effective strategy that addresses previously unexplored factors limiting the performance ceiling of generative Zero-Shot Learning (ZSL).
We begin by formally defining semantic generalization, then look into approaches for reducing the semantic weak generalization problem.
In the ante-hoc phase, we augment the generator's semantic input, as well as relax the fitting target of the generator.
arXiv Detail & Related papers (2022-04-24T13:54:42Z)
- Characterizing and addressing the issue of oversmoothing in neural autoregressive sequence modeling [49.06391831200667]
We study the effect of the proposed regularization on both model distribution and decoding performance.
We conclude that a high degree of oversmoothing is the main reason behind overly probable short sequences in neural autoregressive models.
arXiv Detail & Related papers (2021-12-16T14:33:12Z)
- Calibrated Surrogate Losses for Adversarially Robust Classification [92.37268323142307]
We show that no convex surrogate loss is calibrated with respect to the adversarial 0-1 loss when restricted to linear models.
We also show that, if the underlying distribution satisfies Massart's noise condition, convex losses can be calibrated in the adversarial setting.
arXiv Detail & Related papers (2020-05-28T02:40:42Z)