Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation
- URL: http://arxiv.org/abs/2409.01586v2
- Date: Wed, 4 Sep 2024 19:30:59 GMT
- Title: Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation
- Authors: Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
- Abstract summary: The harmful fine-tuning issue \citep{qi2023fine} poses serious safety concerns for Large Language Models' fine-tuning-as-a-service.
We propose an alignment-stage solution, dubbed Booster, to mitigate the issue.
- Score: 7.945893812374361
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The harmful fine-tuning issue \citep{qi2023fine} poses serious safety concerns for Large Language Models' fine-tuning-as-a-service. While existing defenses \citep{huang2024vaccine,rosati2024representation} have been proposed to mitigate the issue, their performance is still far from satisfactory, and the root cause of the problem has not been fully uncovered. For the first time in the literature, we show in this paper that \textit{harmful perturbation} over the model weights is the root cause of the alignment breakage induced by harmful fine-tuning. To attenuate the negative impact of harmful perturbation, we propose an alignment-stage solution, dubbed Booster. Technically, along with the original alignment loss, we append a loss regularizer to the alignment stage's optimization. The regularizer ensures that the reduction in the model's harmful loss before and after a simulated harmful perturbation is attenuated, thereby mitigating the subsequent fine-tuning risk. Empirical results show that Booster can effectively reduce the harmful score of the fine-tuned models while maintaining the performance of downstream tasks. Our code is available at \url{https://github.com/git-disl/Booster}.
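The abstract describes the regularizer only in words; the minimal PyTorch sketch below shows one way such an objective could be realized: alignment loss plus lam times the harmful-loss reduction produced by one simulated, normalized harmful gradient step. The model interface (a HuggingFace-style model returning .loss), the first-order approximation, and the hyper-parameters lam and alpha are assumptions for illustration, not the authors' implementation; see the linked repository for the official code.

```python
# Hedged sketch (not the authors' code) of one Booster-style alignment update:
# alignment loss + lam * (harmful loss drop after a simulated harmful step).
import torch


def booster_step(model, align_batch, harmful_batch, optimizer, lam=5.0, alpha=0.1):
    params = [p for p in model.parameters() if p.requires_grad]

    # Harmful loss and its gradient at the current weights.
    harmful_loss = model(**harmful_batch).loss
    grads = torch.autograd.grad(harmful_loss, params)
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)).item() + 1e-12
    step = alpha / grad_norm

    # Simulate one normalized harmful gradient step, measure the harmful loss, undo it.
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.sub_(g, alpha=step)
        harmful_after = model(**harmful_batch).loss
        for p, g in zip(params, grads):
            p.add_(g, alpha=step)

    # Alignment loss plus the attenuation regularizer (first-order approximation:
    # the post-perturbation harmful loss is treated as a constant).
    align_loss = model(**align_batch).loss
    harmful_now = model(**harmful_batch).loss
    loss = align_loss + lam * (harmful_now - harmful_after)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```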
Related papers
- Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning [7.9447287301860445]
Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks \citep{qi2023fine} -- a few harmful data points mixed into the fine-tuning dataset can break the LLMs' safety alignment.
Existing mitigation strategies include alignment-stage solutions \citep{huang2024vaccine,rosati2024representation} and fine-tuning-stage solutions \citep{huang2024lazy,mukhoti2023fine}.
We propose Antidote, a post-fine-tuning stage solution, which remains agnostic to ...
arXiv Detail & Related papers (2024-08-18T21:45:03Z)
- Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack [7.945893812374361]
Large Language Models (LLMs) with safety alignment can be jail-broken by fine-tuning on a dataset mixed with harmful data.
We show that the jail-broken effect can be mitigated by separating states in the fine-tuning stage to optimize the alignment and user datasets.
We propose Lazy safety alignment (Lisa), which introduces a proximal term to constrain the drift of each state.
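As a rough illustration of such a proximal term, the sketch below adds a quadratic penalty on the drift of the weights from an anchor copy; the function name, anchor choice, and rho are assumptions for illustration, not Lisa's exact formulation.

```python
# Hedged sketch of a proximal regularizer: task loss + (rho/2) * ||w - w_anchor||^2,
# keeping each state's update close to an anchor (e.g., the other state's weights).
import torch


def proximal_loss(model, batch, anchor_params, rho=1.0):
    task_loss = model(**batch).loss  # assumes a HuggingFace-style model returning .loss
    drift = sum(((p - a.detach()) ** 2).sum()
                for p, a in zip(model.parameters(), anchor_params))
    return task_loss + 0.5 * rho * drift
```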
arXiv Detail & Related papers (2024-05-28T22:53:43Z)
- Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack [7.653580388741887]
A few harmful data points uploaded by users can easily trick fine-tuning into producing an alignment-broken model.
We propose Vaccine, a perturbation-aware alignment technique that mitigates the security risk of user fine-tuning.
arXiv Detail & Related papers (2024-02-02T02:56:50Z)
- Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors for overkill by exploring how models handle and determine the safety of queries.
Our findings reveal the presence of shortcuts within models, leading to over-attention to harmful words like 'kill', and show that prompts emphasizing safety exacerbate overkill.
We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
arXiv Detail & Related papers (2024-01-31T07:26:47Z)
- The Poison of Alignment [0.0]
We introduce a novel insight into how an instruction-tuned model's performance is affected by the presence of alignment.
We demonstrate that aligned answers significantly worsen the performance of the resulting fine-tuned model on various reasoning benchmarks.
arXiv Detail & Related papers (2023-08-25T15:51:15Z)
- Label Noise: Correcting the Forward-Correction [0.0]
Training neural network classifiers on datasets with label noise poses a risk of overfitting them to the noisy labels.
We propose an approach to tackling the overfitting caused by label noise.
Specifically, we propose imposing a lower bound on the training loss to mitigate overfitting.
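One simple way to impose such a lower bound on the training loss is a flooding-style objective; whether this matches the paper's exact construction is an assumption, and the flood level b below is illustrative.

```python
# Hedged sketch: bound the effective training loss from below ("flooding"-style).
# When the raw loss dips under the bound b, the gradient direction flips, which
# discourages memorizing noisy labels; b is an illustrative hyper-parameter.
import torch


def flooded_loss(raw_loss: torch.Tensor, b: float = 0.1) -> torch.Tensor:
    return (raw_loss - b).abs() + b
```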
arXiv Detail & Related papers (2023-07-24T19:41:19Z)
- STAR Loss: Reducing Semantic Ambiguity in Facial Landmark Detection [80.04000067312428]
We propose a Self-adapTive Ambiguity Reduction (STAR) loss by exploiting the properties of semantic ambiguity.
We find that semantic ambiguity results in an anisotropic predicted distribution, which inspires us to use the predicted distribution to represent semantic ambiguity.
We also propose two kinds of eigenvalue restriction methods that avoid both abnormal changes in the distribution and premature convergence of the model.
arXiv Detail & Related papers (2023-06-05T10:33:25Z)
- PTP: Boosting Stability and Performance of Prompt Tuning with Perturbation-Based Regularizer [94.23904400441957]
We introduce perturbation-based regularizers, which can smooth the loss landscape, into prompt tuning.
We design two kinds of perturbation-based regularizers: random-noise-based and adversarial-based.
Our new algorithms improve the state-of-the-art prompt tuning methods by 1.94% and 2.34% on SuperGLUE and FewGLUE benchmarks, respectively.
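As a rough illustration of the random-noise-based variant, the sketch below perturbs trainable soft-prompt embeddings with Gaussian noise and penalizes the resulting change in task loss; the prompt_embeds keyword and the hyper-parameters sigma and beta are assumptions, not PTP's actual interface or values.

```python
# Hedged sketch of a random-noise perturbation regularizer for prompt tuning.
# `prompt_embeds` is an assumed keyword for passing trainable soft-prompt embeddings.
import torch


def perturbed_prompt_loss(model, batch, prompt_embeds, sigma=1e-3, beta=1.0):
    clean_loss = model(prompt_embeds=prompt_embeds, **batch).loss
    noise = sigma * torch.randn_like(prompt_embeds)
    noisy_loss = model(prompt_embeds=prompt_embeds + noise, **batch).loss
    # Smoothness term: penalize how much a small random perturbation of the
    # soft prompt changes the task loss.
    return clean_loss + beta * (noisy_loss - clean_loss).abs()
```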
arXiv Detail & Related papers (2023-05-03T20:30:51Z)
- Towards the Semantic Weak Generalization Problem in Generative Zero-Shot Learning: Ante-hoc and Post-hoc [89.68803484284408]
We present a simple and effective strategy that addresses previously unexplored factors limiting the performance ceiling of generative Zero-Shot Learning (ZSL).
We begin by formally defining semantic generalization, then look into approaches for reducing the semantic weak generalization problem.
In the ante-hoc phase, we augment the generator's semantic input, as well as relax the fitting target of the generator.
arXiv Detail & Related papers (2022-04-24T13:54:42Z)
- Characterizing and addressing the issue of oversmoothing in neural autoregressive sequence modeling [49.06391831200667]
We study the effect of the proposed regularization on both model distribution and decoding performance.
We conclude that a high degree of oversmoothing is the main reason behind overly probable short sequences in neural autoregressive models.
arXiv Detail & Related papers (2021-12-16T14:33:12Z)
- Calibrated Surrogate Losses for Adversarially Robust Classification [92.37268323142307]
We show that no convex surrogate loss is calibrated with respect to the adversarial 0-1 loss when restricted to linear models.
We also show that, if the underlying distribution satisfies Massart's noise condition, convex losses can be calibrated in the adversarial setting.
arXiv Detail & Related papers (2020-05-28T02:40:42Z)