Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
- URL: http://arxiv.org/abs/2409.18169v5
- Date: Tue, 03 Dec 2024 06:52:11 GMT
- Title: Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
- Authors: Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
- Abstract summary: Recent research demonstrates that the nascent fine-tuning-as-a-service business model exposes serious safety concerns.
The attack, known as the harmful fine-tuning attack, has attracted broad research interest in the community.
This paper provides a comprehensive overview of three aspects of harmful fine-tuning: attack setting, defense design, and evaluation methodology.
- Score: 7.945893812374361
- License:
- Abstract: Recent research demonstrates that the nascent fine-tuning-as-a-service business model exposes serious safety concerns: fine-tuning on a few harmful data points uploaded by users can compromise the safety alignment of the model. The attack, known as the harmful fine-tuning attack, has attracted broad research interest in the community. However, as the attack is still new, we observe that there are general misunderstandings within the research community. To clear up these concerns, this paper provides a comprehensive overview of three aspects of harmful fine-tuning: attack setting, defense design, and evaluation methodology. Specifically, we first present the threat model of the problem and introduce the harmful fine-tuning attack and its variants. Then we systematically survey the existing literature on attacks, defenses, and mechanism analysis of the problem. Finally, we introduce the evaluation methodology and outline future research directions that might contribute to the development of the field. Additionally, we present a list of questions of interest, which may be a useful reference when reviewers in the peer-review process question the realism of the experiment/attack/defense setting. A curated list of relevant papers is maintained and made accessible at: https://github.com/git-disl/awesome_LLM-harmful-fine-tuning-papers.
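To make the attack setting concrete, below is a minimal, illustrative sketch (not taken from the paper) of how an attacker could assemble a fine-tuning upload under this threat model: a small fraction of harmful instruction-response pairs is mixed into an otherwise benign dataset before submission to the fine-tuning service. The function name, the poison_ratio parameter, and the placeholder data are hypothetical.

```python
import random

def build_finetuning_upload(benign_pairs, harmful_pairs, poison_ratio=0.05,
                            total_size=1000, seed=0):
    """Illustrative sketch of the harmful fine-tuning threat model: hide a small
    number of harmful (instruction, response) pairs inside an otherwise
    benign-looking fine-tuning dataset uploaded to the service provider."""
    rng = random.Random(seed)
    n_harmful = int(total_size * poison_ratio)
    n_benign = total_size - n_harmful
    upload = (rng.choices(harmful_pairs, k=n_harmful)
              + rng.choices(benign_pairs, k=n_benign))
    rng.shuffle(upload)
    return upload  # the provider fine-tunes the aligned model on this mixture

# Usage with placeholder data:
benign = [("Summarize this article.", "Sure, here is a summary ...")]
harmful = [("<harmful request>", "<harmful compliance>")]
mixed = build_finetuning_upload(benign, harmful, poison_ratio=0.05, total_size=100)
print(sum(pair in harmful for pair in mixed), "harmful pairs out of", len(mixed))
```

Even a small harmful fraction in such an upload can compromise the model's safety alignment, which is the core concern the survey addresses.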
Related papers
- The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models [10.524960491460945]
Fine-tuning attacks can exploit large language models, causing them to reveal potentially harmful behaviours.
This paper investigates the performance of the Chain of Thought based reasoning model, DeepSeek, when subjected to fine-tuning attacks.
We aim to shed light on the vulnerability of Chain of Thought enabled models to fine-tuning attacks and the implications for their safety and ethical deployment.
arXiv Detail & Related papers (2025-02-03T10:28:26Z) - Model Inversion Attacks: A Survey of Approaches and Countermeasures [59.986922963781]
Recently, a new type of privacy attack, the model inversion attack (MIA), has emerged, which aims to extract sensitive features of the private data used for training.
Despite the significance, there is a lack of systematic studies that provide a comprehensive overview and deeper insights into MIAs.
This survey aims to summarize up-to-date MIA methods in both attacks and defenses.
arXiv Detail & Related papers (2024-11-15T08:09:28Z) - Dealing Doubt: Unveiling Threat Models in Gradient Inversion Attacks under Federated Learning, A Survey and Taxonomy [10.962424750173332]
Federated Learning (FL) has emerged as a leading paradigm for decentralized, privacy-preserving machine learning training.
Recent research on gradient inversion attacks (GIAs) has shown that gradient updates in FL can leak information about private training samples.
We present a survey and novel taxonomy of GIAs that emphasizes FL threat models, particularly those of malicious servers and clients.
arXiv Detail & Related papers (2024-05-16T18:15:38Z) - Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial NLP [83.66405397421907]
We rethink the research paradigm of textual adversarial samples in security scenarios.
We first collect, process, and release a collection of security datasets, Advbench.
Next, we propose a simple rule-based method that can easily fulfill realistic adversarial goals, simulating real-world attack methods.
arXiv Detail & Related papers (2022-10-19T15:53:36Z) - Fact-Saboteurs: A Taxonomy of Evidence Manipulation Attacks against Fact-Verification Systems [80.3811072650087]
We show that it is possible to subtly modify claim-salient snippets in the evidence and generate diverse and claim-aligned evidence.
The attacks are also robust against post-hoc modifications of the claim.
These attacks can have harmful implications for inspectable and human-in-the-loop usage scenarios.
arXiv Detail & Related papers (2022-09-07T13:39:24Z) - A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks [72.7373468905418]
We develop an open-source toolkit, OpenBackdoor, to foster the implementation and evaluation of textual backdoor learning.
We also propose CUBE, a simple yet strong clustering-based defense baseline.
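As a rough illustration of the idea behind clustering-based filtering defenses, the sketch below clusters per-class sample embeddings and discards unusually small clusters, which poisoned samples tend to form. This is a simplified, hypothetical example, not CUBE's actual pipeline; the function name, the use of KMeans, and the thresholds are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def filter_suspicious_samples(embeddings, labels, n_clusters=8, min_frac=0.05):
    """Hypothetical clustering-based filter: within each class, cluster the
    sample embeddings and drop samples that fall into unusually small clusters
    (a pattern poisoned/backdoored samples often exhibit)."""
    keep = np.ones(len(labels), dtype=bool)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        k = min(n_clusters, len(idx))
        assign = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings[idx])
        sizes = np.bincount(assign, minlength=k)
        small_clusters = np.where(sizes < min_frac * len(idx))[0]
        keep[idx[np.isin(assign, small_clusters)]] = False
    return keep  # boolean mask over the training set

# Usage with random placeholder embeddings:
rng = np.random.default_rng(0)
emb, lab = rng.normal(size=(200, 16)), rng.integers(0, 2, size=200)
print(int(filter_suspicious_samples(emb, lab).sum()), "of 200 samples kept")
```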
arXiv Detail & Related papers (2022-06-17T02:29:23Z) - A Survey on Gradient Inversion: Attacks, Defenses and Future Directions [81.46745643749513]
We present a comprehensive survey on GradInv, aiming to summarize the cutting-edge research and broaden the horizons for different domains.
First, we propose a taxonomy of GradInv attacks by characterizing existing attacks into two paradigms: iteration-based and recursion-based attacks.
Second, we summarize emerging defense strategies against GradInv attacks. We find that these approaches focus on three perspectives: data obscuration, model improvement, and gradient protection.
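To illustrate the iteration-based paradigm, here is a minimal toy sketch (not taken from the surveyed works) of gradient matching: the attacker optimizes a dummy input and soft label so that their gradient on a shared model matches an observed gradient from a private sample. The toy linear model, LBFGS settings, and iteration count are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)            # toy "shared" model known to the attacker
criterion = nn.CrossEntropyLoss()

# Victim side: gradient of the loss on one private sample (observed by the attacker).
x_private = torch.randn(1, 10)
y_private = torch.tensor([1])
true_grads = torch.autograd.grad(criterion(model(x_private), y_private),
                                 model.parameters())

# Attacker side: optimize dummy data so its gradient matches the observed one.
x_dummy = torch.randn(1, 10, requires_grad=True)
y_dummy = torch.randn(1, 2, requires_grad=True)   # soft label, kept differentiable
opt = torch.optim.LBFGS([x_dummy, y_dummy])

def closure():
    opt.zero_grad()
    dummy_loss = criterion(model(x_dummy), torch.softmax(y_dummy, dim=-1))
    dummy_grads = torch.autograd.grad(dummy_loss, model.parameters(),
                                      create_graph=True)
    grad_diff = sum(((dg - tg) ** 2).sum()
                    for dg, tg in zip(dummy_grads, true_grads))
    grad_diff.backward()
    return grad_diff

for _ in range(20):
    opt.step(closure)

print("reconstruction error:", (x_dummy - x_private).norm().item())
```

Defenses in the three categories above (data obscuration, model improvement, gradient protection) aim, in different ways, to make this matching objective uninformative.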
arXiv Detail & Related papers (2022-06-15T03:52:51Z) - Wild Patterns Reloaded: A Survey of Machine Learning Security against Training Data Poisoning [32.976199681542845]
We provide a comprehensive systematization of poisoning attacks and defenses in machine learning.
We start by categorizing the current threat models and attacks, and then organize existing defenses accordingly.
We argue that our systematization also encompasses state-of-the-art attacks and defenses for other data modalities.
arXiv Detail & Related papers (2022-05-04T11:00:26Z) - Poisoning Attacks and Defenses on Artificial Intelligence: A Survey [3.706481388415728]
Data poisoning attacks tamper with the data samples fed to the model during the training phase, leading to a degradation in the model's accuracy during the inference phase.
This work compiles the most relevant insights and findings from the latest literature addressing this type of attack.
A thorough assessment is performed on the reviewed works, comparing the effects of data poisoning on a wide range of ML models in real-world conditions.
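As a toy, self-contained illustration of this effect (not drawn from the reviewed works), the snippet below flips the labels of a fraction of training samples and measures the resulting drop in test accuracy for a simple classifier; the dataset, model, and poison fractions are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy label-flipping poisoning attack: corrupt a fraction of training labels
# and observe the accuracy degradation at inference time.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

def accuracy_after_poisoning(flip_frac, seed=0):
    rng = np.random.default_rng(seed)
    y_poisoned = y_tr.copy()
    n_flip = int(flip_frac * len(y_poisoned))
    idx = rng.choice(len(y_poisoned), size=n_flip, replace=False)
    y_poisoned[idx] = 1 - y_poisoned[idx]  # flip binary labels
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_poisoned)
    return clf.score(X_te, y_te)

for frac in (0.0, 0.1, 0.3):
    print(f"poison fraction {frac:.1f}: test accuracy {accuracy_after_poisoning(frac):.3f}")
```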
arXiv Detail & Related papers (2022-02-21T14:43:38Z) - A Review of Adversarial Attack and Defense for Classification Methods [78.50824774203495]
This paper focuses on the generation of adversarial examples and on defenses against them.
It is the hope of the authors that this paper will encourage more statisticians to work on this important and exciting field of generating and defending against adversarial examples.
arXiv Detail & Related papers (2021-11-18T22:13:43Z)