Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
- URL: http://arxiv.org/abs/2406.20053v1
- Date: Fri, 28 Jun 2024 17:05:46 GMT
- Title: Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
- Authors: Danny Halawi, Alexander Wei, Eric Wallace, Tony T. Wang, Nika Haghtalab, Jacob Steinhardt
- Abstract summary: Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. We introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection.
- Score: 86.05704141217036
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.
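To make the shape of such an attack concrete, the following is a minimal, hypothetical sketch of how an "encoded" finetuning dataset could be assembled. The cipher, the helper names, and the JSONL chat format are illustrative assumptions, not the paper's actual encoding scheme or pipeline; the point is only that each resulting record is unreadable in isolation, which is what lets it slip past per-datapoint inspection.

```python
import json
import string

# Toy letter-substitution cipher (ROT13-style). This is an illustrative
# assumption: the paper's real encodings are not described here.
ALPHABET = string.ascii_lowercase


def encode(text: str, shift: int = 13) -> str:
    """Substitute every lowercase letter by the letter `shift` places later."""
    table = str.maketrans(ALPHABET, ALPHABET[shift:] + ALPHABET[:shift])
    return text.lower().translate(table)


def build_encoded_records(pairs):
    """Turn (request, response) pairs into chat-style finetuning records
    whose contents look like gibberish when each record is read on its own."""
    return [
        {
            "messages": [
                {"role": "user", "content": encode(request)},
                {"role": "assistant", "content": encode(response)},
            ]
        }
        for request, response in pairs
    ]


if __name__ == "__main__":
    # Benign placeholder pairs; the attack described in the paper would use
    # harmful content here, which dataset inspection would otherwise flag.
    pairs = [("name three common metals", "iron, copper, and aluminium are common metals")]
    with open("encoded_finetune.jsonl", "w") as f:
        for record in build_encoded_records(pairs):
            f.write(json.dumps(record) + "\n")
```

Finetuning on enough such records teaches the model the encoding itself, so at deployment an encoded harmful request can elicit an encoded harmful response that input/output classifiers, which see only ciphertext, do not flag.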
Related papers
- Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence [33.73351876121039]
Fine-tuning-as-a-service introduces a threat to Large Language Models' safety when service providers fine-tune their models on user-submitted datasets. We show that by regularizing the contribution of harmful samples encountered during fine-tuning, we can effectively mitigate the impact of harmful fine-tuning attacks.
arXiv Detail & Related papers (2026-02-28T06:46:21Z)
- Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler [67.24175911858312]
Harmful fine-tuning poses critical safety risks to fine-tuning-as-a-service for large language models. Bayesian Data Scheduler (BDS) is an adaptive tuning-stage defense strategy with no need for attack simulation. BDS learns the posterior distribution of each data point's safety attribute, conditioned on the fine-tuning and alignment datasets.
arXiv Detail & Related papers (2025-10-31T04:49:37Z)
- Detecting Adversarial Fine-tuning with Auditing Agents [38.964973163076586]
We introduce the concept of a fine-tuning auditing agent and show it can detect harmful fine-tuning prior to model deployment. We evaluate our detection approach on a diverse set of eight strong fine-tuning attacks from the literature, along with five benign fine-tuned models. Most promisingly, the auditor is able to detect covert cipher attacks that evade safety evaluations and content moderation of the dataset.
arXiv Detail & Related papers (2025-10-17T23:01:16Z)
- Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks [10.478976654618272]
Adversaries can exploit large language model fine-tuning APIs to bypass model safety mechanisms. We introduce the Cipher Fine-tuning Robustness benchmark (CIFR), a benchmark for evaluating defense strategies' ability to retain model safety. We show that probe monitors achieve over 99% detection accuracy, generalize to unseen cipher variants and families, and compare favorably to state-of-the-art monitoring approaches.
arXiv Detail & Related papers (2025-08-23T22:55:15Z)
- Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance (a minimal sketch of this idea appears after this list). Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
arXiv Detail & Related papers (2025-08-17T23:46:36Z)
- Safety-Aligned Weights Are Not Enough: Refusal-Teacher-Guided Finetuning Enhances Safety and Downstream Performance under Harmful Finetuning Attacks [32.73803760326097]
Finetuning-as-a-Service (FaaS) allows users to customize Large Language Models (LLMs) using their own data. Prior works attempt to mitigate the resulting safety risk by first constructing a safety-aligned model and then finetuning it on user data. We propose a Refusal-Teacher (Ref-Teacher)-guided finetuning framework.
arXiv Detail & Related papers (2025-06-09T02:10:51Z)
- LookAhead Tuning: Safer Language Models via Partial Answer Previews [38.7113305301502]
LookAhead Tuning mitigates the degradation of model safety during fine-tuning.
Two simple, low-resource, and effective data-driven methods modify training data by previewing partial answer prefixes.
arXiv Detail & Related papers (2025-03-24T18:11:42Z)
- Fundamental Limitations in Defending LLM Finetuning APIs [61.29028411001255]
We show that defences of fine-tuning APIs are fundamentally limited in their ability to prevent fine-tuning attacks.
We construct 'pointwise-undetectable' attacks that repurpose entropy in benign model outputs to covertly transmit dangerous knowledge.
We test our attacks against the OpenAI fine-tuning API, finding they succeed in eliciting answers to harmful multiple-choice questions.
arXiv Detail & Related papers (2025-02-20T18:45:01Z)
- Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation [58.7395356511539]
Harmful fine-tuning attacks introduce significant security risks to fine-tuning services.
Mainstream defenses aim to vaccinate the model so that a later harmful fine-tuning attack is less effective.
We propose Panacea, which optimizes an adaptive perturbation that is applied to the model after fine-tuning.
arXiv Detail & Related papers (2025-01-30T02:47:09Z)
- Hide in Plain Sight: Clean-Label Backdoor for Auditing Membership Inference [16.893873979953593]
We propose a novel clean-label backdoor-based approach for stealthy data auditing.
Our approach employs an optimal trigger generated by a shadow model that mimics the target model's behavior.
The proposed method enables robust data auditing through black-box access, achieving high attack success rates across diverse datasets.
arXiv Detail & Related papers (2024-11-24T20:56:18Z)
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance with harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response.
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
- Certified Robustness to Data Poisoning in Gradient-Based Training [10.79739918021407]
We develop the first framework providing provable guarantees on the behavior of models trained with potentially manipulated data.
Our framework certifies robustness against untargeted and targeted poisoning, as well as backdoor attacks.
We demonstrate our approach on multiple real-world datasets from applications including energy consumption, medical imaging, and autonomous driving.
arXiv Detail & Related papers (2024-06-09T06:59:46Z)
- Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment [56.2017039028998]
Fine-tuning in the Language-Model-as-a-Service (LMaaS) setting introduces new threats, notably the Fine-tuning based Jailbreak Attack (FJAttack).
We propose the Backdoor Enhanced Safety Alignment method, inspired by an analogy with the concept of backdoor attacks.
Our comprehensive experiments demonstrate that, through Backdoor Enhanced Safety Alignment with as few as 11 added safety examples, maliciously finetuned LLMs achieve safety performance similar to the original aligned models without harming benign performance.
arXiv Detail & Related papers (2024-02-22T21:05:18Z)
- Can We Trust the Unlabeled Target Data? Towards Backdoor Attack and Defense on Model Adaptation [120.42853706967188]
We explore potential backdoor attacks on model adaptation launched through well-designed poisoning of target data.
We propose a plug-and-play method named MixAdapt that can be combined with existing adaptation algorithms.
arXiv Detail & Related papers (2024-01-11T16:42:10Z)
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! [88.90694413503614]
We find that the safety alignment of LLMs can be compromised by fine-tuning.
We jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 adversarially designed training examples.
We advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
arXiv Detail & Related papers (2023-10-05T17:12:17Z)
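Several of the defenses listed above act on the optimization process rather than on the data. As one concrete illustration, the "Rethinking Safety in LLM Fine-tuning" entry mentions an exponential moving average (EMA) of the weights in parameter space; the sketch below shows that general idea under the assumption of an ordinary PyTorch finetuning loop. The loader and model names in the commented usage are hypothetical placeholders, and this is not that paper's implementation.

```python
import torch


@torch.no_grad()
def ema_update(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.999) -> None:
    """Blend the current finetuned weights into a slowly moving average that
    stays close to the safety-aligned starting point."""
    for ema_param, param in zip(ema_model.parameters(), model.parameters()):
        ema_param.mul_(decay).add_(param, alpha=1.0 - decay)


# Hypothetical usage inside a finetuning loop (names are placeholders):
#
#   model = load_aligned_model()            # assumed helper, not a real API
#   ema_model = copy.deepcopy(model)        # EMA copy starts at the aligned weights
#   for batch in dataloader:
#       loss = model(**batch).loss
#       loss.backward()
#       optimizer.step()
#       optimizer.zero_grad()
#       ema_update(ema_model, model)
#
# Serving the EMA copy instead of the raw finetuned weights is one way such a
# momentum-in-parameter-space scheme could preserve safety behavior.
```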