Related papers: Fundamental Limitations in Defending LLM Finetuning APIs

Fundamental Limitations in Defending LLM Finetuning APIs

URL: http://arxiv.org/abs/2502.14828v1
Date: Thu, 20 Feb 2025 18:45:01 GMT
Title: Fundamental Limitations in Defending LLM Finetuning APIs
Authors: Xander Davies, Eric Winsor, Tomek Korbak, Alexandra Souly, Robert Kirk, Christian Schroeder de Witt, Yarin Gal,
Abstract summary: We show that defences of fine-tuning APIs are fundamentally limited in their ability to prevent fine-tuning attacks.<n>We construct 'pointwise-undetectable' attacks that repurpose entropy in benign model outputs to covertly transmit dangerous knowledge.<n>We test our attacks against the OpenAI fine-tuning API, finding they succeed in eliciting answers to harmful multiple-choice questions.
Score: 61.29028411001255
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLM developers have imposed technical interventions to prevent fine-tuning misuse attacks, attacks where adversaries evade safeguards by fine-tuning the model using a public API. Previous work has established several successful attacks against specific fine-tuning API defences. In this work, we show that defences of fine-tuning APIs that seek to detect individual harmful training or inference samples ('pointwise' detection) are fundamentally limited in their ability to prevent fine-tuning attacks. We construct 'pointwise-undetectable' attacks that repurpose entropy in benign model outputs (e.g. semantic or syntactic variations) to covertly transmit dangerous knowledge. Our attacks are composed solely of unsuspicious benign samples that can be collected from the model before fine-tuning, meaning training and inference samples are all individually benign and low-perplexity. We test our attacks against the OpenAI fine-tuning API, finding they succeed in eliciting answers to harmful multiple-choice questions, and that they evade an enhanced monitoring system we design that successfully detects other fine-tuning attacks. We encourage the community to develop defences that tackle the fundamental limitations we uncover in pointwise fine-tuning API defences.

Related papers

CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning [12.293101110323722]
Fine-tuning-as-a-service exposes models to harmful fine-tuning attacks.<n>We propose a paradigm shift: instead of selective removal, we advocate for inducing model collapse.<n>This collapse directly neutralizes the very general capabilities that attackers exploit.
arXiv Detail & Related papers (2025-05-22T11:47:08Z)
Mind the Gap: Detecting Black-box Adversarial Attacks in the Making through Query Update Analysis [3.795071937009966]
Adrial attacks can jeopardize the integrity of Machine Learning (ML) models. We propose a framework that detects if an adversarial noise instance is being generated. We evaluate our approach against 8 state-of-the-art attacks, including adaptive attacks.
arXiv Detail & Related papers (2025-03-04T20:25:12Z)
No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms [22.667573777927203]
We present a new fine-tuning attack that trains models to first refuse harmful requests before answering them.<n>This "refuse-then-comply" strategy bypasses shallow defenses and produces harmful responses that evade output filters.<n>Our attack received a $2000 bug bounty from OpenAI and was acknowledged as a vulnerability by Anthropic.
arXiv Detail & Related papers (2025-02-26T20:20:01Z)
Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [51.51850981481236]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses.<n>PoATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety.<n>To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z)
A Realistic Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks. Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text. We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing.
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
QUEEN: Query Unlearning against Model Extraction [22.434812818540966]
Model extraction attacks pose a non-negligible threat to the security and privacy of deep learning models. We propose QUEEN (QUEry unlEarNing) that proactively launches counterattacks on potential model extraction attacks.
arXiv Detail & Related papers (2024-07-01T13:01:41Z)
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation [86.05704141217036]
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. We introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection.
arXiv Detail & Related papers (2024-06-28T17:05:46Z)
A Classification-by-Retrieval Framework for Few-Shot Anomaly Detection to Detect API Injection Attacks [9.693391036125908]
We propose a novel unsupervised few-shot anomaly detection framework composed of two main parts. First, we train a dedicated generic language model for API based on FastText embedding. Next, we use Approximate Nearest Neighbor search in a classification-by-retrieval approach.
arXiv Detail & Related papers (2024-05-18T10:15:31Z)
Kick Bad Guys Out! Conditionally Activated Anomaly Detection in Federated Learning with Zero-Knowledge Proof Verification [22.078088272837068]
Federated Learning (FL) systems are susceptible to adversarial attacks. Current defense methods are often impractical for real-world FL systems. We propose a novel anomaly detection strategy that is designed for real-world FL systems.
arXiv Detail & Related papers (2023-10-06T07:09:05Z)
RamBoAttack: A Robust Query Efficient Deep Neural Network Decision Exploit [9.93052896330371]
We develop a robust query efficient attack capable of avoiding entrapment in a local minimum and misdirection from noisy gradients. The RamBoAttack is more robust to the different sample inputs available to an adversary and the targeted class.
arXiv Detail & Related papers (2021-12-10T01:25:24Z)
Improving the Adversarial Robustness for Speaker Verification by Self-Supervised Learning [95.60856995067083]
This work is among the first to perform adversarial defense for ASV without knowing the specific attack algorithms. We propose to perform adversarial defense from two perspectives: 1) adversarial perturbation purification and 2) adversarial perturbation detection. Experimental results show that our detection module effectively shields the ASV by detecting adversarial samples with an accuracy of around 80%.
arXiv Detail & Related papers (2021-06-01T07:10:54Z)
A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNNs) based vision systems. This paper proposes a self-supervised adversarial training mechanism in the input space. It provides significant robustness against the textbfunseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)
Weight Poisoning Attacks on Pre-trained Models [103.19413805873585]
We show that it is possible to construct weight poisoning'' attacks where pre-trained weights are injected with vulnerabilities that expose backdoors'' after fine-tuning. Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat.
arXiv Detail & Related papers (2020-04-14T16:51:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.