Checkpoint-GCG: Auditing and Attacking Fine-Tuning-Based Prompt Injection Defenses
- URL: http://arxiv.org/abs/2505.15738v2
- Date: Thu, 16 Oct 2025 12:31:18 GMT
- Title: Checkpoint-GCG: Auditing and Attacking Fine-Tuning-Based Prompt Injection Defenses
- Authors: Xiaoxue Yang, Bozhidar Stevanoski, Matthieu Meeus, Yves-Alexandre de Montjoye
- Abstract summary: We introduce Checkpoint-GCG, a white-box attack against fine-tuning-based defenses. We show Checkpoint-GCG to achieve up to $96\%$ attack success rate (ASR) against the strongest defense.
- Score: 10.08464073347558
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are increasingly deployed in real-world applications ranging from chatbots to agentic systems, where they are expected to process untrusted data and follow trusted instructions. Failure to distinguish between the two poses significant security risks, exploited by prompt injection attacks, which inject malicious instructions into the data to control model outputs. Model-level defenses have been proposed to mitigate prompt injection attacks. These defenses fine-tune LLMs to ignore injected instructions in untrusted data. We introduce Checkpoint-GCG, a white-box attack against fine-tuning-based defenses. Checkpoint-GCG enhances the Greedy Coordinate Gradient (GCG) attack by leveraging intermediate model checkpoints produced during fine-tuning to initialize GCG, with each checkpoint acting as a stepping stone for the next one to continuously improve attacks. First, we instantiate Checkpoint-GCG to evaluate the robustness of the state-of-the-art defenses in an auditing setup, assuming both (a) full knowledge of the model input and (b) access to intermediate model checkpoints. We show Checkpoint-GCG to achieve up to $96\%$ attack success rate (ASR) against the strongest defense. Second, we relax the first assumption by searching for a universal suffix that would work on unseen inputs, and obtain up to $89.9\%$ ASR against the strongest defense. Finally, we relax both assumptions by searching for a universal suffix that would transfer to similar black-box models and defenses, achieving an ASR of $63.9\%$ against a newly released defended model from Meta.
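The warm-start idea in the abstract (each checkpoint's attack initializes the attack on the next checkpoint) can be illustrated with a minimal toy sketch. This is not the authors' implementation: the names `greedy_coordinate_search`, `checkpoint_warm_start`, and `make_toy_loss` are hypothetical, each "checkpoint" is stood in for by a plain loss function over token ids, and the inner search is a gradient-free coordinate search rather than real GCG, which uses token gradients to propose candidates.

```python
from typing import Callable, List, Sequence

Suffix = List[int]
Loss = Callable[[Sequence[int]], float]


def greedy_coordinate_search(loss: Loss, init: Suffix, vocab_size: int,
                             max_sweeps: int = 10) -> Suffix:
    """Gradient-free stand-in for GCG (assumption): try every token at
    every position and keep any swap that strictly lowers the loss."""
    suffix = list(init)
    best = loss(suffix)
    for _ in range(max_sweeps):
        improved = False
        for pos in range(len(suffix)):
            for tok in range(vocab_size):
                cand = suffix[:pos] + [tok] + suffix[pos + 1:]
                val = loss(cand)
                if val < best:
                    suffix, best, improved = cand, val, True
        if not improved:
            break  # no single-token swap helps any more
    return suffix


def checkpoint_warm_start(checkpoint_losses: Sequence[Loss], init: Suffix,
                          vocab_size: int) -> Suffix:
    """Attack checkpoints in training order, seeding each search with the
    suffix found against the previous checkpoint."""
    suffix = list(init)
    for loss in checkpoint_losses:
        suffix = greedy_coordinate_search(loss, suffix, vocab_size)
    return suffix


def make_toy_loss(target: Sequence[int]) -> Loss:
    # Hypothetical loss: number of positions that differ from a hidden target.
    return lambda s: float(sum(a != b for a, b in zip(s, target)))


# Three toy "checkpoints" whose optimal suffix drifts a little at each step.
toy_checkpoints = [
    make_toy_loss([1, 1, 0, 0]),
    make_toy_loss([1, 2, 0, 3]),
    make_toy_loss([1, 2, 3, 3]),
]
final_suffix = checkpoint_warm_start(toy_checkpoints, [0, 0, 0, 0], vocab_size=4)
```

In this toy setting the final loss is easy to minimize from any starting point, so the sketch only shows the control flow of the stepping-stone loop, not the benefit the paper reports against real defended models.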
Related papers
- TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models [19.148124494194317]
We propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls. Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy. We demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive.
arXiv Detail & Related papers (2026-03-02T22:19:13Z)
- Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks [0.2291770711277359]
Large Language Models (LLMs) deploy safety mechanisms to prevent harmful outputs, yet these defenses remain vulnerable to adversarial prompts. We introduce the Four-Checkpoint Framework, which organizes safety mechanisms along two dimensions: processing stage (input vs. output) and detection level (literal vs. intent). Using this framework, we evaluate GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across 3,312 single-turn, black-box test cases.
arXiv Detail & Related papers (2026-02-10T10:17:25Z)
- Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models [0.0]
We focus on the prevalent Greedy Coordinate Gradient (GCG) attack and identify a previously underexplored attack axis in jailbreak attacks. Using GCG as a case study, we show that both optimizing attacks to generate prefixes instead of suffixes and varying adversarial token position during evaluation substantially influence attack success rates.
arXiv Detail & Related papers (2026-02-03T08:53:35Z)
- Prototype-Guided Robust Learning against Backdoor Attacks [16.60001324267935]
Backdoor attacks poison the training data to embed a backdoor in the model. We propose Prototype-Guided Robust Learning (PGRL) to be robust against diverse backdoor attacks.
arXiv Detail & Related papers (2025-09-03T14:41:54Z)
- The Surprising Effectiveness of Membership Inference with Simple N-Gram Coverage [71.8564105095189]
We introduce N-Gram Coverage Attack, a membership inference attack that relies solely on text outputs from the target model. We first demonstrate on a diverse set of existing benchmarks that N-Gram Coverage Attack outperforms other black-box methods. We find that more recent models, such as GPT-4o, exhibit increased robustness to membership inference.
arXiv Detail & Related papers (2025-08-13T08:35:16Z)
- Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses [4.706534644850809]
Two primary inference-phase threats are token-level and prompt-level jailbreaks. We propose two hybrid approaches that integrate token- and prompt-level techniques to enhance jailbreak effectiveness across diverse PTLMs.
arXiv Detail & Related papers (2025-06-27T07:26:33Z)
- Benchmarking Misuse Mitigation Against Covert Adversaries [80.74502950627736]
Existing language model safety evaluations focus on overt attacks and low-stakes tasks. We develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses. Our evaluations indicate that decomposition attacks are effective misuse enablers, and highlight stateful defenses as a countermeasure.
arXiv Detail & Related papers (2025-06-06T17:33:33Z)
- Mind the Gap: Detecting Black-box Adversarial Attacks in the Making through Query Update Analysis [3.795071937009966]
Adversarial attacks can jeopardize the integrity of Machine Learning (ML) models. We propose a framework that detects if an adversarial noise instance is being generated. We evaluate our approach against 8 state-of-the-art attacks, including adaptive attacks.
arXiv Detail & Related papers (2025-03-04T20:25:12Z)
- Attack-in-the-Chain: Bootstrapping Large Language Models for Attacks Against Black-box Neural Ranking Models [111.58315434849047]
We introduce a novel ranking attack framework named Attack-in-the-Chain. It tracks interactions between large language models (LLMs) and neural ranking models (NRMs) based on chain-of-thought. Empirical results on two web search benchmarks show the effectiveness of our method.
arXiv Detail & Related papers (2024-12-25T04:03:09Z)
- InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models [7.186499635424984]
Prompt injection attacks pose a critical threat to large language models (LLMs). Prompt guard models, though effective in defense, suffer from over-defense due to trigger word bias. InjecGuard is a novel prompt guard model that incorporates a new training strategy, Mitigating Over-defense for Free.
arXiv Detail & Related papers (2024-10-30T07:39:42Z)
- Enhancing Adversarial Attacks through Chain of Thought [0.0]
Gradient-based adversarial attacks are particularly effective against aligned large language models (LLMs).
This paper proposes enhancing the universality of adversarial attacks by integrating CoT prompts with the greedy coordinate gradient (GCG) technique.
arXiv Detail & Related papers (2024-10-29T06:54:00Z)
- AdvQDet: Detecting Query-Based Adversarial Attacks with Adversarial Contrastive Prompt Tuning [93.77763753231338]
Adversarial Contrastive Prompt Tuning (ACPT) is proposed to fine-tune the CLIP image encoder to extract similar embeddings for any two intermediate adversarial queries.
We show that ACPT can detect 7 state-of-the-art query-based attacks with $>99\%$ detection rate within 5 shots.
We also show that ACPT is robust to 3 types of adaptive attacks.
arXiv Detail & Related papers (2024-08-04T09:53:50Z)
- AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)
- Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks.
We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z)
- Guidance Through Surrogate: Towards a Generic Diagnostic Attack [101.36906370355435]
We develop a guided mechanism to avoid local minima during attack optimization, leading to a novel attack dubbed Guided Projected Gradient Attack (G-PGA).
Our modified attack does not require random restarts, a large number of attack iterations, or a search for an optimal step size.
More than an effective attack, G-PGA can be used as a diagnostic tool to reveal elusive robustness due to gradient masking in adversarial defenses.
arXiv Detail & Related papers (2022-12-30T18:45:23Z)
- Understanding the Vulnerability of Skeleton-based Human Activity Recognition via Black-box Attack [53.032801921915436]
Human Activity Recognition (HAR) has been employed in a wide range of applications, e.g. self-driving cars.
Recently, the robustness of skeleton-based HAR methods has been questioned due to their vulnerability to adversarial attacks.
We show such threats exist, even when the attacker only has access to the input/output of the model.
We propose the very first black-box adversarial attack approach in skeleton-based HAR called BASAR.
arXiv Detail & Related papers (2022-11-21T09:51:28Z)
- Scale-Invariant Adversarial Attack for Evaluating and Enhancing Adversarial Defenses [22.531976474053057]
Projected Gradient Descent (PGD) attack has been demonstrated to be one of the most successful adversarial attacks.
We propose Scale-Invariant Adversarial Attack (SI-PGD), which utilizes the angle between the features in the penultimate layer and the weights in the softmax layer to guide the generation of adversaries.
arXiv Detail & Related papers (2022-01-29T08:40:53Z)
- Detection as Regression: Certified Object Detection by Median Smoothing [50.89591634725045]
This work is motivated by recent progress on certified classification by randomized smoothing.
We obtain the first model-agnostic, training-free, and certified defense for object detection against $\ell_2$-bounded attacks.
arXiv Detail & Related papers (2020-07-07T18:40:19Z)
- A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNNs) based vision systems.
This paper proposes a self-supervised adversarial training mechanism in the input space.
It provides significant robustness against unseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)
- Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks [65.20660287833537]
In this paper we propose two extensions of the PGD-attack overcoming failures due to suboptimal step size and problems of the objective function.
We then combine our novel attacks with two complementary existing ones to form a parameter-free, computationally affordable and user-independent ensemble of attacks to test adversarial robustness.
arXiv Detail & Related papers (2020-03-03T18:15:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.