Early Approaches to Adversarial Fine-Tuning for Prompt Injection Defense: A 2022 Study of GPT-3 and Contemporary Models
- URL: http://arxiv.org/abs/2509.14271v1
- Date: Mon, 15 Sep 2025 19:14:01 GMT
- Title: Early Approaches to Adversarial Fine-Tuning for Prompt Injection Defense: A 2022 Study of GPT-3 and Contemporary Models
- Authors: Gustavo Sandoval, Denys Fenchenko, Junyao Chen,
- Abstract summary: This paper documents early research conducted in 2022 on defending against prompt injection attacks in large language models.<n>We examine how to construct these attacks, test them on various large language models, and compare their effectiveness.<n>We propose and evaluate a novel defense technique called Adversarial Fine-Tuning.
- Score: 2.6947234418203347
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper documents early research conducted in 2022 on defending against prompt injection attacks in large language models, providing historical context for the evolution of this critical security domain. This research focuses on two adversarial attacks against Large Language Models (LLMs): prompt injection and goal hijacking. We examine how to construct these attacks, test them on various LLMs, and compare their effectiveness. We propose and evaluate a novel defense technique called Adversarial Fine-Tuning. Our results show that, without this defense, the attacks succeeded 31\% of the time on GPT-3 series models. When using our Adversarial Fine-Tuning approach, attack success rates were reduced to near zero for smaller GPT-3 variants (Ada, Babbage, Curie), though we note that subsequent research has revealed limitations of fine-tuning-based defenses. We also find that more flexible models exhibit greater vulnerability to these attacks. Consequently, large models such as GPT-3 Davinci are more vulnerable than smaller models like GPT-2. While the specific models tested are now superseded, the core methodology and empirical findings contributed to the foundation of modern prompt injection defense research, including instruction hierarchy systems and constitutional AI approaches.
Related papers
- Securing Large Language Models (LLMs) from Prompt Injection Attacks [0.0]
Large Language Models (LLMs) are increasingly being deployed in real-world applications, but their flexibility exposes them to prompt injection attacks.<n>Recent work has proposed JATMO, a task-specific fine-tuning approach that trains non-instruction-tuned base models to perform a single function.<n>We evaluate the robustness of JATMO against HOUYI, a genetic attack framework that systematically mutates and optimize adversarial prompts.
arXiv Detail & Related papers (2025-12-01T06:34:20Z) - "Give a Positive Review Only": An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI Reviewers [23.25377752659151]
Recent reports have revealed that some papers contain hidden, injected prompts designed to manipulate AI reviewers into providing overly favorable evaluations.<n>We propose two classes of attacks: (1) static attack, which employs a fixed injection prompt, and (2) iterative attack, which optimize the injection prompt against a simulated reviewer model to maximize its effectiveness.<n>Our findings underscore the need for greater attention and rigorous safeguards against prompt-injection threats in AI-assisted peer review.
arXiv Detail & Related papers (2025-11-03T07:04:22Z) - The Surprising Effectiveness of Membership Inference with Simple N-Gram Coverage [71.8564105095189]
We introduce N-Gram Coverage Attack, a membership inference attack that relies solely on text outputs from the target model.<n>We first demonstrate on a diverse set of existing benchmarks that N-Gram Coverage Attack outperforms other black-box methods.<n>We find that more recent models, such as GPT-4o, exhibit increased robustness to membership inference.
arXiv Detail & Related papers (2025-08-13T08:35:16Z) - A Critical Evaluation of Defenses against Prompt Injection Attacks [95.81023801370073]
Large Language Models (LLMs) are vulnerable to prompt injection attacks.<n>Several defenses have recently been proposed, often claiming to mitigate these attacks successfully.<n>We argue that existing studies lack a principled approach to evaluating these defenses.
arXiv Detail & Related papers (2025-05-23T19:39:56Z) - Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in CLIP [51.04452017089568]
Class-wise Backdoor Prompt Tuning (CBPT) is an efficient and effective defense mechanism that operates on text prompts to indirectly purify CLIP.<n>CBPT significantly mitigates backdoor threats while preserving model utility.
arXiv Detail & Related papers (2025-02-26T16:25:15Z) - The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense [56.32083100401117]
The vulnerability of Vision Large Language Models (VLLMs) to jailbreak attacks appears as no surprise.<n>Recent defense mechanisms against these attacks have reached near-saturation performance on benchmark evaluations.
arXiv Detail & Related papers (2024-11-13T07:57:19Z) - Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks [27.11523234556414]
We propose a plug-and-play and easy-to-deploy jailbreak defense framework, namely Prefix Guidance (PG)
PG guides the model to identify harmful prompts by directly setting the first few tokens of the model's output.
We demonstrate the effectiveness of PG across three models and five attack methods.
arXiv Detail & Related papers (2024-08-15T14:51:32Z) - Evaluating Adversarial Robustness: A Comparison Of FGSM, Carlini-Wagner Attacks, And The Role of Distillation as Defense Mechanism [0.0]
The study explores adversarial attacks specifically targeted at Deep Neural Networks (DNNs) utilized for image classification.
The research focuses on comprehending the ramifications of two prominent attack methodologies: the Fast Gradient Sign Method (FGSM) and the Carlini-Wagner (CW) approach.
The study proposes the robustness of defensive distillation as a defense mechanism to counter FGSM and CW attacks.
arXiv Detail & Related papers (2024-04-05T17:51:58Z) - Defense Against Model Extraction Attacks on Recommender Systems [53.127820987326295]
We introduce Gradient-based Ranking Optimization (GRO) to defend against model extraction attacks on recommender systems.
GRO aims to minimize the loss of the protected target model while maximizing the loss of the attacker's surrogate model.
Results show GRO's superior effectiveness in defending against model extraction attacks.
arXiv Detail & Related papers (2023-10-25T03:30:42Z) - Baseline Defenses for Adversarial Attacks Against Aligned Language
Models [109.75753454188705]
Recent work shows that text moderations can produce jailbreaking prompts that bypass defenses.
We look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training.
We find that the weakness of existing discretes for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs.
arXiv Detail & Related papers (2023-09-01T17:59:44Z) - A LLM Assisted Exploitation of AI-Guardian [57.572998144258705]
We evaluate the robustness of AI-Guardian, a recent defense to adversarial examples published at IEEE S&P 2023.
We write none of the code to attack this model, and instead prompt GPT-4 to implement all attack algorithms following our instructions and guidance.
This process was surprisingly effective and efficient, with the language model at times producing code from ambiguous instructions faster than the author of this paper could have done.
arXiv Detail & Related papers (2023-07-20T17:33:25Z) - The Space of Adversarial Strategies [6.295859509997257]
Adversarial examples, inputs designed to induce worst-case behavior in machine learning models, have been extensively studied over the past decade.
We propose a systematic approach to characterize worst-case (i.e., optimal) adversaries.
arXiv Detail & Related papers (2022-09-09T20:53:11Z) - Reliable evaluation of adversarial robustness with an ensemble of
diverse parameter-free attacks [65.20660287833537]
In this paper we propose two extensions of the PGD-attack overcoming failures due to suboptimal step size and problems of the objective function.
We then combine our novel attacks with two complementary existing ones to form a parameter-free, computationally affordable and user-independent ensemble of attacks to test adversarial robustness.
arXiv Detail & Related papers (2020-03-03T18:15:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.