AEGIS : Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema
- URL: http://arxiv.org/abs/2509.00088v1
- Date: Wed, 27 Aug 2025 12:25:45 GMT
- Title: AEGIS : Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema
- Authors: Ting-Chun Liu, Ching-Yu Hsu, Kuan-Yi Lee, Chi-An Fu, Hung-yi Lee,
- Abstract summary: We propose AEGIS, an automated co-Evolutionary framework for Guarding prompt Injections.<n>Both attack and defense prompts are iteratively optimized against each other using a gradient-like natural language prompt optimization technique.<n>We evaluate our system on a real-world assignment grading dataset of prompt injection attacks and demonstrate that our method consistently outperforms existing baselines.
- Score: 39.44407870355891
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prompt injection attacks pose a significant challenge to the safe deployment of Large Language Models (LLMs) in real-world applications. While prompt-based detection offers a lightweight and interpretable defense strategy, its effectiveness has been hindered by the need for manual prompt engineering. To address this issue, we propose AEGIS , an Automated co-Evolutionary framework for Guarding prompt Injections Schema. Both attack and defense prompts are iteratively optimized against each other using a gradient-like natural language prompt optimization technique. This framework enables both attackers and defenders to autonomously evolve via a Textual Gradient Optimization (TGO) module, leveraging feedback from an LLM-guided evaluation loop. We evaluate our system on a real-world assignment grading dataset of prompt injection attacks and demonstrate that our method consistently outperforms existing baselines, achieving superior robustness in both attack success and detection. Specifically, the attack success rate (ASR) reaches 1.0, representing an improvement of 0.26 over the baseline. For detection, the true positive rate (TPR) improves by 0.23 compared to the previous best work, reaching 0.84, and the true negative rate (TNR) remains comparable at 0.89. Ablation studies confirm the importance of co-evolution, gradient buffering, and multi-objective optimization. We also confirm that this framework is effective in different LLMs. Our results highlight the promise of adversarial training as a scalable and effective approach for guarding prompt injections.
Related papers
- Proactive Hardening of LLM Defenses with HASTE [0.614338876867286]
Prompt-based attack techniques are one of the primary challenges in securely deploying and protecting LLM-based AI systems.<n>We present HASTE (Hard-negative Attack Sample Training Engine): a systematic framework that iteratively engineers highly evasive prompts.<n>The framework can be generalized to evaluate prompt-injection detection efficacy, with and without fuzzing, for any hard-negative or hard-positive iteration strategy.
arXiv Detail & Related papers (2026-01-27T00:19:34Z) - Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers [21.207996237794855]
We present the first systematic analysis of poisoning risks in LLM-based prompt optimization.<n>We find systems are substantially more vulnerable to manipulated feedback than to injected queries.<n>We propose a lightweight highlighting defense that reduces the fake-reward $Delta$ASR from 0.23 to 0.07 without degrading utility.
arXiv Detail & Related papers (2025-10-16T07:28:54Z) - Tail-aware Adversarial Attacks: A Distributional Approach to Efficient LLM Jailbreaking [44.8238758047607]
Existing adversarial attacks typically target harmful responses in single-point, greedy generations.<n>We propose a novel framework for adversarial evaluation that explicitly models the entire output distribution, including tail-risks.<n>Our framework also enables us to analyze how different attack algorithms affect output harm distributions.
arXiv Detail & Related papers (2025-07-06T16:13:33Z) - Semantic-Preserving Adversarial Attacks on LLMs: An Adaptive Greedy Binary Search Approach [15.658579092368981]
Large Language Models (LLMs) increasingly rely on automatic prompt engineering in graphical user interfaces (GUIs) to refine user inputs and enhance response accuracy.<n>We propose the Adaptive Greedy Binary Search (AGBS) method, which simulates common prompt optimization mechanisms while preserving semantic stability.
arXiv Detail & Related papers (2025-05-26T15:41:06Z) - AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage.<n>We show that scaling agentic reasoning system at test-time substantially enhances robustness without compromising model utility.<n> Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z) - Unified Prompt Attack Against Text-to-Image Generation Models [30.24530622359188]
We propose UPAM, a framework to evaluate the robustness of T2I models from an attack perspective.<n>UPAM unifies the attack on both textual and visual defenses.<n>It also enables gradient-based optimization, overcoming reliance on enumeration for improved efficiency and effectiveness.
arXiv Detail & Related papers (2025-02-23T03:36:18Z) - Fine-tuned Large Language Models (LLMs): Improved Prompt Injection Attacks Detection [6.269725911814401]
Large language models (LLMs) are becoming a popular tool as they have significantly advanced in their capability to tackle a wide range of language-based tasks.
However, LLMs applications are highly vulnerable to prompt injection attacks, which poses a critical problem.
This project explores the security vulnerabilities in relation to prompt injection attacks.
arXiv Detail & Related papers (2024-10-28T00:36:21Z) - Efficient Adversarial Training in LLMs with Continuous Attacks [99.5882845458567]
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails.
We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses.
C-AdvIPO is an adversarial variant of IPO that does not require utility data for adversarially robust alignment.
arXiv Detail & Related papers (2024-05-24T14:20:09Z) - Learn from the Past: A Proxy Guided Adversarial Defense Framework with
Self Distillation Regularization [53.04697800214848]
Adversarial Training (AT) is pivotal in fortifying the robustness of deep learning models.
AT methods, relying on direct iterative updates for target model's defense, frequently encounter obstacles such as unstable training and catastrophic overfitting.
We present a general proxy guided defense framework, LAST' (bf Learn from the Pbf ast)
arXiv Detail & Related papers (2023-10-19T13:13:41Z) - Model-Agnostic Meta-Attack: Towards Reliable Evaluation of Adversarial
Robustness [53.094682754683255]
We propose a Model-Agnostic Meta-Attack (MAMA) approach to discover stronger attack algorithms automatically.
Our method learns the in adversarial attacks parameterized by a recurrent neural network.
We develop a model-agnostic training algorithm to improve the ability of the learned when attacking unseen defenses.
arXiv Detail & Related papers (2021-10-13T13:54:24Z) - Targeted Physical-World Attention Attack on Deep Learning Models in Road
Sign Recognition [79.50450766097686]
This paper proposes the targeted attention attack (TAA) method for real world road sign attack.
Experimental results validate that the TAA method improves the attack successful rate (nearly 10%) and reduces the perturbation loss (about a quarter) compared with the popular RP2 method.
arXiv Detail & Related papers (2020-10-09T02:31:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.