Learning to Inject: Automated Prompt Injection via Reinforcement Learning
- URL: http://arxiv.org/abs/2602.05746v1
- Date: Thu, 05 Feb 2026 15:14:46 GMT
- Title: Learning to Inject: Automated Prompt Injection via Reinforcement Learning
- Authors: Xin Chen, Jie Zhang, Florian Tramer,
- Abstract summary: We propose a reinforcement learning framework that generates universal, transferable adversarial suffixes.<n>Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks.<n>We successfully compromise frontier systems including GPT 5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash.
- Score: 11.25949760375263
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prompt injection is one of the most critical vulnerabilities in LLM agents; yet, effective automated attacks remain largely unexplored from an optimization perspective. Existing methods heavily depend on human red-teamers and hand-crafted prompts, limiting their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B parameter adversarial suffix generator, we successfully compromise frontier systems including GPT 5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.
Related papers
- MulVul: Retrieval-augmented Multi-Agent Code Vulnerability Detection via Cross-Model Prompt Evolution [28.062506040151153]
Large Language Models (LLMs) struggle to automate real-world vulnerability detection due to two key limitations.<n>The heterogeneity of vulnerability patterns undermines the effectiveness of a single unified model, and manual prompt engineering for massive weakness categories is unscalable.<n>We propose textbfMulVul, a retrieval-augmented multi-agent framework for precise and broad-coverage vulnerability detection.
arXiv Detail & Related papers (2026-01-26T12:43:10Z) - Defense Against Indirect Prompt Injection via Tool Result Parsing [5.69701430275527]
LLM agents face an escalating threat from indirect prompt injection.<n>This vulnerability poses a significant risk as agents gain more direct control over physical environments.<n>We propose a novel method that provides LLMs with precise data via tool result parsing while effectively filtering out injected malicious code.
arXiv Detail & Related papers (2026-01-08T10:21:56Z) - AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts [40.29708628615311]
AutoPrompT is a black-box framework that automatically generates human-readable adversarial suffixes for benign prompts.<n>In this paper, we introduce a dual-evasion strategy in optimization phase, enabling the bypass of both perplexity-based filter and blacklist word filter.<n>Experiments demonstrate the excellent red-teaming performance of our human-readable, filter-resistant adversarial prompts.
arXiv Detail & Related papers (2025-10-28T03:32:14Z) - AutoRed: A Free-form Adversarial Prompt Generation Framework for Automated Red Teaming [58.70941433155648]
AutoRed is a free-form adversarial prompt generation framework that removes the need for seed instructions.<n>We build two red teaming datasets and evaluate eight state-of-the-art Large Language Models.<n>Our results highlight the limitations of seed-based approaches and demonstrate the potential of free-form red teaming for safety evaluation.
arXiv Detail & Related papers (2025-10-09T15:17:28Z) - AEGIS : Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema [39.44407870355891]
We propose AEGIS, an automated co-Evolutionary framework for Guarding prompt Injections.<n>Both attack and defense prompts are iteratively optimized against each other using a gradient-like natural language prompt optimization technique.<n>We evaluate our system on a real-world assignment grading dataset of prompt injection attacks and demonstrate that our method consistently outperforms existing baselines.
arXiv Detail & Related papers (2025-08-27T12:25:45Z) - Robust Anti-Backdoor Instruction Tuning in LVLMs [53.766434746801366]
We introduce a lightweight, certified-agnostic defense framework for large visual language models (LVLMs)<n>Our framework finetunes only adapter modules and text embedding layers under instruction tuning.<n>Experiments against seven attacks on Flickr30k and MSCOCO demonstrate that ours reduces their attack success rate to nearly zero.
arXiv Detail & Related papers (2025-06-04T01:23:35Z) - AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentVigil, to automatically discover and exploit indirect prompt injection vulnerabilities.<n>We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o.<n>We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z) - Manipulating Multimodal Agents via Cross-Modal Prompt Injection [34.35145839873915]
We identify a critical yet previously overlooked security vulnerability in multimodal agents.<n>We propose CrossInject, a novel attack framework in which attackers embed adversarial perturbations across multiple modalities.<n>Our method outperforms state-of-the-art attacks, achieving at least a +30.1% increase in attack success rates.
arXiv Detail & Related papers (2025-04-19T16:28:03Z) - Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models [62.12822290276912]
Auto-RT is a reinforcement learning framework that automatically explores and optimize complex attack strategies.<n>By significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a boarder range of vulnerabilities, achieving a faster detection speed and 16.63% higher success rates compared to existing methods.
arXiv Detail & Related papers (2025-01-03T14:30:14Z) - Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.<n>We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.<n>We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z) - Baseline Defenses for Adversarial Attacks Against Aligned Language
Models [109.75753454188705]
Recent work shows that text moderations can produce jailbreaking prompts that bypass defenses.
We look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training.
We find that the weakness of existing discretes for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs.
arXiv Detail & Related papers (2023-09-01T17:59:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.