Related papers: RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

URL: http://arxiv.org/abs/2510.04885v1
Date: Mon, 06 Oct 2025 15:06:04 GMT
Title: RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection
Authors: Yuxin Wen, Arman Zharmagambetov, Ivan Evtimov, Narine Kokhlikyan, Tom Goldstein, Kamalika Chaudhuri, Chuan Guo,
Abstract summary: We introduce RL-Hammer, a simple recipe for training attacker models that automatically learn to perform strong prompt injections.<n>We propose a set of practical techniques that enable highly effective, universal attacks.<n> RL-Hammer reaches a 98% ASR against GPT-4o and a $72%$ ASR against GPT-5 with the Instruction Hierarchy defense.
Score: 82.41836544860833
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Prompt injection poses a serious threat to the reliability and safety of LLM agents. Recent defenses against prompt injection, such as Instruction Hierarchy and SecAlign, have shown notable robustness against static attacks. However, to more thoroughly evaluate the robustness of these defenses, it is arguably necessary to employ strong attacks such as automated red-teaming. To this end, we introduce RL-Hammer, a simple recipe for training attacker models that automatically learn to perform strong prompt injections and jailbreaks via reinforcement learning. RL-Hammer requires no warm-up data and can be trained entirely from scratch. To achieve high ASRs against industrial-level models with defenses, we propose a set of practical techniques that enable highly effective, universal attacks. Using this pipeline, RL-Hammer reaches a 98% ASR against GPT-4o and a $72\%$ ASR against GPT-5 with the Instruction Hierarchy defense. We further discuss the challenge of achieving high diversity in attacks, highlighting how attacker models tend to reward-hack diversity objectives. Finally, we show that RL-Hammer can evade multiple prompt injection detectors. We hope our work advances automatic red-teaming and motivates the development of stronger, more principled defenses. Code is available at https://github.com/facebookresearch/rl-injector.

Related papers

MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety [28.246225272659917]
This paper introduces textbfMAGIC, a novel multi-turn multi-agent reinforcement learning framework.<n>It formulates Large Language Models safety alignment as an adversarial asymmetric game.<n>Our framework demonstrates superior defense success rates without compromising the helpfulness of the model.
arXiv Detail & Related papers (2026-02-02T02:12:28Z)
The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections [74.60337113759313]
Current defenses against jailbreaks and prompt injections are typically evaluated against a static set of harmful attack strings.<n>We argue that this evaluation process is flawed. Instead, we should evaluate defenses against adaptive attackers who explicitly modify their attack strategy to counter a defense's design.
arXiv Detail & Related papers (2025-10-10T05:51:04Z)
Adversarial Reinforcement Learning for Large Language Model Agent Safety [20.704989548285372]
Large Language Model (LLM) agents can leverage tools like Google Search to complete complex tasks.<n>Current defense strategies rely on fine-tuning LLM agents on datasets of known attacks.<n>We propose Adversarial Reinforcement Learning for Agent Safety (ARLAS), a novel framework that leverages adversarial reinforcement learning (RL) by formulating the problem as a two-player zero-sum game.
arXiv Detail & Related papers (2025-10-06T23:09:18Z)
Active Attacks: Red-teaming LLMs via Adaptive Environments [71.55110023234376]
We address the challenge of generating diverse attack prompts for large language models (LLMs)<n>We introduce textitActive Attacks, a novel RL-based red-teaming algorithm that adapts its attacks as the victim evolves.
arXiv Detail & Related papers (2025-09-26T06:27:00Z)
May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks [14.307668562901263]
A popular class of defenses against prompt injection attacks on large language models (LLMs) relies on fine-tuning the model to separate instructions and data.<n>We evaluate the robustness of this class of prompt injection defenses in the whitebox setting by constructing strong optimization-based attacks.
arXiv Detail & Related papers (2025-07-10T04:20:53Z)
Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models [64.47869632167284]
Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities.<n>This sequential approach creates a mismatch -- attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats.<n>We propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and defender agent co-evolve through continuous interaction.
arXiv Detail & Related papers (2025-06-09T06:35:12Z)
Fight Fire with Fire: Defending Against Malicious RL Fine-Tuning via Reward Neutralization [0.0]
Malicious RL fine-tuning dismantles safety guardrails with remarkable efficiency.<n>Existing defenses targeting supervised fine-tuning prove ineffective.<n>We introduce Reward Neutralization, the first defense framework specifically designed against RL fine-tuning attacks.
arXiv Detail & Related papers (2025-05-07T17:18:48Z)
Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.<n>We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.<n>We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z)
Fixed Points in Cyber Space: Rethinking Optimal Evasion Attacks in the Age of AI-NIDS [70.60975663021952]
We study blackbox adversarial attacks on network classifiers. We argue that attacker-defender fixed points are themselves general-sum games with complex phase transitions. We show that a continual learning approach is required to study attacker-defender dynamics.
arXiv Detail & Related papers (2021-11-23T23:42:16Z)
Stealthy and Efficient Adversarial Attacks against Deep Reinforcement Learning [30.46580767540506]
We introduce two novel adversarial attack techniques to emphstealthily and emphefficiently attack the Deep Reinforcement Learning agents. The first technique is the emphcritical point attack: the adversary builds a model to predict the future environmental states and agent's actions, assesses the damage of each possible attack strategy, and selects the optimal one. The second technique is the emphantagonist attack: the adversary automatically learns a domain-agnostic model to discover the critical moments of attacking the agent in an episode.
arXiv Detail & Related papers (2020-05-14T16:06:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.