Related papers: Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents

Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents

URL: http://arxiv.org/abs/2602.16520v1
Date: Wed, 18 Feb 2026 15:07:09 GMT
Title: Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents
Authors: Doron Shavit,
Abstract summary: We present RLM-JB, an end-to-end jailbreak detection framework built on Recursive Language Models (RLMs)<n>RLM-JB treats detection as a procedure rather than a one-shot classification.<n>On AutoDAN-style adversarial inputs, RLM-JB achieves high detection effectiveness across three LLM backends.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Jailbreak prompts are a practical and evolving threat to large language models (LLMs), particularly in agentic systems that execute tools over untrusted content. Many attacks exploit long-context hiding, semantic camouflage, and lightweight obfuscations that can evade single-pass guardrails. We present RLM-JB, an end-to-end jailbreak detection framework built on Recursive Language Models (RLMs), in which a root model orchestrates a bounded analysis program that transforms the input, queries worker models over covered segments, and aggregates evidence into an auditable decision. RLM-JB treats detection as a procedure rather than a one-shot classification: it normalizes and de-obfuscates suspicious inputs, chunks text to reduce context dilution and guarantee coverage, performs parallel chunk screening, and composes cross-chunk signals to recover split-payload attacks. On AutoDAN-style adversarial inputs, RLM-JB achieves high detection effectiveness across three LLM backends (ASR/Recall 92.5-98.0%) while maintaining very high precision (98.99-100%) and low false positive rates (0.0-2.0%), highlighting a practical sensitivity-specificity trade-off as the screening backend changes.

Related papers

Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models [2.6140509675507384]
We study jailbreaking from both security and interpretability perspectives.<n>We propose a tensor-based latent representation framework that captures structure in hidden activations.<n>Our results provide evidence that jailbreak behavior is rooted in identifiable internal structures.
arXiv Detail & Related papers (2026-02-12T02:43:17Z)
RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area.<n>We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z)
ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification [47.135407245022115]
Existing detection methods mainly detect jailbreak status relying on jailbreak templates present in the training data.<n>We propose a layer-wise, module-wise, and token-wise amplification framework that progressively magnifies internal feature discrepancies between benign and jailbreak prompts.<n>Building upon these insights, we introduce ALERT, an efficient and effective zero-shot jailbreak detector.
arXiv Detail & Related papers (2026-01-07T05:30:53Z)
Immunity memory-based jailbreak detection: multi-agent adaptive guard for large language models [12.772312329709868]
Large language models (LLMs) have become foundational in AI systems, yet they remain vulnerable to adversarial jailbreak attacks.<n>We propose the Multi-Agent Adaptive Guard (MAAG) framework for jailbreak detection.<n>MAAG first extracts activation values from input prompts and compares them to historical activations stored in a memory bank for quick preliminary detection.
arXiv Detail & Related papers (2025-12-03T01:40:40Z)
The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search [58.8834056209347]
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs.<n>We introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model's knowledge base.
arXiv Detail & Related papers (2025-12-01T07:05:23Z)
Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks [0.31984926651189866]
Sentra-Guard is a real-time modular defense system for large language models (LLMs)<n>The framework uses a hybrid architecture with FAISS-indexed SBERT embedding representations that capture the semantic meaning of prompts.<n>It identifies adversarial prompts in both direct and obfuscated attack vectors.
arXiv Detail & Related papers (2025-10-26T11:19:47Z)
Backdoor Collapse: Eliminating Unknown Threats via Known Backdoor Aggregation in Language Models [75.29749026964154]
Ourmethod reduces the average Attack Success Rate to 4.41% across multiple benchmarks.<n>Clean accuracy and utility are preserved within 0.5% of the original model.<n>The defense generalizes across different types of backdoors, confirming its robustness in practical deployment scenarios.
arXiv Detail & Related papers (2025-10-11T15:47:35Z)
STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models [31.35788474507371]
Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks.<n>We present STShield, a lightweight framework for real-time jailbroken judgement.
arXiv Detail & Related papers (2025-03-23T04:23:07Z)
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [50.980446687774645]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability.<n>Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs.<n>It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z)
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [84.96249955105777]
LLM agents may pose a greater risk if misused, but their robustness remains underexplored.<n>We propose a new benchmark called AgentHarm to facilitate research on LLM agent misuse.<n>We find leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking.
arXiv Detail & Related papers (2024-10-11T17:39:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.