Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models
- URL: http://arxiv.org/abs/2510.22085v1
- Date: Fri, 24 Oct 2025 23:53:16 GMT
- Title: Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models
- Authors: Pavlos Ntais
- Abstract summary: Large language models (LLMs) remain vulnerable to sophisticated prompt engineering attacks. We introduce Jailbreak Mimicry, a systematic methodology for training compact attacker models to automatically generate narrative-based jailbreak prompts. Our approach transforms adversarial prompt discovery from manual craftsmanship into a reproducible scientific process.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) remain vulnerable to sophisticated prompt engineering attacks that exploit contextual framing to bypass safety mechanisms, posing significant risks in cybersecurity applications. We introduce Jailbreak Mimicry, a systematic methodology for training compact attacker models to automatically generate narrative-based jailbreak prompts in a one-shot manner. Our approach transforms adversarial prompt discovery from manual craftsmanship into a reproducible scientific process, enabling proactive vulnerability assessment in AI-driven security systems. Developed for the OpenAI GPT-OSS-20B Red-Teaming Challenge, we use parameter-efficient fine-tuning (LoRA) on Mistral-7B with a curated dataset derived from AdvBench, achieving an 81.0% Attack Success Rate (ASR) against GPT-OSS-20B on a held-out test set of 200 items. Cross-model evaluation reveals significant variation in vulnerability patterns: our attacks achieve 66.5% ASR against GPT-4, 79.5% against Llama-3, and 33.0% against Gemini 2.5 Flash, demonstrating both broad applicability and model-specific defensive strengths in cybersecurity contexts. This represents a 54x improvement over direct prompting (1.5% ASR) and demonstrates systematic vulnerabilities in current safety alignment approaches. Our analysis reveals that technical domains (Cybersecurity: 93% ASR) and deception-based attacks (Fraud: 87.8% ASR) are particularly vulnerable, highlighting threats to AI-integrated threat detection, malware analysis, and secure systems, while physical harm categories show greater resistance (55.6% ASR). We employ automated harmfulness evaluation using Claude Sonnet 4, cross-validated with human expert assessment, ensuring reliable and scalable evaluation for cybersecurity red-teaming. Finally, we analyze failure mechanisms and discuss defensive strategies to mitigate these vulnerabilities in AI for cybersecurity.
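The core methodological step above, fine-tuning a compact attacker model with LoRA on AdvBench-derived pairs, can be sketched with Hugging Face transformers, peft, and datasets. The base checkpoint name, dataset file and field names, adapter rank, and training hyperparameters below are illustrative assumptions, not the authors' released configuration.

```python
# Hedged sketch: LoRA fine-tuning of a compact attacker model (Mistral-7B) on
# (harmful request -> narrative jailbreak) pairs. All file names, field names,
# and hyperparameters are assumptions for illustration, not the paper's setup.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE_MODEL = "mistralai/Mistral-7B-v0.1"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)

# Low-rank adapters on the attention projections; rank/alpha are placeholder values.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]))

# Hypothetical JSONL with AdvBench-derived (request, jailbreak) training pairs.
data = load_dataset("json", data_files="advbench_narrative_pairs.jsonl", split="train")

def to_features(example):
    # One-shot format: the attacker model learns to map a plain harmful request
    # to a narrative-framed jailbreak prompt in a single generation.
    text = (f"### Request:\n{example['request']}\n"
            f"### Narrative prompt:\n{example['jailbreak']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=1024)

data = data.map(to_features, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="jailbreak-mimicry-lora",
                           per_device_train_batch_size=2, gradient_accumulation_steps=8,
                           num_train_epochs=3, learning_rate=2e-4, bf16=True,
                           logging_steps=10),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

At evaluation time, ASR is simply the fraction of held-out items whose target-model responses the automated judge (Claude Sonnet 4 in the paper) labels harmful, e.g. 162/200 = 81.0% against GPT-OSS-20B.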
Related papers
- ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks. ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z) - SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations [0.0]
This paper introduces SecureCAI, a novel defense framework extending Constitutional AI principles with security-aware guardrails. SecureCAI reduces attack success rates by 94.7% compared to baseline models.
arXiv Detail & Related papers (2026-01-12T18:59:45Z) - AI Security Beyond Core Domains: Resume Screening as a Case Study of Adversarial Vulnerabilities in Specialized LLM Applications [71.27518152526686]
Large Language Models (LLMs) excel at text comprehension and generation, making them ideal for automated tasks like code review and content moderation. LLMs can be manipulated by "adversarial instructions" hidden in input data, such as resumes or code, causing them to deviate from their intended task. This paper introduces a benchmark to assess this vulnerability in resume screening, revealing attack success rates exceeding 80% for certain attack types.
arXiv Detail & Related papers (2025-12-23T08:42:09Z) - Automated Red-Teaming Framework for Large Language Model Security Assessment: A Comprehensive Attack Generation and Detection System [4.864011355064205]
This paper introduces an automated red-teaming framework that generates, executes, and evaluates adversarial prompts to uncover security vulnerabilities in large language models (LLMs). Our framework integrates meta-prompting-based attack synthesis, multi-modal vulnerability detection, and standardized evaluation protocols spanning six major threat categories. Experiments on the GPT-OSS-20B model reveal 47 distinct vulnerabilities, including 21 high-severity and 12 novel attack patterns.
arXiv Detail & Related papers (2025-12-21T19:12:44Z) - Securing AI Agents Against Prompt Injection Attacks [0.0]
We present a benchmark for evaluating prompt injection risks in RAG-enabled AI agents. Our framework reduces successful attack rates from 73.2% to 8.7% while maintaining 94.3% of baseline task performance.
arXiv Detail & Related papers (2025-11-19T10:00:54Z) - DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z) - NeuroBreak: Unveil Internal Jailbreak Mechanisms in Large Language Models [68.09675063543402]
NeuroBreak is a top-down jailbreak analysis system designed to analyze neuron-level safety mechanisms and mitigate vulnerabilities. By incorporating layer-wise representation probing analysis, NeuroBreak offers a novel perspective on the model's decision-making process. We conduct quantitative evaluations and case studies to verify the effectiveness of our system.
arXiv Detail & Related papers (2025-09-04T08:12:06Z) - From Attack Descriptions to Vulnerabilities: A Sentence Transformer-Based Approach [0.39134914399411086]
This paper evaluates 14 state-of-the-art sentence transformers for automatically identifying vulnerabilities from textual descriptions of attacks. On average, 56% of the vulnerabilities identified by the MMPNet model are also represented within the CVE repository in conjunction with an attack. A manual inspection of the results revealed the existence of 275 predicted links that were not documented in the MITRE repositories.
arXiv Detail & Related papers (2025-09-02T08:27:36Z) - Mind the Gap: Time-of-Check to Time-of-Use Vulnerabilities in LLM-Enabled Agents [4.303444472156151]
Large Language Model (LLM)-enabled agents are rapidly emerging across a wide range of applications. This work presents the first study of time-of-check to time-of-use (TOCTOU) vulnerabilities in LLM-enabled agents. We introduce TOCTOU-Bench, a benchmark with 66 realistic user tasks designed to evaluate this class of vulnerabilities.
arXiv Detail & Related papers (2025-08-23T22:41:49Z) - Bridging AI and Software Security: A Comparative Vulnerability Assessment of LLM Agent Deployment Paradigms [1.03121181235382]
Large Language Model (LLM) agents face security vulnerabilities spanning AI-specific and traditional software domains. This study bridges this gap through comparative evaluation of Function Calling architecture and Model Context Protocol (MCP) deployment paradigms. We tested 3,250 attack scenarios across seven language models, evaluating simple, composed, and chained attacks targeting both AI-specific threats and software vulnerabilities.
arXiv Detail & Related papers (2025-07-08T18:24:28Z) - AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models [0.0]
Large Language Models (LLMs) continue to exhibit vulnerabilities to jailbreaking attacks. We present AutoAdv, a novel framework that automates adversarial prompt generation. We show that our attacks achieve jailbreak success rates of up to 86% for harmful content generation.
arXiv Detail & Related papers (2025-04-18T08:38:56Z) - Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [50.40122190627256]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z) - FaultGuard: A Generative Approach to Resilient Fault Prediction in Smart Electrical Grids [53.2306792009435]
FaultGuard is the first framework for fault type and zone classification resilient to adversarial attacks.
We propose a low-complexity fault prediction model and an online adversarial training technique to enhance robustness.
Our model outclasses the state-of-the-art for resilient fault prediction benchmarking, with an accuracy of up to 0.958.
arXiv Detail & Related papers (2024-03-26T08:51:23Z) - ASSERT: Automated Safety Scenario Red Teaming for Evaluating the
Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios, and absolute error rates of up to 19% in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.