Related papers: Beyond Fixed and Dynamic Prompts: Embedded Jailbreak Templates for Advancing LLM Security

Beyond Fixed and Dynamic Prompts: Embedded Jailbreak Templates for Advancing LLM Security

URL: http://arxiv.org/abs/2511.14140v1
Date: Tue, 18 Nov 2025 04:59:10 GMT
Title: Beyond Fixed and Dynamic Prompts: Embedded Jailbreak Templates for Advancing LLM Security
Authors: Hajun Kim, Hyunsik Na, Daeseon Choi,
Abstract summary: This paper introduces the Embedded Jailbreak template, which preserves the structure of existing templates while naturally embedding harmful queries within their context.<n>We propose a progressive prompt-engineering methodology to ensure template quality and consistency, alongside standardized protocols for generation and evaluation.
Score: 5.187020963919454
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As the use of large language models (LLMs) continues to expand, ensuring their safety and robustness has become a critical challenge. In particular, jailbreak attacks that bypass built-in safety mechanisms are increasingly recognized as a tangible threat across industries, driving the need for diverse templates to support red-teaming efforts and strengthen defensive techniques. However, current approaches predominantly rely on two limited strategies: (i) substituting harmful queries into fixed templates, and (ii) having the LLM generate entire templates, which often compromises intent clarity and reproductibility. To address this gap, this paper introduces the Embedded Jailbreak Template, which preserves the structure of existing templates while naturally embedding harmful queries within their context. We further propose a progressive prompt-engineering methodology to ensure template quality and consistency, alongside standardized protocols for generation and evaluation. Together, these contributions provide a benchmark that more accurately reflects real-world usage scenarios and harmful intent, facilitating its application in red-teaming and policy regression testing.

Related papers

SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models [67.84174763413178]
We introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection.<n>We show that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks.
arXiv Detail & Related papers (2026-01-13T15:01:38Z)
SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models [17.94525181892254]
Large Language Models (LLMs) have rapidly become integral to real-world applications, powering services across diverse sectors.<n>Their widespread deployment has exposed critical security risks, particularly through jailbreak prompts that can bypass model alignment and induce harmful outputs.<n>Despite intense research into both attack and defense techniques, the field remains fragmented: definitions, threat models, and evaluation criteria vary widely, impeding systematic progress and fair comparison.<n>Our work unifies fragmented research, provides rigorous foundations for future studies, and supports the development of robust, trustworthy LLMs suitable for high-stakes deployment.
arXiv Detail & Related papers (2025-10-17T09:38:54Z)
Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs)<n>SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO)<n>We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z)
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs [39.85609149662187]
We present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs.<n>Our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs.<n>Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models.
arXiv Detail & Related papers (2025-07-15T08:44:46Z)
T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models [88.63040835652902]
Text to video models are vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content.<n>We propose T2VShield, a comprehensive and model agnostic defense framework designed to protect text to video models from jailbreak threats.<n>Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses.
arXiv Detail & Related papers (2025-04-22T01:18:42Z)
MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models.<n>It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines.<n>We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region [13.962617572588393]
We show that template-anchored safety alignment is widespread across various aligned large language models (LLMs)<n>Our mechanistic analyses demonstrate how it leads to models' susceptibility when encountering inference-time jailbreak attacks.<n>We show that detaching safety mechanisms from the template region is promising in mitigating vulnerabilities to jailbreak attacks.
arXiv Detail & Related papers (2025-02-19T18:42:45Z)
Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking [34.479355499938116]
Large Language Models (LLMs) have led to significant applications but also introduced serious security threats.<n>We introduce a black-box attack framework called AttackPrefixTree (APT)<n>APT exploits structured output interfaces to dynamically construct attack patterns.<n> Experiments on benchmark datasets indicate that this approach achieves higher attack success rate than existing methods.
arXiv Detail & Related papers (2025-02-19T08:29:36Z)
Model-Editing-Based Jailbreak against Safety-aligned Large Language Models [13.887770576598646]
Large Language Models (LLMs) have transformed numerous fields by enabling advanced natural language interactions.<n>This paper presents Targeted Model Editing (TME), a novel white-box approach that bypasses safety filters.<n>TME identifies and removes safety-critical transformations (SCTs) embedded in model matrices, enabling malicious queries to bypass restrictions.
arXiv Detail & Related papers (2024-12-11T08:44:15Z)
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs)<n>Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.<n> Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks. Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.