DecipherGuard: Understanding and Deciphering Jailbreak Prompts for a Safer Deployment of Intelligent Software Systems
- URL: http://arxiv.org/abs/2509.16870v1
- Date: Sun, 21 Sep 2025 01:46:35 GMT
- Title: DecipherGuard: Understanding and Deciphering Jailbreak Prompts for a Safer Deployment of Intelligent Software Systems
- Authors: Rui Yang, Michael Fu, Chakkrit Tantithamthavorn, Chetan Arora, Gunel Gulmammadova, Joey Chua
- Abstract summary: DecipherGuard is a novel framework that integrates a deciphering layer to counter obfuscation-based prompts and a low-rank adaptation mechanism to enhance guardrail effectiveness against jailbreak attacks. Empirical evaluation on over 22,000 prompts demonstrates that DecipherGuard improves DSR by 36% to 65% and Overall Guardrail Performance (OGP) by 20% to 50% compared to LlamaGuard and two other runtime guardrails.
- Score: 11.606665113249298
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Intelligent software systems powered by Large Language Models (LLMs) are increasingly deployed in critical sectors, raising concerns about their safety at runtime. Through an industry-academic collaboration deploying an LLM-powered virtual customer assistant, a critical software engineering challenge emerged: how can LLM-powered software systems be deployed more safely at runtime? While LlamaGuard, the current state-of-the-art runtime guardrail, offers protection against unsafe inputs, our study reveals that its Defense Success Rate (DSR) drops by 24% under obfuscation- and template-based jailbreak attacks. In this paper, we propose DecipherGuard, a novel framework that integrates a deciphering layer to counter obfuscation-based prompts and a low-rank adaptation mechanism to enhance guardrail effectiveness against template-based attacks. Empirical evaluation on over 22,000 prompts demonstrates that DecipherGuard improves DSR by 36% to 65% and Overall Guardrail Performance (OGP) by 20% to 50% compared to LlamaGuard and two other runtime guardrails. These results highlight the effectiveness of DecipherGuard in defending LLM-powered software systems against jailbreak attacks at runtime.
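The abstract sketches a two-stage design: a deciphering layer that normalizes obfuscated input before a low-rank-adapted (LoRA) guardrail classifier scores it. A minimal sketch of that flow appears below; the obfuscation set (Base64, ROT13), the `classifier.predict` interface, and the adapter-loading note are illustrative assumptions, not details published in the paper.

```python
# Sketch of a decipher-then-classify guardrail pipeline, assuming a
# small set of common obfuscation encodings. All names are hypothetical.
import base64
import codecs


def candidate_decodings(prompt: str):
    """Yield the raw prompt plus best-effort reversals of common encodings."""
    yield prompt
    try:
        # validate=True rejects non-Base64 characters instead of skipping them.
        yield base64.b64decode(prompt, validate=True).decode("utf-8")
    except Exception:
        pass  # Not valid Base64 (or not UTF-8 underneath); skip this decoding.
    yield codecs.decode(prompt, "rot_13")  # ROT13 is its own inverse.


def guard(prompt: str, classifier) -> bool:
    """Block if any plausible decoding of the prompt is classified unsafe."""
    # `classifier` stands in for a safety model fine-tuned with low-rank
    # adaptation (LoRA), e.g. a LlamaGuard-style model loaded through
    # HuggingFace PEFT: PeftModel.from_pretrained(base_model, adapter_path).
    return any(classifier.predict(p) == "unsafe" for p in candidate_decodings(prompt))
```

Scoring every candidate decoding, rather than guessing the single "right" one, keeps a failed decode from masking an unsafe payload: the worst-case verdict across decodings wins.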
Related papers
- Bypassing Prompt Guards in Production with Controlled-Release Prompting [11.65770031195044]
We introduce a new attack that circumvents prompt guards, highlighting their limitations. Our method consistently jailbreaks production models while maintaining response quality. This reveals an attack surface inherent to lightweight prompt guards in modern LLM architectures.
arXiv Detail & Related papers (2025-10-02T00:04:21Z)
- AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software [11.606665113249298]
Guardrails are critical for the safe deployment of Large Language Model (LLM)-powered software. We propose AdaptiveGuard, an adaptive guardrail that detects novel jailbreak attacks as out-of-distribution (OOD) inputs; a minimal sketch of this OOD-gating idea appears after the related-papers list below. We show that AdaptiveGuard achieves 96% OOD detection accuracy, adapts to new attacks in just two update steps, and retains over 85% F1-score on in-distribution data post-adaptation.
arXiv Detail & Related papers (2025-09-21T01:22:42Z)
- ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning [64.32925552574115]
ARMOR is a large language model that analyzes jailbreak strategies and extracts the core intent. ARMOR achieves state-of-the-art safety performance, with an average harmful rate of 0.002 and an attack success rate of 0.06 against advanced optimization-based jailbreaks.
arXiv Detail & Related papers (2025-07-14T09:05:54Z)
- SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems [9.469589800082597]
This paper introduces SEALGuard, a multilingual guardrail designed to improve safety alignment across diverse languages. It aims to address the multilingual safety alignment gap of existing guardrails and ensure effective filtering of unsafe and jailbreak prompts. We construct SEALSBench, a large-scale multilingual safety alignment dataset containing over 260,000 prompts in ten languages.
arXiv Detail & Related papers (2025-07-11T05:15:35Z)
- LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges [70.85114705489222]
We propose MalwareBench, a benchmark dataset containing 3,520 jailbreaking prompts for malicious code generation. MalwareBench is based on 320 manually crafted malicious code generation requirements, covering 11 jailbreak methods and 29 code functionality categories. Experiments show that mainstream LLMs exhibit limited ability to reject malicious code-generation requirements, and the combination of multiple jailbreak methods further reduces the model's security capabilities.
arXiv Detail & Related papers (2025-06-09T12:02:39Z)
- Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement [48.50995874445193]
Large Language Models (LLMs) have shown impressive capabilities across various tasks but remain vulnerable to meticulously crafted jailbreak attacks. We propose SAGE (Self-Aware Guard Enhancement), a training-free defense strategy designed to align LLMs' strong safety discrimination performance with their relatively weaker safety generation ability.
arXiv Detail & Related papers (2025-05-17T15:54:52Z)
- Bypassing LLM Guardrails: An Empirical Analysis of Evasion Attacks against Prompt Injection and Jailbreak Detection Systems [4.225223514207515]
Guardrail systems for Large Language Models (LLMs) are designed to protect against prompt injection and jailbreak attacks. We demonstrate two approaches for bypassing prompt injection and jailbreak detection systems. We show that both methods can be used to evade detection while maintaining adversarial utility.
arXiv Detail & Related papers (2025-04-15T13:16:02Z)
- CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment [66.72332011814183]
CoreGuard is a computation- and communication-efficient protection method for proprietary large language models (LLMs) deployed on edge devices. CoreGuard employs an efficient protection protocol to reduce computational overhead and minimizes communication overhead via a propagation protocol.
arXiv Detail & Related papers (2024-10-16T08:14:24Z)
- MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks [2.873719680183099]
This paper highlights the importance of preventing jailbreak attacks on Large Language Models (LLMs).
We introduce MoJE, a novel guardrail architecture designed to surpass the limitations of existing state-of-the-art guardrails.
MoJE excels in detecting jailbreak attacks while maintaining minimal computational overhead during model inference.
arXiv Detail & Related papers (2024-09-26T10:12:19Z)
- Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs). Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Empirical results on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
- Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
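As referenced in the AdaptiveGuard entry above, a guardrail can treat novel jailbreaks as out-of-distribution (OOD) inputs. The sketch below uses a Mahalanobis-distance score over prompt embeddings, which is a common OOD heuristic and an assumption here, not AdaptiveGuard's published method.

```python
# Hypothetical OOD gate: fit a Gaussian to embeddings of known prompts,
# then flag prompts whose embeddings sit far from that distribution.
import numpy as np


class OODGate:
    def __init__(self, train_embeddings: np.ndarray):
        """train_embeddings: (n_samples, dim) embeddings of in-distribution prompts."""
        self.mean = train_embeddings.mean(axis=0)
        cov = np.cov(train_embeddings, rowvar=False)
        # Small ridge term keeps the covariance invertible for small samples.
        self.prec = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))

    def score(self, emb: np.ndarray) -> float:
        """Mahalanobis distance: larger means further from the training data."""
        d = emb - self.mean
        return float(np.sqrt(d @ self.prec @ d))

    def is_ood(self, emb: np.ndarray, threshold: float = 3.0) -> bool:
        return self.score(emb) > threshold
```

Prompts flagged as OOD would be routed to a conservative fallback policy and queued as training signal for the few-step adapter update the entry describes.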