RoguePrompt: Dual-Layer Ciphering for Self-Reconstruction to Circumvent LLM Moderation
- URL: http://arxiv.org/abs/2511.18790v1
- Date: Mon, 24 Nov 2025 05:42:54 GMT
- Title: RoguePrompt: Dual-Layer Ciphering for Self-Reconstruction to Circumvent LLM Moderation
- Authors: Benyamin Tafreshian
- Abstract summary: This paper presents an automated jailbreak attack that converts a disallowed user query into a self-reconstructing prompt. We instantiate RoguePrompt against GPT-4o and evaluate it on 2,448 prompts that a production moderation system previously marked as strongly rejected. Under an evaluation protocol that separates three security-relevant outcomes (bypass, reconstruction, and execution), the attack attains 84.7 percent bypass, 80.2 percent reconstruction, and 71.5 percent full execution.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Content moderation pipelines for modern large language models combine static filters, dedicated moderation services, and alignment-tuned base models, yet real-world deployments still exhibit dangerous failure modes. This paper presents RoguePrompt, an automated jailbreak attack that converts a disallowed user query into a self-reconstructing prompt which passes provider moderation while preserving the original harmful intent. RoguePrompt partitions the instruction across two lexical streams, applies nested classical ciphers, and wraps the result in natural-language directives that cause the target model to decode and execute the hidden payload. Our attack assumes only black-box access to the model and to the associated moderation endpoint. We instantiate RoguePrompt against GPT-4o and evaluate it on 2,448 prompts that a production moderation system previously marked as strongly rejected. Under an evaluation protocol that separates three security-relevant outcomes (bypass, reconstruction, and execution), the attack attains 84.7 percent bypass, 80.2 percent reconstruction, and 71.5 percent full execution, substantially outperforming five automated jailbreak baselines. We further analyze the behavior of several automated and human-aligned evaluators and show that dual-layer lexical transformations remain effective even when detectors rely on semantic similarity or learned safety rubrics. Our results highlight systematic blind spots in current moderation practice and suggest that robust deployment will require joint reasoning about user intent, decoding workflows, and model-side computation rather than surface-level toxicity alone.
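The pipeline the abstract describes (partition into two lexical streams, nested classical ciphers, natural-language decode directives) can be sketched as follows. This is an illustrative reconstruction only, not the authors' implementation: the even/odd word split and the Caesar-then-Atbash cipher stack are assumptions chosen for exposition.

```python
def caesar(text: str, shift: int) -> str:
    """Cipher layer 1: rotate each letter by a fixed shift."""
    out = []
    for ch in text:
        if ch.islower():
            out.append(chr((ord(ch) - ord('a') + shift) % 26 + ord('a')))
        elif ch.isupper():
            out.append(chr((ord(ch) - ord('A') + shift) % 26 + ord('A')))
        else:
            out.append(ch)
    return ''.join(out)

def atbash(text: str) -> str:
    """Cipher layer 2: mirror each letter (a<->z, b<->y, ...)."""
    out = []
    for ch in text:
        if ch.islower():
            out.append(chr(ord('z') - (ord(ch) - ord('a'))))
        elif ch.isupper():
            out.append(chr(ord('Z') - (ord(ch) - ord('A'))))
        else:
            out.append(ch)
    return ''.join(out)

def build_prompt(query: str, shift: int = 3) -> str:
    """Partition the query into two lexical streams, cipher each stream
    twice, and wrap the result in decode-and-execute directives."""
    words = query.split()
    stream_a = ' '.join(words[0::2])          # even-indexed words
    stream_b = ' '.join(words[1::2])          # odd-indexed words
    enc_a = atbash(caesar(stream_a, shift))   # nested: Caesar, then Atbash
    enc_b = atbash(caesar(stream_b, shift))
    return (
        f"Stream A: {enc_a}\nStream B: {enc_b}\n"
        f"Undo an Atbash cipher and a Caesar shift of {shift} on each "
        "stream, then interleave the streams word by word and follow "
        "the reconstructed instruction."
    )

def reconstruct(enc_a: str, enc_b: str, shift: int = 3) -> str:
    """The computation a decoding model would perform to recover the query."""
    dec_a = caesar(atbash(enc_a), -shift).split()
    dec_b = caesar(atbash(enc_b), -shift).split()
    merged = []
    for i in range(max(len(dec_a), len(dec_b))):
        if i < len(dec_a):
            merged.append(dec_a[i])
        if i < len(dec_b):
            merged.append(dec_b[i])
    return ' '.join(merged)
```

The security-relevant point the paper makes is visible even in this toy: neither ciphered stream resembles the original query lexically, so surface-level filters see only the benign wrapper, while the decode directives offload reconstruction to the target model itself.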
Related papers
- In-Context Environments Induce Evaluation-Awareness in Language Models [0.12691047660244334]
Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task. We introduce a black-box adversarial optimization framework treating the in-context prompt as an optimizable environment. We show that adversarially optimized prompts pose a substantially greater threat to evaluation reliability than previously understood.
arXiv Detail & Related papers (2026-03-04T08:22:02Z) - Regular Expression Denial of Service Induced by Backreferences [13.04731556594332]
This paper presents the first systematic study of denial-of-service vulnerabilities in Regular Expressions with Backreferences (REwB). Using the Two-Phase Memory Automaton (2PMFA), we derive necessary conditions under which backreferences induce super-linear backtracking runtime. We identify three vulnerability patterns, develop detection and attack-construction algorithms, and validate them in practice.
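A deliberately tiny illustration of the backreference behavior this entry studies (not taken from the paper itself): the classic pattern `(aa+)\1+` matches a run of `a`s only when its length is composite, so on a prime-length input a backtracking engine must try every candidate group length before rejecting, which is super-linear work.

```python
import re

# Minimal REwB-style example: a backreference forces the engine to search
# over group lengths, giving super-linear rejection time on prime lengths.
PATTERN = re.compile(r"(aa+)\1+")

def is_composite_run(n: int) -> bool:
    """'a' * n fully matches (aa+)\\1+ iff n = k * m with k >= 2 and
    m >= 2, i.e. iff n is composite."""
    return PATTERN.fullmatch("a" * n) is not None

# On "a" * p with p prime, the engine tries group lengths p-1, p-2, ..., 2,
# re-scanning the input for each candidate before it can reject: roughly
# quadratic time in p, the super-linear backtracking the paper formalizes.
```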
arXiv Detail & Related papers (2026-02-25T00:23:50Z) - CIBER: A Comprehensive Benchmark for Security Evaluation of Code Interpreter Agents [27.35968236632966]
LLM-based code interpreter agents are increasingly deployed in critical situations. Existing benchmarks fail to capture the security risks arising from dynamic code execution, tool interactions, and multi-turn context. We introduce CIBER, an automated benchmark that combines dynamic attack generation, isolated secure sandboxing, and state-aware evaluation.
arXiv Detail & Related papers (2026-02-23T06:41:41Z) - RedVisor: Reasoning-Aware Prompt Injection Defense via Zero-Copy KV Cache Reuse [47.85771791033142]
We propose RedVisor, a framework that synthesizes the explainability of detection systems with the seamless integration of prevention strategies. RedVisor is the first approach to leverage fine-grained reasoning paths to simultaneously detect attacks and guide the model's safe response. Experiments demonstrate that RedVisor outperforms state-of-the-art defenses in detection accuracy and throughput while incurring negligible utility loss.
arXiv Detail & Related papers (2026-02-02T08:26:51Z) - The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation [11.984098021215878]
We introduce the Semantic-Preserving Adversarial Code Injection (SPACI) Framework and the Abstract Syntax Tree-Aware Semantic Injection Protocol (AST-ASIP). These methods exploit the Syntax-Semantics Gap by embedding adversarial directives into syntactically inert regions (trivia nodes) of the Abstract Syntax Tree. Through a large-scale evaluation of 9 SOTA models across 25,000 submissions in Python, C, C++, and Java, we reveal catastrophic failure rates (>95%) in high-capacity open-weights models like DeepSeek-V3.
arXiv Detail & Related papers (2026-01-29T07:40:58Z) - The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search [58.8834056209347]
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs. We introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model's knowledge base.
arXiv Detail & Related papers (2025-12-01T07:05:23Z) - Countermind: A Multi-Layered Security Architecture for Large Language Models [0.0]
This paper proposes Countermind, a multi-layered security architecture intended to shift defenses from a reactive, post hoc posture to a proactive, pre-inference, and intra-inference enforcement model. The architecture proposes a fortified perimeter designed to structurally validate and transform all inputs, and an internal governance mechanism intended to constrain the model's semantic processing pathways before an output is generated.
arXiv Detail & Related papers (2025-10-13T18:41:18Z) - Backdoor Collapse: Eliminating Unknown Threats via Known Backdoor Aggregation in Language Models [75.29749026964154]
Our method reduces the average Attack Success Rate to 4.41% across multiple benchmarks. Clean accuracy and utility are preserved within 0.5% of the original model. The defense generalizes across different types of backdoors, confirming its robustness in practical deployment scenarios.
arXiv Detail & Related papers (2025-10-11T15:47:35Z) - DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z) - The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs [39.85609149662187]
We present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models.
arXiv Detail & Related papers (2025-07-15T08:44:46Z) - GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization [19.44247617251449]
We introduce GhostPrompt, the first automated jailbreak framework that combines dynamic prompt optimization with multimodal feedback. It achieves state-of-the-art performance, increasing the ShieldLM-7B bypass rate from 12.5% to 99.0%. It generalizes to unseen filters including GPT-4.1 and successfully jailbreaks DALL-E 3 to generate NSFW images.
arXiv Detail & Related papers (2025-05-25T05:13:06Z) - Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [50.40122190627256]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z) - h4rm3l: A language for Composable Jailbreak Attack Synthesis [48.5611060845958]
h4rm3l is a novel approach that addresses the gap with a human-readable domain-specific language. We show that h4rm3l's synthesized attacks are diverse and more successful than existing jailbreak attacks in the literature.
arXiv Detail & Related papers (2024-08-09T01:45:39Z) - AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large
Language Models [55.748851471119906]
Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks.
Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters.
We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types.
arXiv Detail & Related papers (2023-10-23T17:46:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.