A Real-Time, Self-Tuning Moderator Framework for Adversarial Prompt Detection
- URL: http://arxiv.org/abs/2508.07139v1
- Date: Sun, 10 Aug 2025 01:59:07 GMT
- Title: A Real-Time, Self-Tuning Moderator Framework for Adversarial Prompt Detection
- Authors: Ivan Zhang,
- Abstract summary: We introduce a real-time, self-tuning (RTST) moderator framework to defend against adversarial attacks while maintaining a lightweight training footprint.<n>We empirically evaluate its effectiveness using Google's Gemini models against modern, effective jailbreaks.
- Score: 0.1813006808606333
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensuring LLM alignment is critical to information security as AI models become increasingly widespread and integrated in society. Unfortunately, many defenses against adversarial attacks and jailbreaking on LLMs cannot adapt quickly to new attacks, degrade model responses to benign prompts, or introduce significant barriers to scalable implementation. To mitigate these challenges, we introduce a real-time, self-tuning (RTST) moderator framework to defend against adversarial attacks while maintaining a lightweight training footprint. We empirically evaluate its effectiveness using Google's Gemini models against modern, effective jailbreaks. Our results demonstrate the advantages of an adaptive, minimally intrusive framework for jailbreak defense over traditional fine-tuning or classifier models.
Related papers
- Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs)<n>SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO)<n>We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z) - DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification [18.006622965818856]
We introduce DETAM, a finetuning-free defense approach that improves the defensive capabilities against jailbreak attacks of LLMs.<n>Specifically, we analyze the differences in attention scores between successful and unsuccessful defenses to identify the attention heads sensitive to jailbreak attacks.<n>During inference, we reallocate attention to emphasize the user's core intention, minimizing interference from attack tokens.
arXiv Detail & Related papers (2025-04-18T09:02:12Z) - LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution [84.2846064139183]
Large Language Models (LLMs) face threats from jailbreak prompts.<n>We propose LightDefense, a lightweight defense mechanism targeted at white-box models.
arXiv Detail & Related papers (2025-04-02T09:21:26Z) - STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models [31.35788474507371]
Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks.<n>We present STShield, a lightweight framework for real-time jailbroken judgement.
arXiv Detail & Related papers (2025-03-23T04:23:07Z) - DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing [62.43110639295449]
Large Language Models (LLMs) are widely applied in decision making, but their deployment is threatened by jailbreak attacks.<n>Delman is a novel approach leveraging direct model editing for precise, dynamic protection against jailbreak attacks.<n>Delman directly updates a minimal set of relevant parameters to neutralize harmful behaviors while preserving the model's utility.
arXiv Detail & Related papers (2025-02-17T10:39:21Z) - ShieldLearner: A New Paradigm for Jailbreak Attack Defense in LLMs [4.534938642552179]
ShieldLearner is a novel paradigm that mimics human learning in defense.<n>Through trial and error, it autonomously distills attack signatures into a Pattern Atlas.<n> Adaptive Adversarial Augmentation generates adversarial variations of successfully defended prompts.
arXiv Detail & Related papers (2025-02-16T18:47:41Z) - Adversarial Reasoning at Jailbreaking Time [49.70772424278124]
Large language models (LLMs) are becoming more capable and widespread.<n>Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks.<n>In this paper, we apply these advances to the task of model jailbreaking: eliciting harmful responses from aligned LLMs.
arXiv Detail & Related papers (2025-02-03T18:59:01Z) - Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks [23.793583584784685]
Large language models (LLMs) are susceptible to jailbreak attacks, which exploit system vulnerabilities to circumvent safety measures and elicit harmful or inappropriate outputs.<n>We introduce LATPC, a Latent-space Adrial Training with Post-aware framework.<n>LATPC identifies safety-critical latent dimensions by contrasting harmful and benign inputs, enabling the adaptive construction of targeted refusal feature removal attacks.
arXiv Detail & Related papers (2025-01-18T02:57:12Z) - Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment [97.38766396447369]
Despite training-time safety alignment, Multimodal Large Language Models (MLLMs) remain vulnerable to jailbreak attacks.<n>We propose Immune, an inference-time defense framework that leverages a safe reward model through controlled decoding to defend against jailbreak attacks.
arXiv Detail & Related papers (2024-11-27T19:00:10Z) - Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs)<n>Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.<n> Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.