Efficient Jailbreak Mitigation Using Semantic Linear Classification in a Multi-Staged Pipeline
- URL: http://arxiv.org/abs/2512.19011v1
- Date: Mon, 22 Dec 2025 04:00:35 GMT
- Title: Efficient Jailbreak Mitigation Using Semantic Linear Classification in a Multi-Staged Pipeline
- Authors: Akshaj Prashanth Rao, Advait Singh, Saumya Kumaar Saksena, Dhruv Kumar
- Abstract summary: Prompt injection and jailbreaking attacks pose persistent security challenges to large language model (LLM)-based systems. We present an efficient and systematically evaluated defense architecture that mitigates these threats through a lightweight, multi-stage pipeline.
- Score: 1.2802720336459552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompt injection and jailbreaking attacks pose persistent security challenges to large language model (LLM)-based systems. We present an efficient and systematically evaluated defense architecture that mitigates these threats through a lightweight, multi-stage pipeline. Its core component is a semantic filter based on text normalization, TF-IDF representations, and a Linear SVM classifier. Despite its simplicity, this module achieves 93.4% accuracy and 96.5% specificity on held-out data, substantially reducing attack throughput while incurring negligible computational overhead. Building on this efficient foundation, the full pipeline integrates complementary detection and mitigation mechanisms that operate at successive stages, providing strong robustness with minimal latency. In comparative experiments, our SVM-based configuration improves overall accuracy from 35.1% to 93.4% while reducing average time to completion from approximately 450s to 47s, yielding over 10 times lower latency than ShieldGemma. These results demonstrate that the proposed design simultaneously advances defensive precision and efficiency, addressing a core limitation of current model-based moderators. Evaluation across a curated corpus of over 30,000 labeled prompts, including benign, jailbreak, and application-layer injections, confirms that staged, resource-efficient defenses can robustly secure modern LLM-driven applications.
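The abstract describes the core semantic filter as text normalization followed by TF-IDF features and a Linear SVM classifier. The sketch below shows how such a stage could be assembled with scikit-learn; the normalization rules, n-gram range, regularization strength, and data fields are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal sketch of a first-stage semantic filter in the spirit of the paper:
# text normalization -> TF-IDF features -> Linear SVM classifier.
# Hyperparameters and preprocessing choices are assumptions for illustration.
import re

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score


def normalize(text: str) -> str:
    """Lowercase, strip non-alphanumeric noise, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def build_filter() -> Pipeline:
    return Pipeline([
        ("tfidf", TfidfVectorizer(
            preprocessor=normalize,
            ngram_range=(1, 2),        # unigrams + bigrams (assumed)
            max_features=50_000,
        )),
        ("svm", LinearSVC(C=1.0)),     # linear decision boundary, fast inference
    ])


# prompts: list[str]; labels: 1 = jailbreak/injection, 0 = benign (hypothetical data)
def train_and_eval(prompts, labels):
    X_tr, X_te, y_tr, y_te = train_test_split(
        prompts, labels, test_size=0.2, stratify=labels, random_state=0)
    clf = build_filter().fit(X_tr, y_tr)
    preds = clf.predict(X_te)
    specificity = recall_score(y_te, preds, pos_label=0)  # true-negative rate
    return accuracy_score(y_te, preds), specificity
```

Because TF-IDF featurization and a linear decision function reduce to sparse dot products, per-prompt inference cost is negligible, which is consistent with the latency gap the abstract reports against model-based moderators such as ShieldGemma.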
Related papers
- AWE: Adaptive Agents for Dynamic Web Penetration Testing [0.0]
AWE is a memory-augmented multi-agent framework for autonomous web penetration testing. It embeds structured, vulnerability-specific analysis pipelines within a lightweight LLM orchestration layer. AWE achieves substantial gains on injection-class vulnerabilities.
arXiv Detail & Related papers (2026-03-01T07:32:42Z)
- Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs [2.2448294058653455]
Adversarial prompts exploit indirect input channels such as emails or user-generated content to circumvent alignment safeguards. We propose Zero-Shot Embedding Drift Detection (ZEDD), a lightweight, low-engineering-overhead framework that identifies both direct and indirect prompt injection attempts. ZEDD operates without requiring access to model internals, prior knowledge of attack types, or task-specific retraining.
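The summary above does not give ZEDD's actual scoring rule. Purely as a hedged illustration of the embedding-drift idea, one could flag prompts whose embeddings drift beyond an envelope fitted on known-benign embeddings; the centroid construction and percentile threshold below are assumptions.

```python
# Hedged sketch of an embedding-drift style detector (not ZEDD's actual rule):
# score a prompt by cosine distance of its embedding from a benign centroid.
import numpy as np


def fit_reference(benign_embeddings: np.ndarray):
    """Compute a benign centroid and a drift threshold from benign data only."""
    centroid = benign_embeddings.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    unit = benign_embeddings / np.linalg.norm(benign_embeddings, axis=1, keepdims=True)
    dists = 1.0 - unit @ centroid            # cosine distances to the centroid
    threshold = float(np.percentile(dists, 99))  # flag the top 1% drift (assumed)
    return centroid, threshold


def is_suspicious(prompt_embedding: np.ndarray, centroid: np.ndarray,
                  threshold: float) -> bool:
    """Flag prompts whose embedding drifts beyond the benign envelope."""
    v = prompt_embedding / np.linalg.norm(prompt_embedding)
    return (1.0 - float(v @ centroid)) > threshold
```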
arXiv Detail & Related papers (2026-01-18T11:33:35Z)
- ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks. ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z)
- When retrieval outperforms generation: Dense evidence retrieval for scalable fake news detection [1.329253775274691]
DeReC is a lightweight framework that demonstrates how general-purpose text embeddings can effectively replace autoregressive LLM-based approaches in fact verification tasks. By combining dense retrieval with specialized classification, our system achieves better accuracy while being significantly more efficient. Our results demonstrate that carefully engineered retrieval-based systems can match or exceed LLM performance in specialized tasks.
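As a hedged sketch of the retrieve-then-classify pattern this summary describes (not DeReC's actual architecture), dense evidence retrieval can be paired with a lightweight classifier over simple similarity statistics; the feature choice below is an assumption.

```python
# Hedged sketch of a retrieve-then-classify pipeline: dense evidence retrieval
# followed by a simple classifier. Illustrative only, not DeReC's design.
import numpy as np
from sklearn.linear_model import LogisticRegression


def top_k_evidence(claim_emb: np.ndarray, corpus_embs: np.ndarray, k: int = 5):
    """Return indices and cosine similarities of the k nearest evidence passages."""
    c = claim_emb / np.linalg.norm(claim_emb)
    E = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = E @ c
    idx = np.argsort(-sims)[:k]
    return idx, sims[idx]


def claim_features(claim_emb: np.ndarray, corpus_embs: np.ndarray, k: int = 5):
    """Similarity statistics used as classifier features (assumed, for illustration)."""
    _, sims = top_k_evidence(claim_emb, corpus_embs, k)
    return np.array([sims.max(), sims.mean(), sims.min()])


# Usage sketch: clf = LogisticRegression().fit(feature_matrix, labels)
# where labels distinguish verified claims from fake ones (hypothetical data).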
arXiv Detail & Related papers (2025-11-06T18:35:45Z)
- DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z)
- Large Language Model-Based Framework for Explainable Cyberattack Detection in Automatic Generation Control Systems [5.99333254967625]
This paper proposes a hybrid framework that integrates machine learning (ML) detection with natural-language explanations generated by large language models (LLMs) to detect cyberattacks. The proposed framework effectively combines real-time detection with interpretable, high-fidelity explanations, addressing a critical need for actionable AI in smart grid cybersecurity.
arXiv Detail & Related papers (2025-07-29T21:23:08Z)
- Side-Channel Extraction of Dataflow AI Accelerator Hardware Parameters [2.5118823309854323]
This paper proposes a methodology to recover the hardware configuration of dataflow accelerators generated with the FINN framework. We demonstrate an attack phase requiring only 337 ms to recover the hardware parameters with more than 95% accuracy, and 421 ms to recover them fully. This approach offers a more realistic attack scenario than existing methods, and compared to SoA attacks based on tsfresh, our method requires 940x and 110x less time for the preparation and attack phases, respectively.
arXiv Detail & Related papers (2025-06-18T13:06:09Z)
- AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage. We show that scaling the agentic reasoning system at test time substantially enhances robustness without compromising model utility. Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z)
- FlexiGPT: Pruning and Extending Large Language Models with Low-Rank Weight Sharing [59.12511498024836]
We present a method for pruning large language models (LLMs) that selectively removes model blocks based on an importance score. We propose a principled metric to replace each pruned block using a weight-sharing mechanism. Empirical evaluations demonstrate substantial performance gains over existing methods.
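The summary names an importance score for selecting blocks to prune but not its definition. The following is a hedged illustration using a common heuristic (how little a block changes the hidden representation on calibration data); this is an assumption, not FlexiGPT's published metric.

```python
# Hedged sketch of ranking transformer blocks for pruning by an importance
# heuristic: blocks that barely change the representation are pruned first.
import numpy as np


def block_importance(inputs: np.ndarray, outputs: np.ndarray) -> float:
    """1 - mean cosine similarity between a block's input and output hidden states."""
    a = inputs / np.linalg.norm(inputs, axis=-1, keepdims=True)
    b = outputs / np.linalg.norm(outputs, axis=-1, keepdims=True)
    return float(1.0 - (a * b).sum(axis=-1).mean())


def blocks_to_prune(per_block_io, num_prune: int):
    """per_block_io: list of (input_states, output_states) pairs, one per block."""
    scores = [block_importance(x, y) for x, y in per_block_io]
    return sorted(np.argsort(scores)[:num_prune].tolist())  # lowest-impact blocks
```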
arXiv Detail & Related papers (2025-01-24T18:46:37Z)
- FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation.
To mitigate the overhead incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, an input-adaptive feed-forward skipping strategy.
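The summary only names the idea of input-adaptive feed-forward skipping. As an illustrative assumption (not the paper's published criterion), one could gate each FFN sublayer on whether a token's hidden state is still changing noticeably between layers.

```python
# Toy sketch of input-adaptive FFN skipping (an assumed criterion, not
# FFN-SkipLLM's): skip the feed-forward sublayer once the hidden state
# has nearly stopped changing from the previous layer.
import numpy as np


def maybe_apply_ffn(h_prev: np.ndarray, h_curr: np.ndarray, ffn, tau: float = 0.99):
    """Apply ffn(h_curr) only if the hidden state still changed noticeably."""
    cos = float(h_prev @ h_curr) / (np.linalg.norm(h_prev) * np.linalg.norm(h_curr))
    if cos > tau:                         # representation saturated -> skip FFN
        return h_curr, True
    return h_curr + ffn(h_curr), False    # standard residual FFN update
```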
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
- FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization [78.46088089185156]
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible.
Existing approaches penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models.
We propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models.
arXiv Detail & Related papers (2020-10-21T17:05:01Z)