NegBLEURT Forest: Leveraging Inconsistencies for Detecting Jailbreak Attacks
- URL: http://arxiv.org/abs/2511.11784v1
- Date: Fri, 14 Nov 2025 14:43:54 GMT
- Title: NegBLEURT Forest: Leveraging Inconsistencies for Detecting Jailbreak Attacks
- Authors: Lama Sleem, Jerome Francois, Lujun Li, Nathan Foucher, Niccolo Gentile, Radu State,
- Abstract summary: Jailbreak attacks designed to bypass safety mechanisms pose a serious threat by prompting LLMs to generate harmful or inappropriate content, despite alignment with ethical guidelines.<n>This work introduces a semantic consistency analysis between successful and unsuccessful responses, demonstrating that a negation-aware scoring approach captures meaningful patterns.<n>A novel detection framework called NegBLEURT Forest is proposed to evaluate the degree of alignment between outputs elicited by adversarial prompts and expected safe behaviors.<n>It identifies anomalous responses using the Isolation Forest algorithm, enabling reliable jailbreak detection.
- Score: 8.416892421891761
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Jailbreak attacks designed to bypass safety mechanisms pose a serious threat by prompting LLMs to generate harmful or inappropriate content, despite alignment with ethical guidelines. Crafting universal filtering rules remains difficult due to their inherent dependence on specific contexts. To address these challenges without relying on threshold calibration or model fine-tuning, this work introduces a semantic consistency analysis between successful and unsuccessful responses, demonstrating that a negation-aware scoring approach captures meaningful patterns. Building on this insight, a novel detection framework called NegBLEURT Forest is proposed to evaluate the degree of alignment between outputs elicited by adversarial prompts and expected safe behaviors. It identifies anomalous responses using the Isolation Forest algorithm, enabling reliable jailbreak detection. Experimental results show that the proposed method consistently achieves top-tier performance, ranking first or second in accuracy across diverse models using the crafted dataset, while competing approaches exhibit notable sensitivity to model and data variations.
Related papers
- ForgeDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models [8.765213350762748]
jailbreak attacks bypass alignment safeguards to elicit harmful outputs.<n>We propose ForgeDAN, a novel framework for generating semantically coherent and highly effective adversarial prompts.<n>Our evaluation demonstrates ForgeDAN achieves high jailbreaking success rates while maintaining naturalness and stealth, outperforming existing SOTA solutions.
arXiv Detail & Related papers (2025-11-17T16:19:21Z) - Toward Safer Diffusion Language Models: Discovery and Mitigation of Priming Vulnerability [5.650647159993238]
Diffusion language models (DLMs) generate tokens in parallel through iterative denoising.<n>In this paper, we reveal that DLMs have a critical vulnerability stemming from their iterative denoising process.<n>We propose a novel safety alignment method tailored to DLMs that trains models to generate safe responses from contaminated intermediate states.
arXiv Detail & Related papers (2025-10-01T06:35:23Z) - On the Adversarial Robustness of Learning-based Conformal Novelty Detection [10.58528988397402]
We study the adversarial robustness of conformal novelty detection using AdaDetect.<n>Our results show that adversarial perturbations can significantly increase the FDR while maintaining high detection power.
arXiv Detail & Related papers (2025-10-01T03:29:11Z) - DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics.<n>We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z) - D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models [62.83226685925107]
Deceptive Reasoning Exposure Suite (D-REX) is a novel dataset designed to evaluate the discrepancy between a model's internal reasoning process and its final output.<n>Each sample in D-REX contains the adversarial system prompt, an end-user's test query, the model's seemingly innocuous response, and, crucially, the model's internal chain-of-thought.<n>We demonstrate that D-REX presents a significant challenge for existing models and safety mechanisms.
arXiv Detail & Related papers (2025-09-22T15:59:40Z) - Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift [23.0914017433021]
This work identifies a novel class of deployment phase attacks that exploit a vulnerability by injecting imperceptible perturbations directly into the embedding layer outputs without modifying model weights or input text.<n>We propose Search based Embedding Poisoning, a practical, model agnostic framework that introduces carefully optimized perturbations into embeddings associated with high risk tokens.
arXiv Detail & Related papers (2025-09-08T05:00:58Z) - False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize [30.448801772258644]
Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities.<n>Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs' internal representations.<n>Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness.
arXiv Detail & Related papers (2025-09-04T05:15:55Z) - Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition.<n>We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts.<n>We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z) - FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning [9.960675988638805]
We propose a novel framework called fake audio detection with evidential learning (FADEL)<n>FADEL incorporates model uncertainty into its predictions, thereby leading to more robust performance in OOD scenarios.<n>We demonstrate the validity of uncertainty estimation by analyzing a strong correlation between average uncertainty and equal error rate (EER) across different spoofing algorithms.
arXiv Detail & Related papers (2025-04-22T07:40:35Z) - TrustLoRA: Low-Rank Adaptation for Failure Detection under Out-of-distribution Data [62.22804234013273]
We propose a simple failure detection framework to unify and facilitate classification with rejection under both covariate and semantic shifts.<n>Our key insight is that by separating and consolidating failure-specific reliability knowledge with low-rank adapters, we can enhance the failure detection ability effectively and flexibly.
arXiv Detail & Related papers (2025-04-20T09:20:55Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures
and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - Exploring Robustness of Unsupervised Domain Adaptation in Semantic
Segmentation [74.05906222376608]
We propose adversarial self-supervision UDA (or ASSUDA) that maximizes the agreement between clean images and their adversarial examples by a contrastive loss in the output space.
This paper is rooted in two observations: (i) the robustness of UDA methods in semantic segmentation remains unexplored, which pose a security concern in this field; and (ii) although commonly used self-supervision (e.g., rotation and jigsaw) benefits image tasks such as classification and recognition, they fail to provide the critical supervision signals that could learn discriminative representation for segmentation tasks.
arXiv Detail & Related papers (2021-05-23T01:50:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.