The Laminar Flow Hypothesis: Detecting Jailbreaks via Semantic Turbulence in Large Language Models
- URL: http://arxiv.org/abs/2512.13741v1
- Date: Sun, 14 Dec 2025 18:10:29 GMT
- Title: The Laminar Flow Hypothesis: Detecting Jailbreaks via Semantic Turbulence in Large Language Models
- Authors: Md. Hasib Ur Rahman
- Abstract summary: The Laminar Flow Hypothesis holds that benign inputs induce smooth, gradual transitions in an LLM's high-dimensional latent space, whereas adversarial prompts trigger chaotic, high-variance trajectories, termed Semantic Turbulence. Tests show that Semantic Turbulence serves not only as a lightweight, real-time jailbreak detector but also as a non-invasive diagnostic tool.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Large Language Models (LLMs) become ubiquitous, the challenge of securing them against adversarial "jailbreaking" attacks has intensified. Current defense strategies often rely on computationally expensive external classifiers or brittle lexical filters, overlooking the intrinsic dynamics of the model's reasoning process. In this work, the Laminar Flow Hypothesis is introduced, which posits that benign inputs induce smooth, gradual transitions in an LLM's high-dimensional latent space, whereas adversarial prompts trigger chaotic, high-variance trajectories - termed Semantic Turbulence - resulting from the internal conflict between safety alignment and instruction-following objectives. This phenomenon is formalized through a novel, zero-shot metric: the variance of layer-wise cosine velocity. Experimental evaluation across diverse small language models reveals a striking diagnostic capability. The RLHF-aligned Qwen2-1.5B exhibits a statistically significant 75.4% increase in turbulence under attack (p < 0.001), validating the hypothesis of internal conflict. Conversely, Gemma-2B displays a 22.0% decrease in turbulence, characterizing a distinct, low-entropy "reflex-based" refusal mechanism. These findings demonstrate that Semantic Turbulence serves not only as a lightweight, real-time jailbreak detector but also as a non-invasive diagnostic tool for categorizing the underlying safety architecture of black-box models.
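The headline metric is concrete enough to sketch. Below is a minimal Python rendering, assuming "layer-wise cosine velocity" means the cosine distance between consecutive layers' hidden states for the final prompt token, with turbulence as the variance of that sequence over layers; the paper's exact normalization and token choice may differ, and the model name in the usage note is only illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def turbulence_score(model, tokenizer, prompt: str) -> float:
    """Variance of layer-wise cosine velocity for the last prompt token.

    One plausible reading of the paper's metric:
    velocity[l] = 1 - cos(h_l, h_{l+1}), turbulence = Var_l(velocity).
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states: tuple of (1, seq_len, d) tensors, embeddings plus each layer
    states = torch.stack([h[0, -1] for h in out.hidden_states])  # (L+1, d)
    cos = torch.nn.functional.cosine_similarity(states[:-1], states[1:], dim=-1)
    velocity = 1.0 - cos                       # (L,) cosine "speed" per layer hop
    return velocity.var(unbiased=True).item()

# Usage (model name illustrative; the paper evaluates e.g. Qwen2-1.5B):
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-1.5B-Instruct")
# mdl = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct")
# print(turbulence_score(mdl, tok, "How do I bake bread?"))
```

A high score relative to a benign baseline would flag the prompt, matching the abstract's claim of a lightweight, real-time detector.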
Related papers
- When Backdoors Go Beyond Triggers: Semantic Drift in Diffusion Models Under Encoder Attacks [2.4923006485141284]
We demonstrate that encoder-side poisoning induces persistent, trigger-free semantic corruption. Backdoors act as low-rank, target-centered deformations that amplify local sensitivity, causing distortion to propagate coherently across semantic neighborhoods. Our findings, validated across diffusion and contrastive paradigms, expose the deep structural risks of encoder poisoning and highlight the necessity of geometric audits beyond simple attack success rates (a toy audit is sketched after this entry).
arXiv Detail & Related papers (2026-02-21T23:48:04Z)
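The "geometric audits" this summary calls for can be illustrated with a toy check for the claimed low-rank structure. The sketch assumes paired clean/poisoned embeddings of the same inputs are available; the function name and the 95% energy threshold are assumptions, not the paper's.

```python
import numpy as np

def low_rank_audit(clean: np.ndarray, poisoned: np.ndarray, energy: float = 0.95) -> int:
    """Effective rank of the embedding shift clean -> poisoned.

    clean, poisoned: (n_samples, dim) embeddings of the same inputs from the
    clean and poisoned encoders. A very small effective rank is consistent
    with the 'low-rank, target-centered deformation' described above.
    """
    delta = poisoned - clean
    delta = delta - delta.mean(axis=0, keepdims=True)   # center the shift
    s = np.linalg.svd(delta, compute_uv=False)          # singular values
    ratio = np.cumsum(s**2) / np.sum(s**2)              # cumulative spectral energy
    return int(np.searchsorted(ratio, energy) + 1)      # dims needed for 95% energy

# rank = low_rank_audit(clean_embs, poisoned_embs)  # rank 1-2 => suspicious
```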
- BadCLIP++: Stealthy and Persistent Backdoors in Multimodal Contrastive Learning [73.46118996284888]
Research on backdoor attacks against multimodal contrastive learning models faces two key challenges: stealthiness and persistence. We propose BadCLIP++, a unified framework that tackles both. For stealthiness, we introduce a semantic-fusion QR micro-trigger that embeds imperceptible patterns near task-relevant regions. For persistence, we stabilize trigger embeddings via radius shrinkage and centroid alignment (one reading of that objective is sketched after this entry).
arXiv Detail & Related papers (2026-02-19T08:31:16Z)
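For intuition only, here is one hypothetical way to write down a "radius shrinkage plus centroid alignment" objective from the summary above. The actual BadCLIP++ loss, weighting, and embedding space are not specified here, so treat this purely as a reading of the two phrases.

```python
import torch

def persistence_loss(trigger_emb: torch.Tensor, target_emb: torch.Tensor,
                     lam: float = 1.0) -> torch.Tensor:
    """Hypothetical radius-shrinkage + centroid-alignment objective.

    trigger_emb: (n, d) embeddings of triggered inputs; target_emb: (d,)
    embedding of the attack target. Shrinkage pulls triggered embeddings
    toward their own centroid (small radius); alignment pulls that centroid
    onto the target. Not BadCLIP++'s actual loss.
    """
    centroid = trigger_emb.mean(dim=0)
    shrink = (trigger_emb - centroid).pow(2).sum(dim=-1).mean()  # mean squared radius
    align = (centroid - target_emb).pow(2).sum()                 # centroid-to-target
    return shrink + lam * align
```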
- Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting [44.23640219583819]
Supervised Fine-Tuning (SFT) is the standard paradigm for domain adaptation, yet it frequently incurs the cost of catastrophic forgetting. We propose Entropy-Adaptive Fine-Tuning (EAFT) to solve this problem. EAFT consistently matches the downstream performance of standard SFT while significantly mitigating the degradation of general capabilities (one guess at its weighting rule is sketched after this entry).
arXiv Detail & Related papers (2026-01-05T14:28:17Z)
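The EAFT summary gives the goal but not the rule, so the following is a guess at an entropy-adaptive weighting: down-weight tokens where the model is already confident (low predictive entropy), so "confident conflicts" contribute little gradient. All names and the normalization are assumptions.

```python
import torch
import torch.nn.functional as F

def entropy_adaptive_loss(logits, labels, ignore_index=-100):
    """Entropy-weighted cross-entropy, a sketch of the idea named above.

    Tokens where the model is confident (low entropy) but disagrees with the
    label are down-weighted, so fine-tuning pushes hardest where the model is
    uncertain. The real EAFT recipe may differ.
    """
    logits = logits[:, :-1].contiguous()           # predict token t+1 from t
    labels = labels[:, 1:].contiguous()
    ce = F.cross_entropy(logits.transpose(1, 2), labels,
                         ignore_index=ignore_index, reduction="none")
    probs = logits.softmax(-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
    weight = entropy / entropy.max().clamp_min(1e-9)   # low entropy -> low weight
    mask = (labels != ignore_index).float()
    return (weight * ce * mask).sum() / mask.sum().clamp_min(1.0)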
- DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models [55.30555646945055]
Text-to-Image (T2I) models are vulnerable to semantic leakage. We introduce DeLeaker, a lightweight approach that mitigates leakage by directly intervening on the model's attention maps (a generic rendering is sketched after this entry). SLIM is the first dataset dedicated to semantic leakage.
arXiv Detail & Related papers (2025-10-16T17:39:21Z)
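Generically, "intervening on the model's attention maps" at inference can look like the following: suppress the attention mass flowing from chosen query tokens to leaking key tokens, then renormalize. DeLeaker's actual reweighting rule is dynamic and surely more involved; this is only the skeleton.

```python
import torch

def suppress_leakage(attn: torch.Tensor, query_idx, leak_key_idx,
                     factor: float = 0.1) -> torch.Tensor:
    """Down-weight attention from given query tokens to leaking key tokens.

    attn: (heads, n_queries, n_keys), already softmax-normalized.
    query_idx / leak_key_idx: index lists for the token pairs to decouple.
    """
    attn = attn.clone()
    rows = torch.as_tensor(query_idx).unsqueeze(-1)   # broadcast over leak keys
    attn[:, rows, torch.as_tensor(leak_key_idx)] *= factor
    return attn / attn.sum(dim=-1, keepdim=True)      # rows sum to 1 again
```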
- Drift No More? Context Equilibria in Multi-Turn LLM Interactions [58.69551510148673]
Context drift is the gradual divergence of a model's outputs from goal-consistent behavior across turns. Unlike single-turn errors, drift unfolds temporally and is poorly captured by static evaluation metrics. We show that multi-turn drift can be understood as a controllable equilibrium phenomenon rather than as inevitable decay (one way to measure it is sketched after this entry).
arXiv Detail & Related papers (2025-10-09T04:48:49Z)
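Drift across turns can be made measurable with something as simple as the curve below: the cosine distance of each turn's response embedding from a goal embedding. The choice of sentence encoder is left open and is an assumption here, not the paper's protocol.

```python
import numpy as np

def drift_curve(goal_vec: np.ndarray, turn_vecs) -> list:
    """Cosine distance of each turn's embedding from the goal embedding.

    A rising curve indicates context drift; a curve that flattens out is
    consistent with the equilibrium view described above. Embeddings can come
    from any sentence encoder.
    """
    goal = goal_vec / np.linalg.norm(goal_vec)
    curve = []
    for v in turn_vecs:
        v = v / np.linalg.norm(v)
        curve.append(1.0 - float(goal @ v))   # 0 = on-goal, larger = drifted
    return curve
```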
- DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. We propose DiffuGuard, a training-free defense framework that addresses these vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z)
- Modeling the Attack: Detecting AI-Generated Text by Quantifying Adversarial Perturbations [2.7620215077666557]
Modern detectors are notoriously vulnerable to adversarial attacks, with paraphrasing standing out as an effective evasion technique. This paper presents a comparative study of adversarial robustness, first quantifying the limitations of standard adversarial training. We then introduce a novel, significantly more resilient detection framework: Perturbation-Invariant Feature Engineering.
arXiv Detail & Related papers (2025-09-22T13:03:53Z)
- Adversarial Activation Patching: A Framework for Detecting and Mitigating Emergent Deception in Safety-Aligned Transformers [0.0]
Large language models (LLMs) aligned for safety often exhibit emergent deceptive behaviors. This paper introduces adversarial activation patching, a novel mechanistic interpretability framework. By sourcing activations from "deceptive" prompts, we simulate vulnerabilities and quantify deception rates (a minimal patching loop is sketched after this entry).
arXiv Detail & Related papers (2025-07-12T21:29:49Z)
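Activation patching itself is a standard mechanistic-interpretability move, so a minimal loop is easy to sketch: cache one layer's residual-stream output under a "deceptive" prompt and splice it into a clean run via a forward hook. The LLaMA-style `model.model.layers` layout and equal prompt lengths are assumptions.

```python
import torch

def patched_logits(model, tok, clean: str, deceptive: str, layer: int):
    """Splice one layer's activations from a deceptive run into a clean run.

    Returns next-token logits of the patched clean run, to be compared with
    an unpatched clean run. Assumes a LLaMA-style `model.model.layers` layout
    and that both prompts tokenize to the same length.
    """
    layers = model.model.layers
    cache = {}

    def save(mod, args, out):
        h = out[0] if isinstance(out, tuple) else out
        cache["h"] = h.detach()

    def patch(mod, args, out):
        if isinstance(out, tuple):
            return (cache["h"],) + out[1:]
        return cache["h"]

    with torch.no_grad():
        hook = layers[layer].register_forward_hook(save)
        model(**tok(deceptive, return_tensors="pt"))   # cache deceptive activations
        hook.remove()
        hook = layers[layer].register_forward_hook(patch)
        out = model(**tok(clean, return_tensors="pt")) # clean run, patched layer
        hook.remove()
    return out.logits[0, -1]

# model/tok would come from transformers' AutoModelForCausalLM / AutoTokenizer.
```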
- A Few Large Shifts: Layer-Inconsistency Based Minimal Overhead Adversarial Example Detection [13.109309606764754]
We introduce a plug-in detection framework that leverages internal layer-wise inconsistencies within the target model itself (a one-number stand-in is sketched after this entry). Our method achieves state-of-the-art detection performance with negligible computational overhead.
arXiv Detail & Related papers (2025-05-19T00:48:53Z)
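A hedged stand-in for a layer-inconsistency detector: score an input by the largest standardized layer-to-layer shift of its final-token hidden state. The paper's actual features and classifier are not reproduced here.

```python
import torch

def large_shift_score(hidden_states) -> float:
    """Max z-score of layer-to-layer hidden-state shift for the last token.

    hidden_states: tuple from a forward pass with output_hidden_states=True.
    Per the summary above, adversarial inputs concentrate a few unusually
    large layer shifts; a high max z-score flags the input.
    """
    states = torch.stack([h[0, -1] for h in hidden_states])   # (L+1, d)
    shifts = (states[1:] - states[:-1]).norm(dim=-1)          # (L,) per-layer shift
    z = (shifts - shifts.mean()) / shifts.std().clamp_min(1e-9)
    return z.max().item()
```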
- Extreme Miscalibration and the Illusion of Adversarial Robustness [66.29268991629085]
Adversarial Training is often used to increase model robustness. We show that this observed gain in robustness is an illusion of robustness (IOR). We urge the NLP community to incorporate test-time temperature scaling into their robustness evaluations (sketched after this entry).
arXiv Detail & Related papers (2024-02-27T13:49:12Z)
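Test-time temperature scaling is a one-liner, which is part of the point: evaluations can cheaply include it. A sketch:

```python
import torch

def rescaled_confidence(logits: torch.Tensor, T: float) -> torch.Tensor:
    """Test-time temperature scaling: softmax(logits / T).

    Overconfident (miscalibrated) models can look adversarially robust only
    because attack objectives stall on saturated softmax outputs; rescaling
    with T > 1 before evaluation removes that illusion.
    """
    return torch.softmax(logits / T, dim=-1)
```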
- Improving Adversarial Robustness to Sensitivity and Invariance Attacks with Deep Metric Learning [80.21709045433096]
A standard method in adversarial robustness defends against samples crafted by minimally perturbing a clean sample. We use metric learning to frame adversarial regularization as an optimal transport problem (a Sinkhorn-style sketch follows this entry). Our preliminary results indicate that regularizing over invariant perturbations in our framework improves both invariant and sensitivity defense.
arXiv Detail & Related papers (2022-11-04T13:54:02Z)
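Finally, "adversarial regularization as an optimal transport problem" can be illustrated with an entropic-OT (Sinkhorn) distance between clean and perturbed feature batches, used as a training regularizer. The squared-Euclidean ground cost, eps, and uniform weights below are assumptions; production code should use log-domain updates or a library such as POT.

```python
import torch

def sinkhorn_distance(x: torch.Tensor, y: torch.Tensor,
                      eps: float = 0.1, iters: int = 100) -> torch.Tensor:
    """Entropic OT cost between two feature batches with uniform weights.

    x: (n, d) clean features, y: (m, d) perturbed features. The returned
    scalar can be added to the training loss as an adversarial regularizer.
    Plain-domain updates for brevity; use log-domain for numerical safety.
    """
    cost = torch.cdist(x, y) ** 2                 # (n, m) squared-Euclidean cost
    K = torch.exp(-cost / eps)                    # Gibbs kernel
    u = torch.full((x.size(0),), 1.0 / x.size(0))
    v = torch.full((y.size(0),), 1.0 / y.size(0))
    a, b = u.clone(), v.clone()
    for _ in range(iters):                        # Sinkhorn fixed-point updates
        a = u / (K @ b)
        b = v / (K.T @ a)
    plan = a[:, None] * K * b[None, :]            # transport plan diag(a) K diag(b)
    return (plan * cost).sum()
```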