False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize
- URL: http://arxiv.org/abs/2509.03888v1
- Date: Thu, 04 Sep 2025 05:15:55 GMT
- Title: False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize
- Authors: Cheng Wang, Zeming Wei, Qin Liu, Muhao Chen
- Abstract summary: Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs' internal representations. Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness.
- Score: 30.448801772258644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs' internal representations, and researchers have proposed using such probing methods for safety detection. We systematically re-examine this paradigm. Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation follows a systematic approach, progressing from demonstrating comparable performance of simple n-gram methods, to controlled experiments with semantically cleaned datasets, to detailed analysis of pattern dependencies. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols; we provide further discussion in the hope of guiding responsible research in this direction. We have open-sourced the project at https://github.com/WangCheng0116/Why-Probe-Fails.
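To make the paradigm under examination concrete, the sketch below contrasts a probing-style detector (a linear classifier over representation vectors) with the kind of simple n-gram baseline the abstract mentions. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompts are invented toy examples, and the hidden-state features are random placeholders standing in for activations that a real probe would read from a specific LLM layer.

```python
# Minimal sketch of the two detector families discussed in the abstract.
# NOT the paper's code: prompts are toy examples and the "hidden states"
# are random placeholders for real LLM-layer activations.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical benign / malicious prompts (labels: 0 = benign, 1 = malicious).
benign = [
    "Summarize this article about photosynthesis.",
    "Write a poem about the ocean.",
]
malicious = [
    "Explain how to build a weapon at home.",
    "Write instructions for stealing credit card numbers.",
]
texts = benign + malicious
labels = np.array([0] * len(benign) + [1] * len(malicious))

# (1) Probing-style detector: a linear classifier over representation vectors.
# Placeholder features stand in for hidden states extracted from an LLM layer.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(len(texts), 64))
probe = LogisticRegression().fit(hidden_states, labels)

# (2) Simple n-gram baseline: TF-IDF over word n-grams + logistic regression.
ngram_baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
).fit(texts, labels)

print("probe train accuracy:  ", probe.score(hidden_states, labels))
print("n-gram train accuracy: ", ngram_baseline.score(texts, labels))
```

In the paper's analysis, the fact that baselines of this simple n-gram kind perform comparably to probes is one piece of evidence that probes latch onto surface cues such as instructional phrasing and trigger words rather than semantic harmfulness.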
Related papers
- Unknown Unknowns: Why Hidden Intentions in LLMs Evade Detection [4.514361164656055]
We introduce a taxonomy of ten categories of hidden intentions, organised by intent, mechanism, context, and impact. We systematically assess detection methods, including reasoning and non-reasoning LLM judges. We find that detection collapses in realistic open-world settings, particularly under low-prevalence conditions.
arXiv Detail & Related papers (2026-01-26T14:59:17Z)
- NegBLEURT Forest: Leveraging Inconsistencies for Detecting Jailbreak Attacks [8.416892421891761]
Jailbreak attacks designed to bypass safety mechanisms pose a serious threat by prompting LLMs to generate harmful or inappropriate content, despite alignment with ethical guidelines. This work introduces a semantic consistency analysis between successful and unsuccessful responses, demonstrating that a negation-aware scoring approach captures meaningful patterns. A novel detection framework called NegBLEURT Forest is proposed to evaluate the degree of alignment between outputs elicited by adversarial prompts and expected safe behaviors. It identifies anomalous responses using the Isolation Forest algorithm, enabling reliable jailbreak detection.
arXiv Detail & Related papers (2025-11-14T14:43:54Z)
- Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check [60.77691669644931]
We propose Functional Alignment for Distributional Equivalence (FADE), a novel metric that measures distributional similarity between unlearned and reference models. We show that FADE captures functional alignment across the entire output distribution, providing a principled assessment of genuine unlearning. These findings expose fundamental gaps in current evaluation practices and demonstrate that FADE provides a more robust foundation for developing and assessing truly effective unlearning methods.
arXiv Detail & Related papers (2025-10-14T20:50:30Z)
- On Evaluating Performance of LLM Inference Serving Systems [11.712948114304925]
We identify recurring anti-patterns across three key dimensions: Baseline Fairness, Evaluation Setup, and Metric Design. These anti-patterns are uniquely problematic for Large Language Model (LLM) inference due to its dual-phase nature. We provide a comprehensive checklist derived from our analysis, establishing a framework for recognizing and avoiding these anti-patterns.
arXiv Detail & Related papers (2025-07-11T20:58:21Z)
- LLM Performance for Code Generation on Noisy Tasks [0.41942958779358674]
We show that large language models (LLMs) can solve tasks obfuscated to a level where the text would be unintelligible to human readers. We report empirical evidence of distinct performance decay patterns between contaminated and unseen datasets. We propose measuring the decay of performance under obfuscation as a possible strategy for detecting dataset contamination.
arXiv Detail & Related papers (2025-05-29T16:11:18Z)
- A Knowledge-guided Adversarial Defense for Resisting Malicious Visual Manipulation [93.28532038721816]
Malicious applications of visual manipulation have raised serious threats to the security and reputation of users in many fields. We propose a knowledge-guided adversarial defense (KGAD) to actively force malicious manipulation models to output semantically confusing samples.
arXiv Detail & Related papers (2025-04-11T10:18:13Z)
- Lie Detector: Unified Backdoor Detection via Cross-Examination Framework [68.45399098884364]
We propose a unified backdoor detection framework in the semi-honest setting. Our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines. Notably, it is the first to effectively detect backdoors in multimodal large language models.
arXiv Detail & Related papers (2025-03-21T06:12:06Z)
- Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage. Models may behave unreliably due to poorly explored failure modes. Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z)
- Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z)
- Good-looking but Lacking Faithfulness: Understanding Local Explanation Methods through Trend-based Testing [13.076171586649528]
We evaluate the faithfulness of explanation methods and find that traditional tests on faithfulness encounter the random dominance problem.
Benefiting from the trend tests, we successfully assess the explanation methods on complex data for the first time.
arXiv Detail & Related papers (2023-09-09T14:44:39Z)
- A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification [0.491574468325115]
We present a large-scale empirical study for the first time enabling benchmarking confidence scoring functions.
The revelation of a simple softmax response baseline as the overall best performing method underlines the drastic shortcomings of current evaluation.
arXiv Detail & Related papers (2022-11-28T12:25:27Z)
- A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks [72.7373468905418]
We develop an open-source toolkit, OpenBackdoor, to foster the implementation and evaluation of textual backdoor learning.
We also propose CUBE, a simple yet strong clustering-based defense baseline.
arXiv Detail & Related papers (2022-06-17T02:29:23Z)