Related papers: The bitter lesson of misuse detection

The bitter lesson of misuse detection

URL: http://arxiv.org/abs/2507.06282v1
Date: Tue, 08 Jul 2025 15:21:17 GMT
Title: The bitter lesson of misuse detection
Authors: Hadrien Mariaccia, Charbel-Raphaël Segerie, Diego Dorn,
Abstract summary: We introduce BELLS, a Benchmark for the Evaluation of LLM Supervision Systems.<n>Bells framework is two dimensional: harm severity (benign, borderline, harmful) and adversarial sophistication (direct vs. jailbreak)<n>Our evaluations reveal drastic limitations of specialized supervision systems.
Score: 0.0
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Prior work on jailbreak detection has established the importance of adversarial robustness for LLMs but has largely focused on the model ability to resist adversarial inputs and to output safe content, rather than the effectiveness of external supervision systems. The only public and independent benchmark of these guardrails to date evaluates a narrow set of supervisors on limited scenarios. Consequently, no comprehensive public benchmark yet verifies how well supervision systems from the market perform under realistic, diverse attacks. To address this, we introduce BELLS, a Benchmark for the Evaluation of LLM Supervision Systems. The framework is two dimensional: harm severity (benign, borderline, harmful) and adversarial sophistication (direct vs. jailbreak) and provides a rich dataset covering 3 jailbreak families and 11 harm categories. Our evaluations reveal drastic limitations of specialized supervision systems. While they recognize some known jailbreak patterns, their semantic understanding and generalization capabilities are very limited, sometimes with detection rates close to zero when asking a harmful question directly or with a new jailbreak technique such as base64 encoding. Simply asking generalist LLMs if the user question is "harmful or not" largely outperforms these supervisors from the market according to our BELLS score. But frontier LLMs still suffer from metacognitive incoherence, often responding to queries they correctly identify as harmful (up to 30 percent for Claude 3.7 and greater than 50 percent for Mistral Large). These results suggest that simple scaffolding could significantly improve misuse detection robustness, but more research is needed to assess the tradeoffs of such techniques. Our results support the "bitter lesson" of misuse detection: general capabilities of LLMs are necessary to detect a diverse array of misuses and jailbreaks.

Related papers

LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges [70.85114705489222]
We propose MalwareBench, a benchmark dataset containing 3,520 jailbreaking prompts for malicious code-generation.<n>M MalwareBench is based on 320 manually crafted malicious code generation requirements, covering 11 jailbreak methods and 29 code functionality categories.<n>Experiments show that mainstream LLMs exhibit limited ability to reject malicious code-generation requirements, and the combination of multiple jailbreak methods further reduces the model's security capabilities.
arXiv Detail & Related papers (2025-06-09T12:02:39Z)
Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement [48.50995874445193]
Large Language Models (LLMs) have shown impressive capabilities across various tasks but remain vulnerable to meticulously crafted jailbreak attacks.<n>We propose SAGE (Self-Aware Guard Enhancement), a training-free defense strategy designed to align LLMs' strong safety discrimination performance with their relatively weaker safety generation ability.
arXiv Detail & Related papers (2025-05-17T15:54:52Z)
AttentionDefense: Leveraging System Prompt Attention for Explainable Defense Against Novel Jailbreaks [1.3101678989725927]
It is challenging to explain the reason behind the malicious nature of the jailbreak.<n>We propose and demonstrate that system-prompt attention from Small Language Models (SLMs) can be used to characterize adversarial prompts.<n>Our research suggests that the attention mechanism is an integral component in understanding and explaining how LMs respond to malicious input.
arXiv Detail & Related papers (2025-04-10T22:29:23Z)
JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model [25.204224437843365]
Multimodal large language models (MLLMs) excel in vision-language tasks but pose significant risks of generating harmful content.<n>Jailbreak attacks refer to intentional manipulations that bypass safety mechanisms in models, leading to the generation of inappropriate or unsafe content.<n>We introduce a test-time adaptive framework called JAILDAM to address these issues.
arXiv Detail & Related papers (2025-04-03T05:00:28Z)
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.<n>We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.<n>Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models [55.253208152184065]
Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs to generate harmful text.<n>We conduct a detailed analysis of seven different jailbreak methods and find that disagreements stem from insufficient observation samples.<n>We propose a novel defense called textbfActivation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary.
arXiv Detail & Related papers (2024-12-22T14:18:39Z)
LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds [98.20826635707341]
Jailbreak attacks expose vulnerabilities in safety-aligned LLMs by eliciting harmful outputs through carefully crafted prompts.<n>We frame jailbreaks as inference-time misalignment and introduce LIAR, a fast, black-box, best-of-$N$ sampling attack requiring no training.<n>We also introduce a theoretical "safety net against jailbreaks" metric to quantify safety alignment strength and derive suboptimality bounds.
arXiv Detail & Related papers (2024-12-06T18:02:59Z)
HSF: Defending against Jailbreak Attacks with Hidden State Filtering [14.031010511732008]
We propose a jailbreak attack defense strategy based on a Hidden State Filter (HSF)<n>HSF enables the model to preemptively identify and reject adversarial inputs before the inference process begins.<n>It significantly reduces the success rate of jailbreak attacks while minimally impacting responses to benign user queries.
arXiv Detail & Related papers (2024-08-31T06:50:07Z)
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests.<n>First, existing methods often use coarse-grained taxonomy of unsafe topics, and are over-representing some fine-grained topics.<n>Second, linguistic characteristics and formatting of prompts are often overlooked, like different languages, dialects, and more -- which are only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z)
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs [13.317364896194903]
Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner. LLMs are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs.
arXiv Detail & Related papers (2024-06-13T17:01:40Z)
JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs [26.981225219312627]
We present a large-scale evaluation of various jailbreak attacks.<n>We collect 17 representative jailbreak attacks, summarize their features, and establish a novel jailbreak attack taxonomy.
arXiv Detail & Related papers (2024-02-08T13:42:50Z)
Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks [12.540530764250812]
We propose a formalism and a taxonomy of known (and possible) jailbreaks. We release a dataset of model outputs across 3700 jailbreak prompts over 4 tasks.
arXiv Detail & Related papers (2023-05-24T09:57:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.