Related papers: Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs

Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs

URL: http://arxiv.org/abs/2508.16347v2
Date: Mon, 15 Sep 2025 03:59:27 GMT
Title: Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs
Authors: Yu Yan, Sheng Sun, Zhe Wang, Yijun Lin, Zenghao Duan, zhifei zheng, Min Liu, Zhiyi yin, Jianping Zhang,
Abstract summary: This study investigates the misuse threats of Large Language Models (LLMs) in terms of dangerous knowledge possession, harmful task planning utility, and harmfulness judgment.<n> Experiments reveal a mismatch between jailbreak success rates and harmful knowledge possession in LLMs, and existing LLM-as-a-judge frameworks tend to anchor harmfulness judgments on toxic language patterns.
Score: 16.95831588112687
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: With the development of Large Language Models (LLMs), numerous efforts have revealed their vulnerabilities to jailbreak attacks. Although these studies have driven the progress in LLMs' safety alignment, it remains unclear whether LLMs have internalized authentic knowledge to deal with real-world crimes, or are merely forced to simulate toxic language patterns. This ambiguity raises concerns that jailbreak success is often attributable to a hallucination loop between jailbroken LLM and judger LLM. By decoupling the use of jailbreak techniques, we construct knowledge-intensive Q\&A to investigate the misuse threats of LLMs in terms of dangerous knowledge possession, harmful task planning utility, and harmfulness judgment robustness. Experiments reveal a mismatch between jailbreak success rates and harmful knowledge possession in LLMs, and existing LLM-as-a-judge frameworks tend to anchor harmfulness judgments on toxic language patterns. Our study reveals a gap between existing LLM safety assessments and real-world threat potential.

Related papers

ShallowJail: Steering Jailbreaks against Large Language Models [7.9152592631238425]
We introduce ShallowJail, a novel attack that exploits shallow alignment in LLMs.<n>ShallowJail can misguide LLMs' responses by manipulating the initial tokens during inference.<n>Through extensive experiments, we demonstrate the effectiveness ofshallow, which substantially degrades the safety of state-of-the-art LLM responses.
arXiv Detail & Related papers (2026-02-06T18:35:38Z)
Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement [48.50995874445193]
Large Language Models (LLMs) have shown impressive capabilities across various tasks but remain vulnerable to meticulously crafted jailbreak attacks.<n>We propose SAGE (Self-Aware Guard Enhancement), a training-free defense strategy designed to align LLMs' strong safety discrimination performance with their relatively weaker safety generation ability.
arXiv Detail & Related papers (2025-05-17T15:54:52Z)
Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings [57.136748215262884]
We introduce ObscurePrompt for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data.<n>We first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects LLM's ethical decision boundary.<n>Our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms.
arXiv Detail & Related papers (2024-06-19T16:09:58Z)
How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States [65.45603614354329]
Large language models (LLMs) rely on safety alignment to avoid responding to malicious user inputs. Jailbreak can circumvent safety guardrails, resulting in LLMs generating harmful content. We employ weak classifiers to explain LLM safety through the intermediate hidden states.
arXiv Detail & Related papers (2024-06-09T05:04:37Z)
Efficient Indirect LLM Jailbreak via Multimodal-LLM Jailbreak [62.56901628534646]
This paper focuses on jailbreaking attacks against large language models (LLMs)<n>Our approach surpasses current state-of-the-art jailbreak methods in terms of both efficiency and effectiveness.
arXiv Detail & Related papers (2024-05-30T12:50:32Z)
Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing [14.094372002702476]
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts. We propose a novel defense method termed textbfLayer-specific textbfEditing (LED) to enhance the resilience of LLMs against jailbreak attacks.
arXiv Detail & Related papers (2024-05-28T13:26:12Z)
Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts [13.176057229119408]
Large Language Models (LLMs) of Prompt Jailbreaking are getting more and more attention. We propose a novel approach by focusing on a set of target questions that are inherently more sensitive to jailbreak prompts.
arXiv Detail & Related papers (2024-04-12T08:08:44Z)
Distract Large Language Models for Automatic Jailbreak Attack [8.364590541640482]
We propose a novel black-box jailbreak framework for automated red teaming of Large language models. We designed malicious content concealing and memory reframing with an iterative optimization algorithm to jailbreak LLMs.
arXiv Detail & Related papers (2024-03-13T11:16:43Z)
Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues [16.97760778679782]
We propose an indirect jailbreak attack approach, Puzzler, which can bypass the LLM's defense strategy and obtain malicious response. Our experiments show that Puzzler achieves a query success rate of 96.6% on closed-source LLMs. When tested against the state-of-the-art jailbreak detection approaches, Puzzler proves to be more effective at evading detection compared to baselines.
arXiv Detail & Related papers (2024-02-14T11:11:51Z)
Revisiting Jailbreaking for Large Language Models: A Representation Engineering Perspective [43.94115802328438]
Recent surge in jailbreaking attacks has revealed significant vulnerabilities in Large Language Models (LLMs) when exposed to malicious inputs.<n>We suggest that the self-safeguarding capability of LLMs is linked to specific activity patterns within their representation space.<n>Our findings demonstrate that these patterns can be detected with just a few pairs of contrastive queries.
arXiv Detail & Related papers (2024-01-12T00:50:04Z)
A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses. adversarial prompts known as 'jailbreaks' can circumvent safeguards. We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks. We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.