From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models
- URL: http://arxiv.org/abs/2505.24232v1
- Date: Fri, 30 May 2025 05:48:50 GMT
- Title: From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models
- Authors: Haibo Jin, Peiyan Zhang, Peiran Wang, Man Luo, Haohan Wang
- Abstract summary: Large foundation models (LFMs) are susceptible to hallucinations and jailbreak attacks. We propose a unified theoretical framework that models jailbreaks as token-level optimization and hallucinations as attention-level optimization.
- Score: 16.066501870239073
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large foundation models (LFMs) are susceptible to two distinct vulnerabilities: hallucinations and jailbreak attacks. While typically studied in isolation, we observe that defenses targeting one often affect the other, hinting at a deeper connection. We propose a unified theoretical framework that models jailbreaks as token-level optimization and hallucinations as attention-level optimization. Within this framework, we establish two key propositions: (1) *Similar Loss Convergence* - the loss functions for both vulnerabilities converge similarly when optimizing for target-specific outputs; and (2) *Gradient Consistency in Attention Redistribution* - both exhibit consistent gradient behavior driven by shared attention dynamics. We validate these propositions empirically on LLaVA-1.5 and MiniGPT-4, showing consistent optimization trends and aligned gradients. Leveraging this connection, we demonstrate that mitigation techniques for hallucinations can reduce jailbreak success rates, and vice versa. Our findings reveal a shared failure mode in LFMs and suggest that robustness strategies should jointly address both vulnerabilities.
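The two optimization views can be made concrete with a small sketch. Under a toy single-layer attention model (all names, shapes, and values below are placeholders, not the paper's code), one loss drives the output tokens toward a target continuation (token-level, jailbreak-style) while the other redistributes attention mass toward chosen positions (attention-level, hallucination-style); the cosine similarity of their gradients with respect to the shared input embeddings is the kind of quantity the gradient-consistency proposition concerns.

```python
# Minimal, self-contained sketch (hypothetical toy model, not the paper's code).
# It contrasts a token-level loss (likelihood of a target continuation) with an
# attention-level loss (shifting attention mass), then compares their gradients
# with respect to the shared input embeddings.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d, seq_len = 100, 32, 10

embed = torch.nn.Embedding(vocab, d)
W_q, W_k, W_v = (torch.nn.Linear(d, d, bias=False) for _ in range(3))
W_out = torch.nn.Linear(d, vocab, bias=False)

input_ids = torch.randint(0, vocab, (seq_len,))
target_ids = torch.randint(0, vocab, (seq_len,))   # e.g. a target affirmative continuation
focus_positions = torch.tensor([2, 3])             # positions attention should favor

def forward(x):
    """One attention layer plus an LM head; returns logits and attention weights."""
    q, k, v = W_q(x), W_k(x), W_v(x)
    attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)
    return W_out(attn @ v), attn

# Shared leaf tensor so both losses differentiate w.r.t. the same embeddings.
x = embed(input_ids).detach().requires_grad_(True)
logits, attn = forward(x)

# (1) Token-level objective: cross-entropy toward the target continuation.
loss_token = F.cross_entropy(logits, target_ids)

# (2) Attention-level objective: negative attention mass on the focus positions.
loss_attn = -attn[:, focus_positions].sum(dim=-1).mean()

g_token = torch.autograd.grad(loss_token, x, retain_graph=True)[0]
g_attn = torch.autograd.grad(loss_attn, x)[0]

cos = F.cosine_similarity(g_token.flatten(), g_attn.flatten(), dim=0)
print(f"gradient cosine similarity: {cos.item():.3f}")
```

In the paper this kind of comparison is run on LLaVA-1.5 and MiniGPT-4; the randomly initialized toy model here only shows how the two losses and the gradient check can be set up, not the reported alignment itself.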
Related papers
- Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs [61.916827858666906]
We reveal a universal phenomenon that occurs during jailbreak attacks: Attention Slipping.
We show Attention Slipping is consistent across various jailbreak methods, including gradient-based token replacement, prompt-level template refinement, and in-context learning.
We propose Attention Sharpening, a new defense that directly counters Attention Slipping by sharpening the attention score distribution using temperature scaling (a brief sketch follows below).
arXiv Detail & Related papers (2025-07-06T12:19:04Z)
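As a rough illustration of the Attention Sharpening defense summarized in the entry above, the following sketch applies temperature scaling inside a standard scaled-dot-product attention: a temperature below 1 concentrates the softmax distribution on the highest-scoring keys. The function name, tensor shapes, and default temperature are assumptions for illustration only, not the paper's implementation.

```python
# Schematic of temperature-sharpened attention (illustrative only): dividing the
# attention logits by a temperature T < 1 concentrates the softmax distribution.
import torch

def sharpened_attention(q, k, v, temperature=0.5):
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # scaled dot-product logits
    weights = torch.softmax(scores / temperature, dim=-1)  # T < 1 sharpens each row
    return weights @ v, weights

q, k, v = (torch.randn(1, 8, 16) for _ in range(3))  # (batch, positions, head dim)
_, sharp = sharpened_attention(q, k, v)
_, plain = sharpened_attention(q, k, v, temperature=1.0)
# True for every row: more mass concentrates on the top-scoring key.
print(sharp.max(dim=-1).values > plain.max(dim=-1).values)
```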
- Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks [9.277492743469235]
We present the first systematic jailbreak evaluation of DeepSeek-series models.
We compare them with GPT-3.5 and GPT-4 using the HarmBench benchmark.
arXiv Detail & Related papers (2025-06-23T11:53:31Z)
- Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation [18.37303422539757]
We investigate the vulnerability of intent-aware guardrails and demonstrate that large language models exhibit implicit intent detection capabilities.
We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and further reframes them into declarative-style narratives.
Our framework consistently outperforms several cutting-edge jailbreak methods and evades even advanced Intent Analysis (IA) and Chain-of-Thought (CoT)-based defenses.
arXiv Detail & Related papers (2025-05-24T06:47:32Z)
- T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak Attacks [67.91652526657599]
We formalize the T2V jailbreak attack as a discrete optimization problem and propose a joint objective-based optimization framework, called T2V-OptJail.
We conduct large-scale experiments on several T2V models, covering both open-source models and real commercial closed-source models.
The proposed method improves on the existing state-of-the-art method by 11.4% and 10.0% in terms of attack success rate.
arXiv Detail & Related papers (2025-05-10T16:04:52Z)
- Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition.
We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts.
We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z)
- A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1 [24.599707290204524]
Transfer-based targeted attacks on large vision-language models (LVLMs) often fail against black-box commercial LVLMs.
We propose an approach that refines semantic clarity by encoding explicit semantic details within local regions.
Our approach achieves success rates exceeding 90% on GPT-4.5, 4o, and o1, significantly outperforming all prior state-of-the-art attack methods.
arXiv Detail & Related papers (2025-03-13T17:59:55Z)
- CeTAD: Towards Certified Toxicity-Aware Distance in Vision Language Models [16.5022773312661]
We propose a universal certified defence framework to safeguard large vision-language models against jailbreak attacks.
First, we propose a novel distance metric to quantify semantic discrepancies between malicious and intended responses.
Then, we devise a regressed certification approach that employs randomized smoothing to provide formal robustness guarantees.
arXiv Detail & Related papers (2025-03-08T17:33:55Z)
- Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks.
We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z)
- Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.
We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.
Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
- The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense [56.32083100401117]
The vulnerability of Vision Large Language Models (VLLMs) to jailbreak attacks comes as no surprise.
Recent defense mechanisms against these attacks have reached near-saturation performance on benchmark evaluations.
arXiv Detail & Related papers (2024-11-13T07:57:19Z)
- AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)
- Unveiling Vulnerabilities of Contrastive Recommender Systems to Poisoning Attacks [48.911832772464145]
Contrastive learning (CL) has recently gained prominence in the domain of recommender systems.
This paper identifies a vulnerability of CL-based recommender systems: they are more susceptible to poisoning attacks that aim to promote individual items.
arXiv Detail & Related papers (2023-11-30T04:25:28Z)
- Visual Adversarial Examples Jailbreak Aligned Large Language Models [66.53468356460365]
We show that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks.
We exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision.
Our study underscores the escalating adversarial risks associated with the pursuit of multimodality.
arXiv Detail & Related papers (2023-06-22T22:13:03Z)
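The weak-link argument in the last entry is easy to see in a projected-gradient-style sketch: because image pixels are continuous, they can be optimized directly toward a target continuation, unlike discrete text tokens. The toy encoder, head, and hyperparameters below are hypothetical placeholders, not the models or attack code from that paper.

```python
# Toy PGD-style sketch of a visual adversarial example (all components are
# hypothetical placeholders): optimize a bounded pixel perturbation so a
# stand-in vision-language pipeline assigns high likelihood to a chosen
# target continuation.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d, n_pixels, n_img_tokens = 50, 16, 64, 4

vision_proj = torch.nn.Linear(n_pixels, n_img_tokens * d)  # stand-in vision encoder
lm_head = torch.nn.Linear(d, vocab, bias=False)            # stand-in language-model head

image = torch.rand(n_pixels)                            # clean image, pixels in [0, 1]
target_ids = torch.randint(0, vocab, (n_img_tokens,))   # target continuation to elicit
eps, alpha, steps = 0.1, 0.01, 50                       # L-inf budget, step size, iterations

delta = torch.zeros_like(image, requires_grad=True)
for _ in range(steps):
    img_tokens = vision_proj(image + delta).view(n_img_tokens, d)
    loss = F.cross_entropy(lm_head(img_tokens), target_ids)  # low loss = target is likely
    grad = torch.autograd.grad(loss, delta)[0]
    with torch.no_grad():
        delta -= alpha * grad.sign()                    # signed gradient step toward the target
        delta.clamp_(-eps, eps)                         # stay inside the L-inf perturbation ball
        delta.copy_((image + delta).clamp(0, 1) - image)  # keep perturbed pixels valid

print(f"target cross-entropy after attack: {loss.item():.3f}")
```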
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.