Related papers: Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models

Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models

URL: http://arxiv.org/abs/2410.04190v1
Date: Sat, 5 Oct 2024 15:10:01 GMT
Title: Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models
Authors: Yiting Dong, Guobin Shen, Dongcheng Zhao, Xiang He, Yi Zeng,
Abstract summary: Large Language Models (LLMs) remain vulnerable to jailbreak attacks that bypass their safety mechanisms. We introduce a novel scalable jailbreak attack that preempts the activation of an LLM's safety policies by occupying its computational resources.
Score: 8.024771725860127
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) remain vulnerable to jailbreak attacks that bypass their safety mechanisms. Existing attack methods are fixed or specifically tailored for certain models and cannot flexibly adjust attack strength, which is critical for generalization when attacking models of various sizes. We introduce a novel scalable jailbreak attack that preempts the activation of an LLM's safety policies by occupying its computational resources. Our method involves engaging the LLM in a resource-intensive preliminary task - a Character Map lookup and decoding process - before presenting the target instruction. By saturating the model's processing capacity, we prevent the activation of safety protocols when processing the subsequent instruction. Extensive experiments on state-of-the-art LLMs demonstrate that our method achieves a high success rate in bypassing safety measures without requiring gradient access, manual prompt engineering. We verified our approach offers a scalable attack that quantifies attack strength and adapts to different model scales at the optimal strength. We shows safety policies of LLMs might be more susceptible to resource constraints. Our findings reveal a critical vulnerability in current LLM safety designs, highlighting the need for more robust defense strategies that account for resource-intense condition.

Related papers

Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement [48.50995874445193]
Large Language Models (LLMs) have shown impressive capabilities across various tasks but remain vulnerable to meticulously crafted jailbreak attacks.<n>We propose SAGE (Self-Aware Guard Enhancement), a training-free defense strategy designed to align LLMs' strong safety discrimination performance with their relatively weaker safety generation ability.
arXiv Detail & Related papers (2025-05-17T15:54:52Z)
Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary [2.4329261266984346]
Large Language Models (LLMs) are designed to generate helpful and safe content. adversarial attacks, commonly referred to as jailbreak, can bypass their safety protocols. We introduce a novel jailbreak attack method that leverages the prefilling feature of LLMs.
arXiv Detail & Related papers (2025-04-28T07:38:43Z)
LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution [84.2846064139183]
Large Language Models (LLMs) face threats from jailbreak prompts. We propose LightDefense, a lightweight defense mechanism targeted at white-box models.
arXiv Detail & Related papers (2025-04-02T09:21:26Z)
Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks [88.84977282952602]
A high volume of recent ML security literature focuses on attacks against aligned large language models (LLMs) In this paper, we analyze security and privacy vulnerabilities that are unique to LLM agents. We conduct a series of illustrative attacks on popular open-source and commercial agents, demonstrating the immediate practical implications of their vulnerabilities.
arXiv Detail & Related papers (2025-02-12T17:19:36Z)
A Realistic Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks. Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text. We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing.
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach [25.31933913962953]
Large Language Models (LLMs) have gained widespread use, raising concerns about their security. We introduce PathSeeker, a novel black-box jailbreak method, which is inspired by the game of rats escaping a maze. Our method outperforms five state-of-the-art attack techniques when tested across 13 commercial and open-source LLMs.
arXiv Detail & Related papers (2024-09-21T15:36:26Z)
LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models [21.02295266675853]
We propose a novel black-box jailbreak attack method, Analyzing-based Jailbreak (ABJ)<n>ABJ comprises two independent attack paths, which exploit the model's multimodal reasoning capabilities to bypass safety mechanisms.<n>Our work reveals a new type of safety risk and highlights the urgent need to mitigate implicit vulnerabilities in the model's reasoning process.
arXiv Detail & Related papers (2024-07-23T06:14:41Z)
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
Defending Large Language Models Against Attacks With Residual Stream Activation Analysis [0.0]
Large Language Models (LLMs) are vulnerable to adversarial threats. This paper presents an innovative defensive strategy, given white box access to an LLM. We apply a novel methodology for analyzing distinctive activation patterns in the residual streams for attack prompt classification.
arXiv Detail & Related papers (2024-06-05T13:06:33Z)
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs) Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z)
Protecting Your LLMs with Information Bottleneck [20.870610473199125]
We introduce the Information Bottleneck Protector (IBProtector), a defense mechanism grounded in the information bottleneck principle. The IBProtector selectively compresses and perturbs prompts, facilitated by a lightweight and trainable extractor. Our empirical evaluations show that IBProtector outperforms current defense methods in mitigating jailbreak attempts.
arXiv Detail & Related papers (2024-04-22T08:16:07Z)
Uncovering Safety Risks of Large Language Models through Concept Activation Vector [13.804245297233454]
We introduce a Safety Concept Activation Vector (SCAV) framework to guide attacks on large language models (LLMs) We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks. Our attack method significantly improves the attack success rate and response quality while requiring less training data.
arXiv Detail & Related papers (2024-04-18T09:46:25Z)
Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes [0.0]
Large Language Models (LLMs) have gained widespread adoption across various domains, including chatbots and auto-task completion agents. These models are susceptible to safety vulnerabilities such as jailbreaking, prompt injection, and privacy leakage attacks. This study investigates the impact of these modifications on LLM safety, a critical consideration for building reliable and secure AI systems.
arXiv Detail & Related papers (2024-04-05T20:31:45Z)
AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks. Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)
Attack Prompt Generation for Red Teaming and Defending Large Language Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content. We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.