Related papers: FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks

FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks

URL: http://arxiv.org/abs/2412.07672v1
Date: Tue, 10 Dec 2024 17:02:28 GMT
Title: FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks
Authors: Bocheng Chen, Hanqing Guo, Qiben Yan,
Abstract summary: Defense in large language models (LLMs) is crucial to counter the numerous attackers exploiting these systems to generate harmful content.<n>We propose a moving target defense approach that alters decoding hyper parameters to enhance model robustness.<n>Our results demonstrate that our defense is the most effective against jailbreak attacks in three of the models tested.
Score: 7.31505609352525
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Defense in large language models (LLMs) is crucial to counter the numerous attackers exploiting these systems to generate harmful content through manipulated prompts, known as jailbreak attacks. Although many defense strategies have been proposed, they often require access to the model's internal structure or need additional training, which is impractical for service providers using LLM APIs, such as OpenAI APIs or Claude APIs. In this paper, we propose a moving target defense approach that alters decoding hyperparameters to enhance model robustness against various jailbreak attacks. Our approach does not require access to the model's internal structure and incurs no additional training costs. The proposed defense includes two key components: (1) optimizing the decoding strategy by identifying and adjusting decoding hyperparameters that influence token generation probabilities, and (2) transforming the decoding hyperparameters and model system prompts into dynamic targets, which are continuously altered during each runtime. By continuously modifying decoding strategies and prompts, the defense effectively mitigates the existing attacks. Our results demonstrate that our defense is the most effective against jailbreak attacks in three of the models tested when using LLMs as black-box APIs. Moreover, our defense offers lower inference costs and maintains comparable response quality, making it a potential layer of protection when used alongside other defense methods.

Related papers

Evaluating Multi-Agent Defences Against Jailbreaking Attacks on Large Language Models [4.757470449749876]
This paper investigates the use of multi-agent LLM systems as a defence against jailbreaking attacks.<n>We evaluate three jailbreaking strategies, including the original AutoDefense attack and two from Deepleaps: BetterDan and JB.<n>Our results show that multi-agent systems enhance resistance to jailbreaks, especially by reducing false negatives.
arXiv Detail & Related papers (2025-06-30T07:29:07Z)
One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs [13.54228868302755]
ArrAttack is an attack method designed to target defended large language models (LLMs)<n>ArrAttack automatically generates robust jailbreak prompts capable of bypassing various defense measures.<n>Our work bridges the gap between jailbreak attacks and defenses, providing a fresh perspective on generating robust jailbreak prompts.
arXiv Detail & Related papers (2025-05-23T08:02:38Z)
DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification [18.006622965818856]
We introduce DETAM, a finetuning-free defense approach that improves the defensive capabilities against jailbreak attacks of LLMs. Specifically, we analyze the differences in attention scores between successful and unsuccessful defenses to identify the attention heads sensitive to jailbreak attacks. During inference, we reallocate attention to emphasize the user's core intention, minimizing interference from attack tokens.
arXiv Detail & Related papers (2025-04-18T09:02:12Z)
LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution [84.2846064139183]
Large Language Models (LLMs) face threats from jailbreak prompts. We propose LightDefense, a lightweight defense mechanism targeted at white-box models.
arXiv Detail & Related papers (2025-04-02T09:21:26Z)
Jailbreaking? One Step Is Enough! [6.142918017301964]
Large language models (LLMs) excel in various tasks but remain vulnerable to jailbreak attacks, where adversaries manipulate prompts to generate harmful outputs. We propose a Reverse Embedded Defense Attack (REDA) mechanism that disguises the attack intention as the "defense" intention. To enhance the model's confidence and guidance in "defensive" intentions, we adopt in-context learning (ICL) with a small number of attack examples.
arXiv Detail & Related papers (2024-12-17T07:33:41Z)
HSF: Defending against Jailbreak Attacks with Hidden State Filtering [14.031010511732008]
We propose a jailbreak attack defense strategy based on a Hidden State Filter (HSF) HSF enables the model to preemptively identify and reject adversarial inputs before the inference process begins. It significantly reduces the success rate of jailbreak attacks while minimally impacting responses to benign user queries.
arXiv Detail & Related papers (2024-08-31T06:50:07Z)
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner [21.414701448926614]
This paper introduces a generic LLM jailbreak defense framework called SelfDefend. We empirically validate using the commonly used GPT-3.5/4 models across all major jailbreak attacks. These models outperform six state-of-the-art defenses and match the performance of GPT-4-based SelfDefend.
arXiv Detail & Related papers (2024-06-08T15:45:31Z)
AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques. We propose three comprehensive, automated, and logical frameworks. We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs) Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks. Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks [17.22989422489567]
Large language models (LLMs) are vulnerable to adversarial attacks or jailbreaking. We propose an optimization-based objective for defending LLMs against jailbreaking attacks and an algorithm to create robust system-level defenses. Our results show improved robustness to both jailbreaks seen during optimization and unknown jailbreaks, reducing the attack success rate (ASR) on GPT-4 to 6% and Llama-2 to 0% on JailbreakBench.
arXiv Detail & Related papers (2024-01-30T18:56:08Z)
Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks. Existing jailbreaking methods are computationally costly. We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
Attack Prompt Generation for Red Teaming and Defending Large Language Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content. We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z)
Baseline Defenses for Adversarial Attacks Against Aligned Language Models [109.75753454188705]
Recent work shows that text moderations can produce jailbreaking prompts that bypass defenses. We look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training. We find that the weakness of existing discretes for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs.
arXiv Detail & Related papers (2023-09-01T17:59:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.