FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks
- URL: http://arxiv.org/abs/2412.07672v1
- Date: Tue, 10 Dec 2024 17:02:28 GMT
- Title: FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks
- Authors: Bocheng Chen, Hanqing Guo, Qiben Yan,
- Abstract summary: Defense in large language models (LLMs) is crucial to counter the numerous attackers exploiting these systems to generate harmful content.
We propose a moving target defense approach that alters decoding hyper parameters to enhance model robustness.
Our results demonstrate that our defense is the most effective against jailbreak attacks in three of the models tested.
- Score: 7.31505609352525
- License:
- Abstract: Defense in large language models (LLMs) is crucial to counter the numerous attackers exploiting these systems to generate harmful content through manipulated prompts, known as jailbreak attacks. Although many defense strategies have been proposed, they often require access to the model's internal structure or need additional training, which is impractical for service providers using LLM APIs, such as OpenAI APIs or Claude APIs. In this paper, we propose a moving target defense approach that alters decoding hyperparameters to enhance model robustness against various jailbreak attacks. Our approach does not require access to the model's internal structure and incurs no additional training costs. The proposed defense includes two key components: (1) optimizing the decoding strategy by identifying and adjusting decoding hyperparameters that influence token generation probabilities, and (2) transforming the decoding hyperparameters and model system prompts into dynamic targets, which are continuously altered during each runtime. By continuously modifying decoding strategies and prompts, the defense effectively mitigates the existing attacks. Our results demonstrate that our defense is the most effective against jailbreak attacks in three of the models tested when using LLMs as black-box APIs. Moreover, our defense offers lower inference costs and maintains comparable response quality, making it a potential layer of protection when used alongside other defense methods.
Related papers
- JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs [11.924542310342282]
We present JailPO, a novel black-box jailbreak framework to examine Large Language Models (LLMs) alignment.
For scalability and universality, JailPO meticulously trains attack models to automatically generate covert jailbreak prompts.
We also introduce a preference optimization-based attack method to enhance the jailbreak effectiveness.
arXiv Detail & Related papers (2024-12-20T07:29:10Z) - Jailbreaking? One Step Is Enough! [6.142918017301964]
Large language models (LLMs) excel in various tasks but remain vulnerable to jailbreak attacks, where adversaries manipulate prompts to generate harmful outputs.
We propose a Reverse Embedded Defense Attack (REDA) mechanism that disguises the attack intention as the "defense" intention.
To enhance the model's confidence and guidance in "defensive" intentions, we adopt in-context learning (ICL) with a small number of attack examples.
arXiv Detail & Related papers (2024-12-17T07:33:41Z) - HSF: Defending against Jailbreak Attacks with Hidden State Filtering [14.031010511732008]
We propose a jailbreak attack defense strategy based on a Hidden State Filter (HSF)
HSF enables the model to preemptively identify and reject adversarial inputs before the inference process begins.
It significantly reduces the success rate of jailbreak attacks while minimally impacting responses to benign user queries.
arXiv Detail & Related papers (2024-08-31T06:50:07Z) - SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner [21.414701448926614]
This paper introduces a generic LLM jailbreak defense framework called SelfDefend.
We empirically validate using mainstream GPT-3.5/4 models against major jailbreak attacks.
To further improve the defense's robustness and minimize costs, we employ a data distillation approach to tune dedicated open-source defense models.
arXiv Detail & Related papers (2024-06-08T15:45:31Z) - AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z) - Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs)
Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.
Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z) - AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z) - Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z) - Attack Prompt Generation for Red Teaming and Defending Large Language
Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content.
We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z) - Baseline Defenses for Adversarial Attacks Against Aligned Language
Models [109.75753454188705]
Recent work shows that text moderations can produce jailbreaking prompts that bypass defenses.
We look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training.
We find that the weakness of existing discretes for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs.
arXiv Detail & Related papers (2023-09-01T17:59:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.