Related papers: DROJ: A Prompt-Driven Attack against Large Language Models

DROJ: A Prompt-Driven Attack against Large Language Models

URL: http://arxiv.org/abs/2411.09125v1
Date: Thu, 14 Nov 2024 01:48:08 GMT
Title: DROJ: A Prompt-Driven Attack against Large Language Models
Authors: Leyang Hu, Boran Wang,
Abstract summary: Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks. Despite massive alignment efforts, LLMs remain susceptible to adversarial jailbreak attacks. We introduce a novel approach, Directed Rrepresentation Optimization Jailbreak (DROJ)
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks. Due to their training on internet-sourced datasets, LLMs can sometimes generate objectionable content, necessitating extensive alignment with human feedback to avoid such outputs. Despite massive alignment efforts, LLMs remain susceptible to adversarial jailbreak attacks, which usually are manipulated prompts designed to circumvent safety mechanisms and elicit harmful responses. Here, we introduce a novel approach, Directed Rrepresentation Optimization Jailbreak (DROJ), which optimizes jailbreak prompts at the embedding level to shift the hidden representations of harmful queries towards directions that are more likely to elicit affirmative responses from the model. Our evaluations on LLaMA-2-7b-chat model show that DROJ achieves a 100\% keyword-based Attack Success Rate (ASR), effectively preventing direct refusals. However, the model occasionally produces repetitive and non-informative responses. To mitigate this, we introduce a helpfulness system prompt that enhances the utility of the model's responses. Our code is available at https://github.com/Leon-Leyang/LLM-Safeguard.

Related papers

RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models [60.201244463046784]
Large language models are vulnerable to jailbreak attacks.<n>This paper studies black-box multi-turn jailbreaks, aiming to train attacker LLMs to elicit harmful content from black-box models.
arXiv Detail & Related papers (2025-12-08T17:42:59Z)
Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models [17.860698041523918]
Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs)<n>We propose Response Attack, which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query.<n> RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates.
arXiv Detail & Related papers (2025-07-07T17:56:05Z)
Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models [69.11679786018206]
Supervised fine-tuning (SFT) aligns large language models with human intent by training them on labeled task-specific data.<n>Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer pairs.<n>We propose a novel clean-data backdoor attack for jailbreaking LLMs.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
Multi-round jailbreak attack on large language models [2.540971544359496]
We introduce a multi-round jailbreak approach to better understand "jailbreak" attacks. This method can rewrite the dangerous prompts, decomposing them into a series of less harmful sub-questions. Our experimental results show a 94% success rate on the llama2-7B.
arXiv Detail & Related papers (2024-10-15T12:08:14Z)
RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process [23.66988994636578]
We introduce RePD, an innovative attack framework designed to mitigate the risk of jailbreak attacks on large language models (LLMs) RePD operates on a one-shot learning model, wherein it accesses a database of jailbreak prompt templates to identify and decompose harmful inquiries embedded within user prompts. We have demonstrated the efficacy of our proposed RePD in enhancing the resilience of LLMs against jailbreak attacks, without compromising their performance in responding to typical user requests.
arXiv Detail & Related papers (2024-10-11T09:39:11Z)
MEGen: Generative Backdoor in Large Language Models via Model Editing [56.46183024683885]
Large language models (LLMs) have demonstrated remarkable capabilities. Their powerful generative abilities enable flexible responses based on various queries or instructions. This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks with the least side effects.
arXiv Detail & Related papers (2024-08-20T10:44:29Z)
BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger [67.75420257197186]
In this work, we propose $textbfBaThe, a simple yet effective jailbreak defense mechanism. Jailbreak backdoor attack uses harmful instructions combined with manually crafted strings as triggers to make the backdoored model generate prohibited responses. We assume that harmful instructions can function as triggers, and if we alternatively set rejection responses as the triggered response, the backdoored model then can defend against jailbreak attacks.
arXiv Detail & Related papers (2024-08-17T04:43:26Z)
RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs [14.1985036536366]
We propose RL-JACK, a novel black-box jailbreaking attack powered by deep reinforcement learning (DRL) Our method includes a series of customized designs to enhance the RL agent's learning efficiency in the jailbreaking context. We demonstrate that RL-JACK is overall much more effective than existing jailbreaking attacks against six SOTA LLMs.
arXiv Detail & Related papers (2024-06-13T01:05:22Z)
Towards Universal and Black-Box Query-Response Only Attack on LLMs with QROA [2.4578723416255754]
We introduce QROA, a black-box jailbreak method to identify adversarial suffixes when appended to a malicious instruction.<n>Unlike existing suffix-based jailbreak approaches, QROA does not require access to the model's logit or any other internal information.<n>We also propose QROA-UNV, an extension that identifies universal adversarial suffixes for individual models.
arXiv Detail & Related papers (2024-06-04T07:27:36Z)
Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing [14.094372002702476]
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts. We propose a novel defense method termed textbfLayer-specific textbfEditing (LED) to enhance the resilience of LLMs against jailbreak attacks.
arXiv Detail & Related papers (2024-05-28T13:26:12Z)
AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks. Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)
On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction. Inspired by these findings, we propose a method for safety prompt optimization, namely DRO. Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z)
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models [55.748851471119906]
Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks. Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters. We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types.
arXiv Detail & Related papers (2023-10-23T17:46:07Z)
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models. AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm.
arXiv Detail & Related papers (2023-10-03T19:44:37Z)
Universal and Transferable Adversarial Attacks on Aligned Language Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.