Poisoned LangChain: Jailbreak LLMs by LangChain
- URL: http://arxiv.org/abs/2406.18122v1
- Date: Wed, 26 Jun 2024 07:21:02 GMT
- Title: Poisoned LangChain: Jailbreak LLMs by LangChain
- Authors: Ziqiu Wang, Jun Liu, Shengkai Zhang, Yang Yang,
- Abstract summary: We propose the concept of indirect jailbreak and achieve Retrieval-Augmented Generation via LangChain.
We tested this method on six different large language models across three major categories of jailbreak issues.
- Score: 9.658883589561915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the development of natural language processing (NLP), large language models (LLMs) are becoming increasingly popular. LLMs are integrating more into everyday life, raising public concerns about their security vulnerabilities. Consequently, the security of large language models is becoming critically important. Currently, the techniques for attacking and defending against LLMs are continuously evolving. One significant method type of attack is the jailbreak attack, which designed to evade model safety mechanisms and induce the generation of inappropriate content. Existing jailbreak attacks primarily rely on crafting inducement prompts for direct jailbreaks, which are less effective against large models with robust filtering and high comprehension abilities. Given the increasing demand for real-time capabilities in large language models, real-time updates and iterations of new knowledge have become essential. Retrieval-Augmented Generation (RAG), an advanced technique to compensate for the model's lack of new knowledge, is gradually becoming mainstream. As RAG enables the model to utilize external knowledge bases, it provides a new avenue for jailbreak attacks. In this paper, we conduct the first work to propose the concept of indirect jailbreak and achieve Retrieval-Augmented Generation via LangChain. Building on this, we further design a novel method of indirect jailbreak attack, termed Poisoned-LangChain (PLC), which leverages a poisoned external knowledge base to interact with large language models, thereby causing the large models to generate malicious non-compliant dialogues.We tested this method on six different large language models across three major categories of jailbreak issues. The experiments demonstrate that PLC successfully implemented indirect jailbreak attacks under three different scenarios, achieving success rates of 88.56%, 79.04%, and 82.69% respectively.
Related papers
- xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models.
We propose a novel black-box jailbreak method leveraging reinforcement learning (RL)
We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z) - Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment [97.38766396447369]
Despite training-time safety alignment, MLLMs remain vulnerable to jailbreak attacks.
We propose Immune, an inference-time defense framework that leverages a safe reward model to defend against jailbreak attacks.
arXiv Detail & Related papers (2024-11-27T19:00:10Z) - SQL Injection Jailbreak: A Structural Disaster of Large Language Models [71.55108680517422]
We introduce a novel jailbreak method, which targets the external properties of LLMs.
By injecting jailbreak information into user prompts, SIJ successfully induces the model to output harmful content.
We propose a simple defense method called Self-Reminder-Key to counter SIJ.
arXiv Detail & Related papers (2024-11-03T13:36:34Z) - IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves [67.30731020715496]
We propose a novel jailbreak method named IDEATOR, which autonomously generates malicious image-text pairs for black-box jailbreak attacks.
IDEATOR uses a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model.
It achieves a 94% success rate in jailbreaking MiniGPT-4 with an average of only 5.34 queries, and high success rates of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Meta's Chameleon.
arXiv Detail & Related papers (2024-10-29T07:15:56Z) - Multi-round jailbreak attack on large language models [2.540971544359496]
We introduce a multi-round jailbreak approach to better understand "jailbreak" attacks.
This method can rewrite the dangerous prompts, decomposing them into a series of less harmful sub-questions.
Our experimental results show a 94% success rate on the llama2-7B.
arXiv Detail & Related papers (2024-10-15T12:08:14Z) - Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.
It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.
Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z) - Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks [27.11523234556414]
We propose a plug-and-play and easy-to-deploy jailbreak defense framework, namely Prefix Guidance (PG)
PG guides the model to identify harmful prompts by directly setting the first few tokens of the model's output.
We demonstrate the effectiveness of PG across three models and five attack methods.
arXiv Detail & Related papers (2024-08-15T14:51:32Z) - h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment [48.5611060845958]
We propose a novel benchmark of composable jailbreak attacks to move beyond static datasets and of attacks and harms.
We use h4rm3l to generate a dataset of 2656 successful novel jailbreak attacks targeting 6 state-of-the-art (SOTA) open-source and proprietary LLMs.
Several of our synthesized attacks are more effective than previously reported ones, with Attack Success rates exceeding 90% on SOTA closed language models.
arXiv Detail & Related papers (2024-08-09T01:45:39Z) - Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings [57.136748215262884]
We introduce ObscurePrompt for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data.
We first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects LLM's ethical decision boundary.
Our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms.
arXiv Detail & Related papers (2024-06-19T16:09:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.