Poisoned LangChain: Jailbreak LLMs by LangChain
- URL: http://arxiv.org/abs/2406.18122v1
- Date: Wed, 26 Jun 2024 07:21:02 GMT
- Title: Poisoned LangChain: Jailbreak LLMs by LangChain
- Authors: Ziqiu Wang, Jun Liu, Shengkai Zhang, Yang Yang,
- Abstract summary: We propose the concept of indirect jailbreak and achieve Retrieval-Augmented Generation via LangChain.
We tested this method on six different large language models across three major categories of jailbreak issues.
- Score: 9.658883589561915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the development of natural language processing (NLP), large language models (LLMs) are becoming increasingly popular. LLMs are integrating more into everyday life, raising public concerns about their security vulnerabilities. Consequently, the security of large language models is becoming critically important. Currently, the techniques for attacking and defending against LLMs are continuously evolving. One significant method type of attack is the jailbreak attack, which designed to evade model safety mechanisms and induce the generation of inappropriate content. Existing jailbreak attacks primarily rely on crafting inducement prompts for direct jailbreaks, which are less effective against large models with robust filtering and high comprehension abilities. Given the increasing demand for real-time capabilities in large language models, real-time updates and iterations of new knowledge have become essential. Retrieval-Augmented Generation (RAG), an advanced technique to compensate for the model's lack of new knowledge, is gradually becoming mainstream. As RAG enables the model to utilize external knowledge bases, it provides a new avenue for jailbreak attacks. In this paper, we conduct the first work to propose the concept of indirect jailbreak and achieve Retrieval-Augmented Generation via LangChain. Building on this, we further design a novel method of indirect jailbreak attack, termed Poisoned-LangChain (PLC), which leverages a poisoned external knowledge base to interact with large language models, thereby causing the large models to generate malicious non-compliant dialogues.We tested this method on six different large language models across three major categories of jailbreak issues. The experiments demonstrate that PLC successfully implemented indirect jailbreak attacks under three different scenarios, achieving success rates of 88.56%, 79.04%, and 82.69% respectively.
Related papers
- Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models [21.252514293436437]
Analyzing-based Jailbreak (ABJ) is an effective jailbreak attack method for Large Language Models (LLMs)
ABJ achieves 94.8% Attack Success Rate (ASR) and 1.06 Attack Efficiency (AE) on GPT-4-turbo-0409, demonstrating state-of-the-art attack effectiveness and efficiency.
arXiv Detail & Related papers (2024-07-23T06:14:41Z) - Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection [54.05862550647966]
This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks.
Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40%.
arXiv Detail & Related papers (2024-06-28T11:35:54Z) - Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack [86.6931690001357]
Knowledge-to-jailbreak aims to generate jailbreaks from domain knowledge to evaluate the safety of large language models on specialized domains.
We collect a large-scale dataset with 12,974 knowledge-jailbreak pairs and fine-tune a large language model as jailbreak-generator.
arXiv Detail & Related papers (2024-06-17T15:59:59Z) - Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack [5.912639903214644]
We introduce a novel jailbreak attack called Crescendo.
Crescendo is a multi-turn jailbreak that interacts with the model in a seemingly benign manner.
We evaluate Crescendo on various public systems, including ChatGPT, Gemini Pro, Gemini-Ultra, LlaMA-2 70b Chat, and Anthropic Chat.
arXiv Detail & Related papers (2024-04-02T10:45:49Z) - Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models [29.312244478583665]
generative AI has enabled ubiquitous access to large language models (LLMs)
Jailbreak prompts have emerged as one of the most effective mechanisms to circumvent security restrictions and elicit harmful content originally designed to be prohibited.
We show that users often succeeded in jailbreak prompts generation regardless of their expertise in LLMs.
We also develop a system using AI as the assistant to automate the process of jailbreak prompt generation.
arXiv Detail & Related papers (2024-03-26T02:47:42Z) - Tastle: Distract Large Language Models for Automatic Jailbreak Attack [9.137714258654842]
We propose a black-box jailbreak framework for automated red teaming of large language models (LLMs)
Our framework is superior in terms of effectiveness, scalability and transferability.
We also evaluate the effectiveness of existing jailbreak defense methods against our attack.
arXiv Detail & Related papers (2024-03-13T11:16:43Z) - Foot In The Door: Understanding Large Language Model Jailbreaking via
Cognitive Psychology [12.584928288798658]
This study builds a psychological perspective on the intrinsic decision-making logic of Large Language Models (LLMs)
We propose an automatic black-box jailbreaking method based on the Foot-in-the-Door (FITD) technique.
arXiv Detail & Related papers (2024-02-24T02:27:55Z) - Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z) - Cognitive Overload: Jailbreaking Large Language Models with Overloaded
Logical Thinking [60.78524314357671]
We investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of large language models (LLMs)
Our proposed cognitive overload is a black-box attack with no need for knowledge of model architecture or access to model weights.
Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload.
arXiv Detail & Related papers (2023-11-16T11:52:22Z) - Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization [98.18718484152595]
We propose to integrate goal prioritization at both training and inference stages to counteract the intrinsic conflict between the goals of being helpful and ensuring safety.
Our work thus contributes to the comprehension of jailbreaking attacks and defenses, and sheds light on the relationship between LLMs' capability and safety.
arXiv Detail & Related papers (2023-11-15T16:42:29Z) - Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.