A Cross-Language Investigation into Jailbreak Attacks in Large Language
Models
- URL: http://arxiv.org/abs/2401.16765v1
- Date: Tue, 30 Jan 2024 06:04:04 GMT
- Title: A Cross-Language Investigation into Jailbreak Attacks in Large Language
Models
- Authors: Jie Li, Yi Liu, Chongyang Liu, Ling Shi, Xiaoning Ren, Yaowen Zheng,
Yang Liu, Yinxing Xue
- Abstract summary: A particularly underexplored area is the Multilingual Jailbreak attack.
There is a lack of comprehensive empirical studies addressing this specific threat.
This study provides valuable insights into understanding and mitigating Multilingual Jailbreak attacks.
- Score: 14.226415550366504
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have become increasingly popular for their
advanced text generation capabilities across various domains. However, like any
software, they face security challenges, including the risk of 'jailbreak'
attacks that manipulate LLMs to produce prohibited content. A particularly
underexplored area is the Multilingual Jailbreak attack, where malicious
questions are translated into various languages to evade safety filters.
Currently, there is a lack of comprehensive empirical studies addressing this
specific threat.
To address this research gap, we conducted an extensive empirical study on
Multilingual Jailbreak attacks. We developed a novel semantic-preserving
algorithm to create a multilingual jailbreak dataset and conducted an
exhaustive evaluation on both widely-used open-source and commercial LLMs,
including GPT-4 and LLaMa. Additionally, we performed interpretability analysis
to uncover patterns in Multilingual Jailbreak attacks and implemented a
fine-tuning mitigation method. Our findings reveal that our mitigation strategy
significantly enhances model defense, reducing the attack success rate by
96.2%. This study provides valuable insights into understanding and mitigating
Multilingual Jailbreak attacks.
Related papers
- Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models [21.252514293436437]
Analyzing-based Jailbreak (ABJ) is an effective jailbreak attack method for Large Language Models (LLMs)
ABJ achieves 94.8% Attack Success Rate (ASR) and 1.06 Attack Efficiency (AE) on GPT-4-turbo-0409, demonstrating state-of-the-art attack effectiveness and efficiency.
arXiv Detail & Related papers (2024-07-23T06:14:41Z) - Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection [54.05862550647966]
This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks.
Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40%.
arXiv Detail & Related papers (2024-06-28T11:35:54Z) - Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively.
In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv Detail & Related papers (2024-06-06T13:00:42Z) - AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z) - Transferring Troubles: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning [63.481446315733145]
Our research focuses on cross-lingual backdoor attacks against multilingual models.
We investigate how poisoning the instruction-tuning data in one or two languages can affect the outputs in languages whose instruction-tuning data was not poisoned.
Our method exhibits remarkable efficacy in models like mT5, BLOOM, and GPT-3.5-turbo, with high attack success rates, surpassing 95% in several languages.
arXiv Detail & Related papers (2024-04-30T14:43:57Z) - Comprehensive Assessment of Jailbreak Attacks Against LLMs [28.58973312098698]
We study 13 cutting-edge jailbreak methods from four categories, 160 questions from 16 violation categories, and six popular LLMs.
Our experimental results demonstrate that the optimized jailbreak prompts consistently achieve the highest attack success rates.
We discuss the trade-off between the attack performance and efficiency, as well as show that the transferability of the jailbreak prompts is still viable.
arXiv Detail & Related papers (2024-02-08T13:42:50Z) - Jailbreaking Attack against Multimodal Large Language Model [69.52466793164618]
This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs)
A maximum likelihood-based algorithm is proposed to find an emphimage Jailbreaking Prompt (imgJP)
Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models.
arXiv Detail & Related papers (2024-02-04T01:29:24Z) - Text Embedding Inversion Security for Multilingual Language Models [2.790855523145802]
Research shows that text can be reconstructed from embeddings, even without knowledge of the underlying model.
This study is the first to investigate multilingual inversion attacks, shedding light on the differences in attacks and defenses across monolingual and multilingual settings.
arXiv Detail & Related papers (2024-01-22T18:34:42Z) - Analyzing the Inherent Response Tendency of LLMs: Real-World
Instructions-Driven Jailbreak [26.741029482196534]
"Jailbreak Attack" is phenomenon where Large Language Models (LLMs) generate harmful responses when faced with malicious instructions.
We introduce a novel automatic jailbreak method RADIAL, which bypasses the security mechanism by amplifying the potential of LLMs to generate affirmation responses.
Our method achieves excellent attack performance on English malicious instructions with five open-source advanced LLMs while maintaining robust attack performance in executing cross-language attacks against Chinese malicious instructions.
arXiv Detail & Related papers (2023-12-07T08:29:58Z) - Multilingual Jailbreak Challenges in Large Language Models [96.74878032417054]
In this study, we reveal the presence of multilingual jailbreak challenges within large language models (LLMs)
We consider two potential risky scenarios: unintentional and intentional.
We propose a novel textscSelf-Defense framework that automatically generates multilingual training data for safety fine-tuning.
arXiv Detail & Related papers (2023-10-10T09:44:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.