Round Trip Translation Defence against Large Language Model Jailbreaking
Attacks
- URL: http://arxiv.org/abs/2402.13517v1
- Date: Wed, 21 Feb 2024 03:59:52 GMT
- Title: Round Trip Translation Defence against Large Language Model Jailbreaking
Attacks
- Authors: Canaan Yung, Hadi Mohaghegh Dolatabadi, Sarah Erfani, Christopher
Leckie
- Abstract summary: We propose the Round Trip Translation (RTT) method to defend against social-engineered attacks on large language models (LLMs).
RTT paraphrases the adversarial prompt and generalizes the idea conveyed, making it easier for LLMs to detect induced harmful behavior.
We are the first to attempt to mitigate the MathsAttack, reducing its attack success rate by almost 40%.
- Score: 12.664577378692703
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are susceptible to social-engineered attacks
that are human-interpretable but require a high level of comprehension for LLMs
to counteract. Existing defensive measures mitigate, at best, less than half of
these attacks. To address this issue, we propose the Round Trip
Translation (RTT) method, the first algorithm specifically designed to defend
against social-engineered attacks on LLMs. RTT paraphrases the adversarial
prompt and generalizes the idea conveyed, making it easier for LLMs to detect
induced harmful behavior. This method is versatile, lightweight, and
transferable to different LLMs. Our defense successfully mitigated over 70% of
Prompt Automatic Iterative Refinement (PAIR) attacks, making it, to the best of
our knowledge, the most effective defense against these attacks to date. We are
also the first to attempt to mitigate the MathsAttack, reducing its attack
success rate by almost 40%. Our code is publicly available at
https://github.com/Cancanxxx/Round_Trip_Translation_Defence
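
For intuition, the following is a minimal sketch of the round-trip-translation idea described in the abstract: the prompt is translated through one or more pivot languages and back to English, and the resulting paraphrase is what gets sent to the LLM. The translate and query_llm callables, the choice of pivot languages, and the number of round trips are illustrative placeholders; the authors' released implementation at the GitHub link above may differ.

    # Minimal sketch of a round-trip-translation (RTT) style defence.
    # `translate(text, src, dst)` and `query_llm(prompt)` are hypothetical
    # stand-ins for a machine-translation backend and a target LLM.
    from typing import Callable, Iterable

    def round_trip_translate(
        prompt: str,
        translate: Callable[[str, str, str], str],
        pivot_langs: Iterable[str] = ("fr", "de", "zh"),
        base_lang: str = "en",
    ) -> str:
        """Paraphrase `prompt` by translating it through pivot languages and back."""
        text = prompt
        for lang in pivot_langs:
            text = translate(text, base_lang, lang)  # base -> pivot language
            text = translate(text, lang, base_lang)  # pivot -> base (round trip)
        return text

    def defended_query(
        prompt: str,
        translate: Callable[[str, str, str], str],
        query_llm: Callable[[str], str],
    ) -> str:
        # The round-tripped prompt is a generalized paraphrase of the original,
        # which makes socially engineered phrasing easier for the LLM to flag.
        paraphrased = round_trip_translate(prompt, translate)
        return query_llm(paraphrased)

Chaining several pivot languages paraphrases the prompt more aggressively, at the cost of drifting further from the user's original wording.
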
Related papers
- Denial-of-Service Poisoning Attacks against Large Language Models [64.77355353440691]
LLMs are vulnerable to denial-of-service (DoS) attacks, where spelling errors or non-semantic prompts trigger endless outputs without generating an [EOS] token.
We propose poisoning-based DoS attacks for LLMs, demonstrating that injecting a single poisoned sample designed for DoS purposes can break the output length limit.
arXiv Detail & Related papers (2024-10-14T17:39:31Z)
- ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the time needed to compute adversarial suffixes and achieves a much higher attack success rate than existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z)
- Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z)
- PAL: Proxy-Guided Black-Box Attack on Large Language Models [55.57987172146731]
Large Language Models (LLMs) have surged in popularity in recent months, but they can be manipulated into generating harmful content.
We introduce the Proxy-Guided Attack on LLMs (PAL), the first optimization-based attack on LLMs in a black-box query-only setting.
Our attack achieves an 84% attack success rate (ASR) on GPT-3.5-Turbo and 48% on Llama-2-7B, compared to 4% for the current state of the art.
arXiv Detail & Related papers (2024-02-15T02:54:49Z)
- Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt and then aggregates the corresponding predictions to detect adversarial inputs (a rough sketch of this perturb-and-aggregate idea appears after this list).
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
- LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked [19.242818141154086]
Large language models (LLMs) are popular for high-quality text generation.
LLMs can produce harmful content even when aligned with human values.
We propose LLM Self Defense, a simple approach to defend against these attacks.
arXiv Detail & Related papers (2023-08-14T17:54:10Z)
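
As referenced in the SmoothLLM entry above, here is a rough sketch of that perturb-and-aggregate defence. The query_llm and looks_jailbroken callables, the perturbation type (random character swaps), and the majority-vote rule are illustrative assumptions; the SmoothLLM paper's own perturbation schemes and aggregation rule may differ.

    # Rough sketch of the perturb-and-aggregate idea attributed to SmoothLLM.
    # `query_llm` and `looks_jailbroken` are hypothetical stand-ins.
    import random
    import string
    from typing import Callable

    def perturb(prompt: str, swap_rate: float = 0.1) -> str:
        """Randomly swap a fraction of characters; adversarial suffixes are brittle to this."""
        chars = list(prompt)
        for i in range(len(chars)):
            if random.random() < swap_rate:
                chars[i] = random.choice(string.printable)
        return "".join(chars)

    def smoothed_response(
        prompt: str,
        query_llm: Callable[[str], str],
        looks_jailbroken: Callable[[str], bool],
        num_copies: int = 10,
    ) -> str:
        # Query the LLM on several independently perturbed copies and take a
        # majority vote over whether the responses look jailbroken.
        responses = [query_llm(perturb(prompt)) for _ in range(num_copies)]
        harmful_votes = sum(looks_jailbroken(r) for r in responses)
        if harmful_votes > num_copies // 2:
            return "Request refused: the prompt appears adversarial."
        benign = [r for r in responses if not looks_jailbroken(r)]
        return benign[0] if benign else responses[0]
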
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.