Jailbreaking Large Language Models in Infinitely Many Ways
- URL: http://arxiv.org/abs/2501.10800v1
- Date: Sat, 18 Jan 2025 15:39:53 GMT
- Title: Jailbreaking Large Language Models in Infinitely Many Ways
- Authors: Oliver Goldstein, Emanuele La Malfa, Felix Drinkall, Samuele Marro, Michael Wooldridge
- Abstract summary: We show how one can bypass the safeguards of the most powerful open- and closed-source LLMs and generate content that explicitly violates their safety policies.
For two categories of attacks that are straightforward to implement, we discuss two defensive strategies, one in token space and the other in embedding space.
- Score: 3.5674816606221182
- License:
- Abstract: We discuss the "Infinitely Many Meanings" attacks (IMM), a category of jailbreaks that leverages the increasing capabilities of a model to handle paraphrases and encoded communications in order to bypass its defensive mechanisms. The viability of IMMs grows with a model's capability to handle and bind the semantics of simple mappings between tokens; they work extremely well in practice, posing a concrete threat to users of the most powerful commercial LLMs. We show how one can bypass the safeguards of the most powerful open- and closed-source LLMs and generate content that explicitly violates their safety policies. One can protect against IMMs by improving the guardrails and making them scale with the LLMs' capabilities. For two categories of attacks that are straightforward to implement, i.e., bijection and encoding, we discuss two defensive strategies, one in token space and the other in embedding space. We conclude with some research questions we believe should be prioritised to enhance the defensive mechanisms of LLMs and our understanding of their safety.
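To make the bijection/encoding idea concrete, below is a minimal, hypothetical sketch of the kind of simple mapping such an attack relies on: a randomly generated character-level bijection applied to a prompt. The function names (make_bijection, encode) and the character-level granularity are illustrative assumptions, not the paper's actual construction, which may operate over tokens or richer encodings.

```python
# Illustrative sketch only: a character-level bijection used to encode a
# prompt, in the spirit of the paper's "bijection" attack category. The
# names and the character-level granularity are assumptions; the paper's
# actual mappings may operate over tokens instead.
import random
import string


def make_bijection(seed=0):
    """Build a random one-to-one mapping over lowercase letters."""
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    return dict(zip(letters, shuffled))


def encode(text, mapping):
    """Apply the bijection character by character; leave other symbols untouched."""
    return "".join(mapping.get(c, c) for c in text.lower())


# Hypothetical usage: the attacker first teaches the model the mapping
# (e.g., by listing the letter pairs in the prompt) and then sends only the
# encoded request, so surface-level keyword filters never see the plaintext.
mapping = make_bijection(seed=42)
print(encode("example request", mapping))
```

One plausible reading of the token-space defence mentioned above is that a guardrail which scales with model capability must be able to recognise or undo such mappings before content filtering is applied.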
Related papers
- Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models [8.024771725860127]
Large Language Models (LLMs) remain vulnerable to jailbreak attacks that bypass their safety mechanisms.
We introduce a novel scalable jailbreak attack that preempts the activation of an LLM's safety policies by occupying its computational resources.
arXiv Detail & Related papers (2024-10-05T15:10:01Z)
- Endless Jailbreaks with Bijection Learning [3.5963161678592828]
We introduce a powerful attack algorithm which fuzzes LLMs for safety vulnerabilities using randomly-generated encodings.
Our attack is extremely effective on a wide range of frontier language models.
arXiv Detail & Related papers (2024-10-02T07:40:56Z)
- Jailbreak Attacks and Defenses Against Large Language Models: A Survey [22.392989536664288]
Large Language Models (LLMs) have performed exceptionally in various text-generative tasks.
"jailbreaking" induces the model to generate malicious responses against the usage policy and society.
We propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods.
arXiv Detail & Related papers (2024-07-05T06:57:30Z)
- Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs).
Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.
Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
- AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose Adaptive Shield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)
- Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z)
- Round Trip Translation Defence against Large Language Model Jailbreaking Attacks [12.664577378692703]
We propose the Round Trip Translation (RTT) method to defend against social-engineered attacks on large language models (LLMs).
RTT paraphrases the adversarial prompt and generalizes the idea conveyed, making it easier for LLMs to detect induced harmful behavior.
We are the first to attempt to mitigate MathsAttack, reducing its attack success rate by almost 40%.
arXiv Detail & Related papers (2024-02-21T03:59:52Z)
- Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks [55.603893267803265]
Large Language Models (LLMs) are susceptible to Jailbreaking attacks.
Jailbreaking attacks aim to extract harmful information by subtly modifying the attack query.
We focus on a new attack form, called Contextual Interaction Attack.
arXiv Detail & Related papers (2024-02-14T13:45:19Z)
- Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
- Universal and Transferable Adversarial Attacks on Aligned Language Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.