Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search
- URL: http://arxiv.org/abs/2408.08899v1
- Date: Sun, 11 Aug 2024 20:31:52 GMT
- Title: Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search
- Authors: Robert J. Moss,
- Abstract summary: Eliciting harmful behavior from large language models (LLMs) is an important task to ensure the proper alignment and safety of the models.
This work frames the red-teaming problem as a Markov decision process (MDP) and uses Monte Carlo tree search to find harmful behaviors of black-box, closed-source LLMs.
The proposed algorithm, Kov, trains on white-box LLMs to optimize the adversarial attacks and periodically evaluates responses from the black-box LLM to guide the search towards more harmful black-box behaviors.
- Score: 1.223779595809275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Eliciting harmful behavior from large language models (LLMs) is an important task to ensure the proper alignment and safety of the models. Often when training LLMs, ethical guidelines are followed yet alignment failures may still be uncovered through red teaming adversarial attacks. This work frames the red-teaming problem as a Markov decision process (MDP) and uses Monte Carlo tree search to find harmful behaviors of black-box, closed-source LLMs. We optimize token-level prompt suffixes towards targeted harmful behaviors on white-box LLMs and include a naturalistic loss term, log-perplexity, to generate more natural language attacks for better interpretability. The proposed algorithm, Kov, trains on white-box LLMs to optimize the adversarial attacks and periodically evaluates responses from the black-box LLM to guide the search towards more harmful black-box behaviors. In our preliminary study, results indicate that we can jailbreak black-box models, such as GPT-3.5, in only 10 queries, yet fail on GPT-4$-$which may indicate that newer models are more robust to token-level attacks. All work to reproduce these results is open sourced (https://github.com/sisl/Kov.jl).
Related papers
- Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models [27.397408870544453]
Large Language Models (LLMs) have become a focal point in the rapidly evolving field of artificial intelligence.
A critical concern is the presence of toxic content within the pre-training corpus of these models, which can lead to the generation of inappropriate outputs.
This paper proposes a target-driven attack paradigm that focuses on directly eliciting the target response instead of optimizing the prompts.
arXiv Detail & Related papers (2024-08-27T08:12:08Z) - Tree-of-Traversals: A Zero-Shot Reasoning Algorithm for Augmenting Black-box Language Models with Knowledge Graphs [72.89652710634051]
Knowledge graphs (KGs) complement Large Language Models (LLMs) by providing reliable, structured, domain-specific, and up-to-date external knowledge.
We introduce Tree-of-Traversals, a novel zero-shot reasoning algorithm that enables augmentation of black-box LLMs with one or more KGs.
arXiv Detail & Related papers (2024-07-31T06:01:24Z) - QROA: A Black-Box Query-Response Optimization Attack on LLMs [2.7624021966289605]
Large Language Models (LLMs) have surged in popularity in recent months, yet they possess capabilities for generating harmful content when manipulated.
This study introduces the Query-Response Optimization Attack (QROA), an optimization-based strategy designed to exploit LLMs through a black-box, query-only interaction.
arXiv Detail & Related papers (2024-06-04T07:27:36Z) - Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users.
We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set.
We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z) - Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs [9.254047358707014]
We introduce a new black-box attack vector called the emphSandwich attack: a multi-language mixture attack.
Our experiments with five different models, namely Google's Bard, Gemini Pro, LLaMA-2-70-B-Chat, GPT-3.5-Turbo, GPT-4, and Claude-3-OPUS, show that this attack vector can be used by adversaries to generate harmful responses.
arXiv Detail & Related papers (2024-04-09T18:29:42Z) - Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z) - PAL: Proxy-Guided Black-Box Attack on Large Language Models [55.57987172146731]
Large Language Models (LLMs) have surged in popularity in recent months, but they have demonstrated capabilities to generate harmful content when manipulated.
We introduce the Proxy-Guided Attack on LLMs (PAL), the first optimization-based attack on LLMs in a black-box query-only setting.
Our attack achieves 84% attack success rate (ASR) on GPT-3.5-Turbo and 48% on Llama-2-7B, compared to 4% for the current state of the art.
arXiv Detail & Related papers (2024-02-15T02:54:49Z) - Make Them Spill the Beans! Coercive Knowledge Extraction from
(Production) LLMs [31.80386572346993]
We exploit the fact that even when an LLM rejects a toxic request, a harmful response often hides deep in the output logits.
This approach differs from and outperforms jail-breaking methods, achieving 92% effectiveness compared to 62%, and is 10 to 20 times faster.
Our findings indicate that interrogation can extract toxic knowledge even from models specifically designed for coding tasks.
arXiv Detail & Related papers (2023-12-08T01:41:36Z) - SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs)
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z) - Universal and Transferable Adversarial Attacks on Aligned Language
Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.