Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
- URL: http://arxiv.org/abs/2311.01011v1
- Date: Thu, 2 Nov 2023 06:13:36 GMT
- Title: Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
- Authors: Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke
Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor
Darrell, Alan Ritter, Stuart Russell
- Abstract summary: This paper presents a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection.
To the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following LLMs.
We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking.
- Score: 86.66627242073724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Large Language Models (LLMs) are increasingly being used in real-world
applications, they remain vulnerable to prompt injection attacks: malicious
third party prompts that subvert the intent of the system designer. To help
researchers study this problem, we present a dataset of over 126,000 prompt
injection attacks and 46,000 prompt-based "defenses" against prompt injection,
all created by players of an online game called Tensor Trust. To the best of
our knowledge, this is currently the largest dataset of human-generated
adversarial examples for instruction-following LLMs. The attacks in our dataset
have a lot of easily interpretable stucture, and shed light on the weaknesses
of LLMs. We also use the dataset to create a benchmark for resistance to two
types of prompt injection, which we refer to as prompt extraction and prompt
hijacking. Our benchmark results show that many models are vulnerable to the
attack strategies in the Tensor Trust dataset. Furthermore, we show that some
attack strategies from the dataset generalize to deployed LLM-based
applications, even though they have a very different set of constraints to the
game. We release all data and source code at https://tensortrust.ai/paper
Related papers
- Fine-tuned Large Language Models (LLMs): Improved Prompt Injection Attacks Detection [6.269725911814401]
Large language models (LLMs) are becoming a popular tool as they have significantly advanced in their capability to tackle a wide range of language-based tasks.
However, LLMs applications are highly vulnerable to prompt injection attacks, which poses a critical problem.
This project explores the security vulnerabilities in relation to prompt injection attacks.
arXiv Detail & Related papers (2024-10-28T00:36:21Z) - Denial-of-Service Poisoning Attacks against Large Language Models [64.77355353440691]
LLMs are vulnerable to denial-of-service (DoS) attacks, where spelling errors or non-semantic prompts trigger endless outputs without generating an [EOS] token.
We propose poisoning-based DoS attacks for LLMs, demonstrating that injecting a single poisoned sample designed for DoS purposes can break the output length limit.
arXiv Detail & Related papers (2024-10-14T17:39:31Z) - Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks.
Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks.
Our experiments show that SecAlign robustifies the LLM substantially with a negligible hurt on model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z) - Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context [49.13497493053742]
This research explores converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing.
We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM.
Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs.
arXiv Detail & Related papers (2024-07-19T19:47:26Z) - Instruction Backdoor Attacks Against Customized LLMs [37.92008159382539]
We propose the first instruction backdoor attacks against applications integrated with untrusted customized LLMs.
Our attack includes 3 levels of attacks: word-level, syntax-level, and semantic-level, which adopt different types of triggers with progressive stealthiness.
We propose two defense strategies and demonstrate their effectiveness in reducing such attacks.
arXiv Detail & Related papers (2024-02-14T13:47:35Z) - Formalizing and Benchmarking Prompt Injection Attacks and Defenses [59.57908526441172]
We propose a framework to formalize prompt injection attacks.
Based on our framework, we design a new attack by combining existing ones.
Our work provides a common benchmark for quantitatively evaluating future prompt injection attacks and defenses.
arXiv Detail & Related papers (2023-10-19T15:12:09Z) - Not what you've signed up for: Compromising Real-World LLM-Integrated
Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications.
We show how attackers can override original instructions and employed controls using Prompt Injection attacks.
We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.