Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of
LLMs through a Global Scale Prompt Hacking Competition
- URL: http://arxiv.org/abs/2311.16119v3
- Date: Sun, 3 Mar 2024 00:12:16 GMT
- Title: Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of
LLMs through a Global Scale Prompt Hacking Competition
- Authors: Sander Schulhoff, Jeremy Pinto, Anaum Khan, Louis-Fran\c{c}ois
Bouchard, Chenglei Si, Svetlina Anati, Valen Tagliabue, Anson Liu Kost,
Christopher Carnahan, Jordan Boyd-Graber
- Abstract summary: Large Language Models are vulnerable to prompt injection and jailbreaking.
We launch a global prompt hacking competition, which allows for free-form human input attacks.
We elicit 600K+ adversarial prompts against three state-of-the-art LLMs.
- Score: 8.560772603154545
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are deployed in interactive contexts with direct
user engagement, such as chatbots and writing assistants. These deployments are
vulnerable to prompt injection and jailbreaking (collectively, prompt hacking),
in which models are manipulated to ignore their original instructions and
follow potentially malicious ones. Although widely acknowledged as a
significant security threat, there is a dearth of large-scale resources and
quantitative studies on prompt hacking. To address this lacuna, we launch a
global prompt hacking competition, which allows for free-form human input
attacks. We elicit 600K+ adversarial prompts against three state-of-the-art
LLMs. We describe the dataset, which empirically verifies that current LLMs can
indeed be manipulated via prompt hacking. We also present a comprehensive
taxonomical ontology of the types of adversarial prompts.
Related papers
- Embedding-based classifiers can detect prompt injection attacks [5.820776057182452]
Large Language Models (LLMs) are vulnerable to adversarial attacks, particularly prompt injection attacks.
We propose a novel approach based on embedding-based Machine Learning (ML) classifiers to protect LLM-based applications against this severe threat.
arXiv Detail & Related papers (2024-10-29T17:36:59Z) - Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks.
Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks.
Our experiments show that SecAlign robustifies the LLM substantially with a negligible hurt on model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z) - Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context [49.13497493053742]
This research explores converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing.
We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM.
Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs.
arXiv Detail & Related papers (2024-07-19T19:47:26Z) - Defending Against Indirect Prompt Injection Attacks With Spotlighting [11.127479817618692]
In common applications, multiple inputs can be processed by concatenating them together into a single stream of text.
Indirect prompt injection attacks take advantage of this vulnerability by embedding adversarial instructions into untrusted data being processed alongside user commands.
We introduce spotlighting, a family of prompt engineering techniques that can be used to improve LLMs' ability to distinguish among multiple sources of input.
arXiv Detail & Related papers (2024-03-20T15:26:23Z) - AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z) - DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers [74.7446827091938]
We introduce an automatic prompt textbfDecomposition and textbfReconstruction framework for jailbreak textbfAttack (DrAttack)
DrAttack includes three key components: (a) Decomposition' of the original prompt into sub-prompts, (b) Reconstruction' of these sub-prompts implicitly by in-context learning with semantically similar but harmless reassembling demo, and (c) a Synonym Search' of sub-prompts, aiming to find sub-prompts' synonyms that maintain the original intent while
arXiv Detail & Related papers (2024-02-25T17:43:29Z) - ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate to existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z) - RatGPT: Turning online LLMs into Proxies for Malware Attacks [0.0]
We present a proof-of-concept where ChatGPT is used for the dissemination of malicious software while evading detection.
We also present the general approach as well as essential elements in order to stay undetected and make the attack a success.
arXiv Detail & Related papers (2023-08-17T20:54:39Z) - Not what you've signed up for: Compromising Real-World LLM-Integrated
Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications.
We show how attackers can override original instructions and employed controls using Prompt Injection attacks.
We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.