Related papers: StruQ: Defending Against Prompt Injection with Structured Queries

StruQ: Defending Against Prompt Injection with Structured Queries

URL: http://arxiv.org/abs/2402.06363v2
Date: Wed, 25 Sep 2024 19:48:39 GMT
Title: StruQ: Defending Against Prompt Injection with Structured Queries
Authors: Sizhe Chen, Julien Piet, Chawin Sitawarin, David Wagner,
Abstract summary: Large Language Models (LLMs) can perform text-based tasks by utilizing their advanced language understanding capabilities. Prompt injection attacks are an important threat, they trick the model into deviating from the original application's instructions and instead follow user directives. We introduce structured queries, a general approach to tackle this problem. Our system significantly improves resistance to prompt injection attacks, with little or no impact on utility.
Score: 10.22774624798198
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in Large Language Models (LLMs) enable exciting LLM-integrated applications, which perform text-based tasks by utilizing their advanced language understanding capabilities. However, as LLMs have improved, so have the attacks against them. Prompt injection attacks are an important threat: they trick the model into deviating from the original application's instructions and instead follow user directives. These attacks rely on the LLM's ability to follow instructions and inability to separate prompts and user data. We introduce structured queries, a general approach to tackle this problem. Structured queries separate prompts and data into two channels. We implement a system that supports structured queries. This system is made of (1) a secure front-end that formats a prompt and user data into a special format, and (2) a specially trained LLM that can produce high-quality outputs from these inputs. The LLM is trained using a novel fine-tuning strategy: we convert a base (non-instruction-tuned) LLM to a structured instruction-tuned model that will only follow instructions in the prompt portion of a query. To do so, we augment standard instruction tuning datasets with examples that also include instructions in the data portion of the query, and fine-tune the model to ignore these. Our system significantly improves resistance to prompt injection attacks, with little or no impact on utility. Our code is released at https://github.com/Sizhe-Chen/StruQ.

Related papers

Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction [68.6543680065379]
Large language models (LLMs) are vulnerable to prompt injection attacks. We propose a novel defense method that leverages, rather than suppresses, the instruction-following abilities of LLMs.
arXiv Detail & Related papers (2025-04-29T07:13:53Z)
ASIDE: Architectural Separation of Instructions and Data in Language Models [87.16417239344285]
ASIDE allows language models to clearly separate instructions and data at the level of embeddings.<n>We demonstrate experimentally across a range of models, instruction-tuning LLMs with ASIDE leads to highly increased instruction-data separation without a loss in model utility.<n>We provide insights into the mechanism underlying our method through an analysis of the model representations.
arXiv Detail & Related papers (2025-03-13T17:17:17Z)
Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks. Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks. Our experiments show that SecAlign robustifies the LLM substantially with a negligible hurt on model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z)
A test suite of prompt injection attacks for LLM-based machine translation [4.459306403129608]
LLM-based NLP systems typically work by embedding their input data into prompt templates which contain instructions and/or in-context examples. Recently, Sun and Miceli-Barone proposed a class of PIAs against LLM-based machine translation. We extend this approach to all the language pairs of the WMT 2024 General Machine Translation task.
arXiv Detail & Related papers (2024-10-07T14:01:20Z)
Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users. We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set. We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions [21.76697662025996]
LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts. We propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. We apply this method to GPT-3.5, showing that it drastically increases robustness -- even for attack types not seen during training -- while imposing minimal degradations on standard capabilities.
arXiv Detail & Related papers (2024-04-19T22:55:23Z)
AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks. Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)
Jatmo: Prompt Injection Defense by Task-Specific Finetuning [8.213552455778743]
Jatmo is a method for generating task-specific models resilient to prompt-injection attacks. It harnesses a teacher instruction-tuned model to generate a task-specific dataset, which is then used to fine-tune a base model. Experiments show that Jatmo models provide similar quality of outputs on their specific task as standard LLMs, while being resilient to prompt injections.
arXiv Detail & Related papers (2023-12-29T16:37:53Z)
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game [86.66627242073724]
This paper presents a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection. To the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following LLMs. We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking.
arXiv Detail & Related papers (2023-11-02T06:13:36Z)
Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection [70.28425745910711]
Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following. This capability brings with it the risk of prompt injection attacks. We evaluate the robustness of instruction-following LLMs against such attacks.
arXiv Detail & Related papers (2023-08-17T06:21:50Z)
Guiding Large Language Models via Directional Stimulus Prompting [114.84930073977672]
We introduce Directional Stimulus Prompting, a novel framework for guiding black-box large language models (LLMs) toward specific desired outputs. Instead of directly adjusting LLMs, our method employs a small tunable policy model to generate an auxiliary directional stimulus prompt for each input instance.
arXiv Detail & Related papers (2023-02-22T17:44:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.