Jatmo: Prompt Injection Defense by Task-Specific Finetuning
- URL: http://arxiv.org/abs/2312.17673v2
- Date: Mon, 8 Jan 2024 19:11:26 GMT
- Title: Jatmo: Prompt Injection Defense by Task-Specific Finetuning
- Authors: Julien Piet, Maha Alrashed, Chawin Sitawarin, Sizhe Chen, Zeming Wei,
Elizabeth Sun, Basel Alomair, and David Wagner
- Abstract summary: Jatmo is a method for generating task-specific models resilient to prompt-injection attacks.
It harnesses a teacher instruction-tuned model to generate a task-specific dataset, which is then used to fine-tune a base model.
Experiments show that Jatmo models provide similar quality of outputs on their specific task as standard LLMs, while being resilient to prompt injections.
- Score: 8.213552455778743
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) are attracting significant research attention
due to their instruction-following abilities, allowing users and developers to
leverage LLMs for a variety of tasks. However, LLMs are vulnerable to
prompt-injection attacks: a class of attacks that hijack the model's
instruction-following abilities, changing responses to prompts to undesired,
possibly malicious ones. In this work, we introduce Jatmo, a method for
generating task-specific models resilient to prompt-injection attacks. Jatmo
leverages the fact that LLMs can only follow instructions once they have
undergone instruction tuning. It harnesses a teacher instruction-tuned model to
generate a task-specific dataset, which is then used to fine-tune a base model
(i.e., a non-instruction-tuned model). Jatmo only needs a task prompt and a
dataset of inputs for the task: it uses the teacher model to generate outputs.
For situations with no pre-existing datasets, Jatmo can use a single example,
or in some cases none at all, to produce a fully synthetic dataset. Our
experiments on seven tasks show that Jatmo models provide similar quality of
outputs on their specific task as standard LLMs, while being resilient to
prompt injections. The best attacks succeeded in less than 0.5% of cases
against our models, versus 87% success rate against GPT-3.5-Turbo. We release
Jatmo at https://github.com/wagner-group/prompt-injection-defense.
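The workflow described in the abstract (teacher labels task inputs, base model is fine-tuned on the resulting pairs) can be illustrated with a short sketch. The following is a minimal, hedged Python example and not the released implementation at the repository above: it assumes an OpenAI-compatible client, and the model names ("gpt-3.5-turbo" as the instruction-tuned teacher, "davinci-002" as the non-instruction-tuned base model) and task inputs are placeholders.
```python
"""Illustrative sketch of a Jatmo-style pipeline (assumptions noted in comments)."""
import json
from openai import OpenAI  # assumes openai>=1.0 Python client

client = OpenAI()

TASK_PROMPT = "Summarize the following news article in one sentence."  # example task prompt
TEACHER_MODEL = "gpt-3.5-turbo"  # instruction-tuned teacher (placeholder)
BASE_MODEL = "davinci-002"       # base, non-instruction-tuned model to fine-tune (placeholder)

def label_with_teacher(inputs: list[str]) -> list[dict]:
    """Step 1: use the teacher model to generate an output for each task input."""
    examples = []
    for text in inputs:
        resp = client.chat.completions.create(
            model=TEACHER_MODEL,
            messages=[
                {"role": "system", "content": TASK_PROMPT},
                {"role": "user", "content": text},
            ],
        )
        examples.append({"prompt": text, "completion": resp.choices[0].message.content})
    return examples

def finetune_base_model(examples: list[dict], path: str = "jatmo_train.jsonl") -> str:
    """Step 2: write prompt/completion pairs and fine-tune the base model on them."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
    training_file = client.files.create(file=open(path, "rb"), purpose="fine-tune")
    job = client.fine_tuning.jobs.create(training_file=training_file.id, model=BASE_MODEL)
    return job.id

if __name__ == "__main__":
    task_inputs = ["<article text 1>", "<article text 2>"]  # pre-existing or synthetic inputs
    print("fine-tuning job:", finetune_base_model(label_with_teacher(task_inputs)))
```
Because the resulting model was never instruction-tuned, instructions embedded in its input are treated as ordinary task data rather than commands, which is the property the abstract says Jatmo relies on.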
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks.
Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks.
Our experiments show that SecAlign substantially robustifies the LLM with negligible loss of model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z)
- HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models [92.85175340702125]
We distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels.
We propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions.
Our HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost.
arXiv Detail & Related papers (2024-10-02T13:12:13Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift, where injected text causes the LLM to deviate from the user's original task, allows attackers to exfiltrate data or influence the LLM's output for other users.
We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set.
We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
- StruQ: Defending Against Prompt Injection with Structured Queries [10.22774624798198]
Large Language Models (LLMs) can perform text-based tasks by utilizing their advanced language understanding capabilities.
Prompt injection attacks are an important threat: they trick the model into deviating from the original application's instructions and instead following user directives.
We introduce structured queries, a general approach to tackle this problem.
Our system significantly improves resistance to prompt injection attacks, with little or no impact on utility.
arXiv Detail & Related papers (2024-02-09T12:15:51Z)
- Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game [86.66627242073724]
This paper presents a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection.
To the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following LLMs.
We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking.
arXiv Detail & Related papers (2023-11-02T06:13:36Z)