Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications
- URL: http://arxiv.org/abs/2406.06737v1
- Date: Mon, 10 Jun 2024 18:57:22 GMT
- Title: Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications
- Authors: Junlin Wang, Tianyi Yang, Roy Xie, Bhuwan Dhingra,
- Abstract summary: We introduce the Raccoon benchmark which comprehensively evaluates a model's susceptibility to prompt extraction attacks.
Our novel evaluation method assesses models under both defenseless and defended scenarios.
Our findings highlight universal susceptibility to prompt theft in the absence of defenses, with OpenAI models demonstrating notable resilience when protected.
- Score: 8.51254190797079
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the proliferation of LLM-integrated applications such as GPT-s, millions are deployed, offering valuable services through proprietary instruction prompts. These systems, however, are prone to prompt extraction attacks through meticulously designed queries. To help mitigate this problem, we introduce the Raccoon benchmark which comprehensively evaluates a model's susceptibility to prompt extraction attacks. Our novel evaluation method assesses models under both defenseless and defended scenarios, employing a dual approach to evaluate the effectiveness of existing defenses and the resilience of the models. The benchmark encompasses 14 categories of prompt extraction attacks, with additional compounded attacks that closely mimic the strategies of potential attackers, alongside a diverse collection of defense templates. This array is, to our knowledge, the most extensive compilation of prompt theft attacks and defense mechanisms to date. Our findings highlight universal susceptibility to prompt theft in the absence of defenses, with OpenAI models demonstrating notable resilience when protected. This paper aims to establish a more systematic benchmark for assessing LLM robustness against prompt extraction attacks, offering insights into their causes and potential countermeasures. Resources of Raccoon are publicly available at https://github.com/M0gician/RaccoonBench.
Related papers
- Jailbreaking as a Reward Misspecification Problem [80.52431374743998]
We propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process.
We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts.
We present ReMiss, a system for automated red teaming that generates adversarial prompts against various target aligned LLMs.
arXiv Detail & Related papers (2024-06-20T15:12:27Z) - MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z) - Defending Large Language Models Against Attacks With Residual Stream Activation Analysis [0.0]
Large Language Models (LLMs) are vulnerable to adversarial threats.
This paper presents an innovative defensive strategy, given white box access to an LLM.
We apply a novel methodology for analyzing distinctive activation patterns in the residual streams for attack prompt classification.
arXiv Detail & Related papers (2024-06-05T13:06:33Z) - PuriDefense: Randomized Local Implicit Adversarial Purification for
Defending Black-box Query-based Attacks [15.842917276255141]
Black-box query-based attacks threaten Machine Learning as a Service (ML) systems.
We propose an efficient defense mechanism, PuriDefense, that employs random patch-wise purifications with an ensemble of lightweight purification models at a low level of inference cost.
Our theoretical analysis suggests that this approach slows down the convergence of query-based attacks by incorporating randomness into purifications.
arXiv Detail & Related papers (2024-01-19T09:54:23Z) - A Novel Evaluation Framework for Assessing Resilience Against Prompt Injection Attacks in Large Language Models [0.0]
This study introduces a novel framework for quantifying the resilience of applications.
The framework incorporates innovative techniques designed to ensure representativeness, interpretability, and robustness.
Results revealed that Llama2, the newer model exhibited higher resilience compared to ChatGLM.
arXiv Detail & Related papers (2024-01-02T02:06:48Z) - Benchmarking and Defending Against Indirect Prompt Injection Attacks on
Large Language Models [82.98081731588717]
Integration of large language models with external content exposes applications to indirect prompt injection attacks.
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to evaluate the risk of such attacks.
We develop two black-box methods based on prompt learning and a white-box defense method based on fine-tuning with adversarial training.
arXiv Detail & Related papers (2023-12-21T01:08:39Z) - BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive
Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses.
We introduce the emphtoolns attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z) - A Unified Evaluation of Textual Backdoor Learning: Frameworks and
Benchmarks [72.7373468905418]
We develop an open-source toolkit OpenBackdoor to foster the implementations and evaluations of textual backdoor learning.
We also propose CUBE, a simple yet strong clustering-based defense baseline.
arXiv Detail & Related papers (2022-06-17T02:29:23Z) - I Know What You Trained Last Summer: A Survey on Stealing Machine
Learning Models and Defences [0.1031296820074812]
We study model stealing attacks, assessing their performance and exploring corresponding defence techniques in different settings.
We propose a taxonomy for attack and defence approaches, and provide guidelines on how to select the right attack or defence based on the goal and available resources.
arXiv Detail & Related papers (2022-06-16T21:16:41Z) - Defense for Black-box Attacks on Anti-spoofing Models by Self-Supervised
Learning [71.17774313301753]
We explore the robustness of self-supervised learned high-level representations by using them in the defense against adversarial attacks.
Experimental results on the ASVspoof 2019 dataset demonstrate that high-level representations extracted by Mockingjay can prevent the transferability of adversarial examples.
arXiv Detail & Related papers (2020-06-05T03:03:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.