Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications
- URL: http://arxiv.org/abs/2406.06737v2
- Date: Sat, 26 Oct 2024 03:01:42 GMT
- Title: Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications
- Authors: Junlin Wang, Tianyi Yang, Roy Xie, Bhuwan Dhingra,
- Abstract summary: We introduce the Raccoon benchmark which comprehensively evaluates a model's susceptibility to prompt extraction attacks.
Our novel evaluation method assesses models under both defenseless and defended scenarios.
Our findings highlight universal susceptibility to prompt theft in the absence of defenses, with OpenAI models demonstrating notable resilience when protected.
- Score: 8.51254190797079
- License:
- Abstract: With the proliferation of LLM-integrated applications such as GPT-s, millions are deployed, offering valuable services through proprietary instruction prompts. These systems, however, are prone to prompt extraction attacks through meticulously designed queries. To help mitigate this problem, we introduce the Raccoon benchmark which comprehensively evaluates a model's susceptibility to prompt extraction attacks. Our novel evaluation method assesses models under both defenseless and defended scenarios, employing a dual approach to evaluate the effectiveness of existing defenses and the resilience of the models. The benchmark encompasses 14 categories of prompt extraction attacks, with additional compounded attacks that closely mimic the strategies of potential attackers, alongside a diverse collection of defense templates. This array is, to our knowledge, the most extensive compilation of prompt theft attacks and defense mechanisms to date. Our findings highlight universal susceptibility to prompt theft in the absence of defenses, with OpenAI models demonstrating notable resilience when protected. This paper aims to establish a more systematic benchmark for assessing LLM robustness against prompt extraction attacks, offering insights into their causes and potential countermeasures. Resources of Raccoon are publicly available at https://github.com/M0gician/RaccoonBench.
Related papers
- The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense [56.32083100401117]
We investigate why Vision Large Language Models (VLLMs) are prone to jailbreak attacks.
We then make a key observation: existing defense mechanisms suffer from an textbfover-prudence problem.
We find that the two representative evaluation methods for jailbreak often exhibit chance agreement.
arXiv Detail & Related papers (2024-11-13T07:57:19Z) - CALoR: Towards Comprehensive Model Inversion Defense [43.2642796582236]
Model Inversion Attacks (MIAs) aim at recovering privacy-sensitive training data from the knowledge encoded in released machine learning models.
Recent advances in the MIA field have significantly enhanced the attack performance under multiple scenarios.
We propose a robust defense mechanism, integrating Confidence Adaptation and Low-Rank compression.
arXiv Detail & Related papers (2024-10-08T08:44:01Z) - DV-FSR: A Dual-View Target Attack Framework for Federated Sequential Recommendation [4.980393474423609]
Federated recommendation (FedRec) preserves user privacy by enabling decentralized training of personalized models, but this architecture is inherently vulnerable to adversarial attacks.
We propose a novel dualview attack framework, named DV-FSR, which combines a sampling-based explicit strategy with a contrastive learning-based implicit gradient strategy to orchestrate a coordinated attack.
arXiv Detail & Related papers (2024-09-10T15:24:13Z) - Jailbreaking as a Reward Misspecification Problem [80.52431374743998]
We propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process.
We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness.
We present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space.
arXiv Detail & Related papers (2024-06-20T15:12:27Z) - MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z) - Defending Large Language Models Against Attacks With Residual Stream Activation Analysis [0.0]
Large Language Models (LLMs) are vulnerable to adversarial threats.
This paper presents an innovative defensive strategy, given white box access to an LLM.
We apply a novel methodology for analyzing distinctive activation patterns in the residual streams for attack prompt classification.
arXiv Detail & Related papers (2024-06-05T13:06:33Z) - A Novel Evaluation Framework for Assessing Resilience Against Prompt Injection Attacks in Large Language Models [0.0]
This study introduces a novel framework for quantifying the resilience of applications.
The framework incorporates innovative techniques designed to ensure representativeness, interpretability, and robustness.
Results revealed that Llama2, the newer model exhibited higher resilience compared to ChatGLM.
arXiv Detail & Related papers (2024-01-02T02:06:48Z) - Benchmarking and Defending Against Indirect Prompt Injection Attacks on
Large Language Models [82.98081731588717]
Integration of large language models with external content exposes applications to indirect prompt injection attacks.
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to evaluate the risk of such attacks.
We develop two black-box methods based on prompt learning and a white-box defense method based on fine-tuning with adversarial training.
arXiv Detail & Related papers (2023-12-21T01:08:39Z) - BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive
Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses.
We introduce the emphtoolns attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z) - Defense for Black-box Attacks on Anti-spoofing Models by Self-Supervised
Learning [71.17774313301753]
We explore the robustness of self-supervised learned high-level representations by using them in the defense against adversarial attacks.
Experimental results on the ASVspoof 2019 dataset demonstrate that high-level representations extracted by Mockingjay can prevent the transferability of adversarial examples.
arXiv Detail & Related papers (2020-06-05T03:03:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.