Token-Level Adversarial Prompt Detection Based on Perplexity Measures
and Contextual Information
- URL: http://arxiv.org/abs/2311.11509v3
- Date: Sun, 18 Feb 2024 06:04:27 GMT
- Title: Token-Level Adversarial Prompt Detection Based on Perplexity Measures
and Contextual Information
- Authors: Zhengmian Hu, Gang Wu, Saayan Mitra, Ruiyi Zhang, Tong Sun, Heng
Huang, and Viswanathan Swaminathan
- Abstract summary: Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
- Score: 67.78183175605761
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In recent years, Large Language Models (LLMs) have emerged as pivotal tools in
various applications. However, these models are susceptible to adversarial
prompt attacks, where attackers can carefully curate input strings that mislead
LLMs into generating incorrect or undesired outputs. Previous work has revealed
that with relatively simple yet effective attacks based on discrete
optimization, it is possible to generate adversarial prompts that bypass
moderation and alignment of the models. This vulnerability to adversarial
prompts underscores a significant concern regarding the robustness and
reliability of LLMs. Our work aims to address this concern by introducing a
novel approach to detecting adversarial prompts at a token level, leveraging
the LLM's capability to predict the next token's probability. We measure the
model's per-token perplexity, where tokens predicted with high probability
are considered normal, and those exhibiting high perplexity are flagged as
adversarial. Additionally, our method integrates contextual understanding by
incorporating neighboring token information to encourage the detection of
contiguous adversarial prompt sequences. To this end, we design two algorithms
for adversarial prompt detection: one based on optimization techniques and
another on Probabilistic Graphical Models (PGM). Both methods admit
efficient solvers, keeping adversarial prompt detection computationally practical. Our
token-level detection result can be visualized as heatmap overlays on the text
sequence, allowing for a clearer and more intuitive representation of which
part of the text may contain adversarial prompts.
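Below is a minimal, self-contained sketch of the core detection idea described above. It is not the paper's optimization- or PGM-based algorithm: it simply scores each token by its negative log-likelihood under an off-the-shelf causal LM, averages scores over a small neighborhood as a rough stand-in for the contextual information the paper exploits, and flags high-surprisal tokens. The model name ("gpt2"), the window size, and the threshold are illustrative assumptions, not values from the paper.

```python
# Sketch: per-token surprisal scoring with neighbor smoothing.
# NOT the paper's exact algorithm; window size and threshold are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any causal LM exposing next-token probabilities works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def token_surprisal(text: str):
    """Per-token negative log-likelihood (surprisal) under the LM."""
    enc = tokenizer(text, return_tensors="pt")
    ids = enc["input_ids"]
    with torch.no_grad():
        logits = model(**enc).logits          # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = ids[:, 1:]                      # token i is scored by the prediction at i-1
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1).squeeze(0)
    tokens = tokenizer.convert_ids_to_tokens(ids[0])[1:]
    return tokens, nll


def flag_adversarial(text: str, window: int = 2, threshold: float = 8.0):
    """Smooth surprisal over neighboring tokens and flag contiguous high-perplexity spans."""
    tokens, nll = token_surprisal(text)
    # Moving average over a small neighborhood loosely stands in for the paper's
    # use of contextual information to encourage contiguous detections.
    kernel = torch.ones(1, 1, 2 * window + 1) / (2 * window + 1)
    smoothed = torch.nn.functional.conv1d(
        nll.view(1, 1, -1), kernel, padding=window
    ).view(-1)
    return [(tok, s.item(), s.item() > threshold) for tok, s in zip(tokens, smoothed)]


if __name__ == "__main__":
    for tok, score, flagged in flag_adversarial("Describe your day. xq!!zv##"):
        print(f"{tok!r:>12}  {score:5.2f}  {'ADV?' if flagged else ''}")
```

Flagged spans could then be rendered as a heatmap over the token sequence, mirroring the visualization described in the abstract.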
Related papers
- Palisade -- Prompt Injection Detection Framework [0.9620910657090188]
Large Language Models are vulnerable to malicious prompt injection attacks.
This paper proposes a novel NLP-based approach for prompt injection detection.
It emphasizes accuracy and optimization through a layered input screening process.
arXiv Detail & Related papers (2024-10-28T15:47:03Z)
- AdvQDet: Detecting Query-Based Adversarial Attacks with Adversarial Contrastive Prompt Tuning [93.77763753231338]
Adversarial Contrastive Prompt Tuning (ACPT) is proposed to fine-tune the CLIP image encoder to extract similar embeddings for any two intermediate adversarial queries.
We show that ACPT can detect 7 state-of-the-art query-based attacks with a >99% detection rate within 5 shots.
We also show that ACPT is robust to 3 types of adaptive attacks.
arXiv Detail & Related papers (2024-08-04T09:53:50Z)
- Detecting, Explaining, and Mitigating Memorization in Diffusion Models [49.438362005962375]
We introduce a straightforward yet effective method for detecting memorized prompts by inspecting the magnitude of text-conditional predictions.
Our proposed method seamlessly integrates without disrupting sampling algorithms, and delivers high accuracy even at the first generation step.
Building on our detection strategy, we unveil an explainable approach that shows the contribution of individual words or tokens to memorization.
arXiv Detail & Related papers (2024-07-31T16:13:29Z)
- The Adversarial Implications of Variable-Time Inference [47.44631666803983]
We present an approach that exploits a novel side channel in which the adversary simply measures the execution time of the algorithm used to post-process the predictions of the ML model under attack.
We investigate leakage from the non-maximum suppression (NMS) algorithm, which plays a crucial role in the operation of object detectors.
We demonstrate attacks against the YOLOv3 detector, leveraging the timing leakage to successfully evade object detection using adversarial examples, and perform dataset inference.
arXiv Detail & Related papers (2023-09-05T11:53:17Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, using digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- Learning to Separate Clusters of Adversarial Representations for Robust Adversarial Detection [50.03939695025513]
We propose a new probabilistic adversarial detector motivated by the recently introduced notion of non-robust features.
In this paper, we consider non-robust features as a common property of adversarial examples, and we deduce that it is possible to find a cluster in representation space corresponding to this property.
This idea leads us to estimate the probability distribution of adversarial representations in a separate cluster and to leverage that distribution for a likelihood-based adversarial detector.
arXiv Detail & Related papers (2020-12-07T07:21:18Z)