Token-Level Adversarial Prompt Detection Based on Perplexity Measures
and Contextual Information
- URL: http://arxiv.org/abs/2311.11509v3
- Date: Sun, 18 Feb 2024 06:04:27 GMT
- Title: Token-Level Adversarial Prompt Detection Based on Perplexity Measures
and Contextual Information
- Authors: Zhengmian Hu, Gang Wu, Saayan Mitra, Ruiyi Zhang, Tong Sun, Heng
Huang, and Viswanathan Swaminathan
- Abstract summary: Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
- Score: 67.78183175605761
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In recent years, Large Language Models (LLMs) have emerged as pivotal tools in
various applications. However, these models are susceptible to adversarial
prompt attacks, where attackers can carefully curate input strings that mislead
LLMs into generating incorrect or undesired outputs. Previous work has revealed
that with relatively simple yet effective attacks based on discrete
optimization, it is possible to generate adversarial prompts that bypass
moderation and alignment of the models. This vulnerability to adversarial
prompts underscores a significant concern regarding the robustness and
reliability of LLMs. Our work aims to address this concern by introducing a
novel approach to detecting adversarial prompts at a token level, leveraging
the LLM's capability to predict the next token's probability. We measure the
degree of the model's perplexity, where tokens predicted with high probability
are considered normal, and those exhibiting high perplexity are flagged as
adversarial. Additionally, our method integrates context understanding by
incorporating neighboring token information to encourage the detection of
contiguous adversarial prompt sequences. To this end, we design two algorithms
for adversarial prompt detection: one based on optimization techniques and
another on Probabilistic Graphical Models (PGMs). Both are equipped with
efficient solvers, enabling fast adversarial prompt detection. Our
token-level detection result can be visualized as heatmap overlays on the text
sequence, allowing for a clearer and more intuitive representation of which
part of the text may contain adversarial prompts.
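The core idea above (score each token by its surprisal under the LM, then smooth with neighboring tokens so that contiguous high-perplexity runs are flagged) can be sketched as follows. This is a minimal illustration, not the paper's optimization- or PGM-based algorithms; the function name, the `window` and `threshold` parameters, and the example probabilities are all illustrative assumptions, and the per-token probabilities are presumed to come from a causal LM forward pass performed elsewhere.

```python
import math

def flag_adversarial_tokens(token_probs, window=1, threshold=3.0):
    """Flag tokens whose smoothed negative log-likelihood exceeds a threshold.

    token_probs: per-token next-token probabilities, assumed to be obtained
        from a causal LM (e.g. one forward pass over the prompt).
    window: number of neighbors on each side used for context smoothing.
    threshold: surprisal cutoff in nats; tokens above it are flagged.
    """
    # Per-token surprisal (negative log-likelihood, in nats); clamp to
    # avoid log(0) on degenerate probabilities.
    nll = [-math.log(max(p, 1e-12)) for p in token_probs]

    flags = []
    for i in range(len(nll)):
        # Average surprisal over a small neighborhood: an isolated rare
        # but benign token is not flagged, while a contiguous run of
        # high-perplexity tokens (typical of optimized suffixes) is.
        lo, hi = max(0, i - window), min(len(nll), i + window + 1)
        smoothed = sum(nll[lo:hi]) / (hi - lo)
        flags.append(smoothed > threshold)
    return flags

# Hypothetical input: a fluent prefix followed by a gibberish suffix.
probs = [0.6, 0.4, 0.5, 0.001, 0.002, 0.001]
print(flag_adversarial_tokens(probs))  # → [False, False, False, True, True, True]
```

The per-token flags could then be rendered as the heatmap overlay described above, with smoothed surprisal mapped to color intensity.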
Related papers
- Scalable Ensemble-based Detection Method against Adversarial Attacks for
speaker verification [73.30974350776636]
This paper comprehensively compares mainstream purification techniques in a unified framework.
We propose an easy-to-follow ensemble approach that integrates advanced purification modules for detection.
arXiv Detail & Related papers (2023-12-14T03:04:05Z)
- The Adversarial Implications of Variable-Time Inference [47.44631666803983]
We present an approach that exploits a novel side channel in which the adversary simply measures the execution time of the algorithm used to post-process the predictions of the ML model under attack.
We investigate leakage from the non-maximum suppression (NMS) algorithm, which plays a crucial role in the operation of object detectors.
We demonstrate attacks against the YOLOv3 detector, leveraging the timing leakage to successfully evade object detection using adversarial examples, and perform dataset inference.
arXiv Detail & Related papers (2023-09-05T11:53:17Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- TextShield: Beyond Successfully Detecting Adversarial Sentences in Text Classification [6.781100829062443]
Adversarial attack serves as a major challenge for neural network models in NLP, which precludes the model's deployment in safety-critical applications.
Previous detection methods are incapable of giving correct predictions on adversarial sentences.
We propose a saliency-based detector, which can effectively detect whether an input sentence is adversarial or not.
arXiv Detail & Related papers (2023-02-03T22:58:07Z)
- Open-Set Likelihood Maximization for Few-Shot Learning [36.97433312193586]
We tackle the Few-Shot Open-Set Recognition (FSOSR) problem, i.e. classifying instances among a set of classes for which we only have a few labeled samples.
We explore the popular transductive setting, which leverages the unlabelled query instances at inference.
Motivated by the observation that existing transductive methods perform poorly in open-set scenarios, we propose a generalization of the maximum likelihood principle.
arXiv Detail & Related papers (2023-01-20T01:56:19Z)
- ADDMU: Detection of Far-Boundary Adversarial Examples with Data and Model Uncertainty Estimation [125.52743832477404]
Adversarial Examples Detection (AED) is a crucial defense technique against adversarial attacks.
We propose a new technique, ADDMU, which combines two types of uncertainty estimation for both regular and far-boundary (FB) adversarial example detection.
Our new method outperforms previous methods by 3.6 and 6.0 AUC points under each scenario.
arXiv Detail & Related papers (2022-10-22T09:11:12Z)
- Bridge the Gap Between CV and NLP! A Gradient-based Textual Adversarial Attack Framework [17.17479625646699]
We propose a unified framework to craft textual adversarial samples.
In this paper, we instantiate our framework with an attack algorithm named Textual Projected Gradient Descent (T-PGD)
arXiv Detail & Related papers (2021-10-28T17:31:51Z)
- Learning to Separate Clusters of Adversarial Representations for Robust Adversarial Detection [50.03939695025513]
We propose a new probabilistic adversarial detector motivated by a recently introduced non-robust feature.
In this paper, we consider the non-robust features as a common property of adversarial examples, and we deduce it is possible to find a cluster in representation space corresponding to the property.
This idea leads us to estimate the probability distribution of adversarial representations in a separate cluster, and to leverage that distribution for a likelihood-based adversarial detector.
arXiv Detail & Related papers (2020-12-07T07:21:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.