Toward Stronger Textual Attack Detectors
- URL: http://arxiv.org/abs/2310.14001v1
- Date: Sat, 21 Oct 2023 13:01:29 GMT
- Title: Toward Stronger Textual Attack Detectors
- Authors: Pierre Colombo, Marine Picot, Nathan Noiry, Guillaume Staerman, Pablo
Piantanida
- Abstract summary: LAROUSSE is a new framework to detect textual adversarial attacks.
STAKEOUT is a new benchmark composed of nine popular attack methods, three datasets, and two pre-trained models.
- Score: 43.543044512474886
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The landscape of available textual adversarial attacks keeps growing, posing
severe threats and raising concerns regarding the deep NLP system's integrity.
However, the crucial problem of defending against malicious attacks has only
drawn the attention of the NLP community. The latter is nonetheless
instrumental in developing robust and trustworthy systems. This paper makes two
important contributions in this line of search: (i) we introduce LAROUSSE, a
new framework to detect textual adversarial attacks and (ii) we introduce
STAKEOUT, a new benchmark composed of nine popular attack methods, three
datasets, and two pre-trained models. LAROUSSE is ready-to-use in production as
it is unsupervised, hyperparameter-free, and non-differentiable, protecting it
against gradient-based methods. Our new benchmark STAKEOUT allows for a robust
evaluation framework: we conduct extensive numerical experiments which
demonstrate that LAROUSSE outperforms previous methods, and which allows to
identify interesting factors of detection rate variations.
Related papers
- Unified Prompt Attack Against Text-to-Image Generation Models [30.24530622359188]
We propose UPAM, a framework to evaluate the robustness of T2I models from an attack perspective.
UPAM unifies the attack on both textual and visual defenses.
It also enables gradient-based optimization, overcoming reliance on enumeration for improved efficiency and effectiveness.
arXiv Detail & Related papers (2025-02-23T03:36:18Z) - Robustness of Large Language Models Against Adversarial Attacks [5.312946761836463]
We present a comprehensive study on the robustness of GPT LLM family.
We employ two distinct evaluation methods to assess their resilience.
Our experiments reveal significant variations in the robustness of these models, demonstrating their varying degrees of vulnerability to both character-level and semantic-level adversarial attacks.
arXiv Detail & Related papers (2024-12-22T13:21:15Z) - The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense [56.32083100401117]
The vulnerability of Vision Large Language Models (VLLMs) to jailbreak attacks appears as no surprise.
Recent defense mechanisms against these attacks have reached near-saturation performance on benchmark evaluations.
arXiv Detail & Related papers (2024-11-13T07:57:19Z) - Turning Generative Models Degenerate: The Power of Data Poisoning Attacks [10.36389246679405]
Malicious actors can introduce backdoors through poisoning attacks to generate undesirable outputs.
We conduct an investigation of various poisoning techniques targeting the large language models' fine-tuning phase via the Efficient Fine-Tuning (PEFT) method.
Our study presents the first systematic approach to understanding poisoning attacks targeting NLG tasks during fine-tuning via PEFT.
arXiv Detail & Related papers (2024-07-17T03:02:15Z) - Mutual-modality Adversarial Attack with Semantic Perturbation [81.66172089175346]
We propose a novel approach that generates adversarial attacks in a mutual-modality optimization scheme.
Our approach outperforms state-of-the-art attack methods and can be readily deployed as a plug-and-play solution.
arXiv Detail & Related papers (2023-12-20T05:06:01Z) - Adversarial Attacks and Defenses in Large Language Models: Old and New
Threats [21.222184849635823]
Flawed robustness evaluations slow research and provide a false sense of security.
We provide a first set of prerequisites to improve the robustness assessment of new approaches.
We demonstrate on a recently proposed defense that it is easy to overestimate the robustness of a new approach.
arXiv Detail & Related papers (2023-10-30T17:01:02Z) - Interpretability is a Kind of Safety: An Interpreter-based Ensemble for
Adversary Defense [28.398901783858005]
We propose an interpreter-based ensemble framework called X-Ensemble for robust defense adversary.
X-Ensemble employs the Random Forests (RF) model to combine sub-detectors into an ensemble detector for adversarial hybrid attacks defense.
arXiv Detail & Related papers (2023-04-14T04:32:06Z) - Revisiting DeepFool: generalization and improvement [17.714671419826715]
We introduce a new family of adversarial attacks that strike a balance between effectiveness and computational efficiency.
Our proposed attacks are also suitable for evaluating the robustness of large models.
arXiv Detail & Related papers (2023-03-22T11:49:35Z) - On the Adversarial Robustness of Camera-based 3D Object Detection [21.091078268929667]
We investigate the robustness of leading camera-based 3D object detection approaches under various adversarial conditions.
We find that bird's-eye-view-based representations exhibit stronger robustness against localization attacks.
depth-estimation-free approaches have the potential to show stronger robustness.
incorporating multi-frame benign inputs can effectively mitigate adversarial attacks.
arXiv Detail & Related papers (2023-01-25T18:59:15Z) - ADDMU: Detection of Far-Boundary Adversarial Examples with Data and
Model Uncertainty Estimation [125.52743832477404]
Adversarial Examples Detection (AED) is a crucial defense technique against adversarial attacks.
We propose a new technique, textbfADDMU, which combines two types of uncertainty estimation for both regular and FB adversarial example detection.
Our new method outperforms previous methods by 3.6 and 6.0 emphAUC points under each scenario.
arXiv Detail & Related papers (2022-10-22T09:11:12Z) - Rethinking Textual Adversarial Defense for Pre-trained Language Models [79.18455635071817]
A literature review shows that pre-trained language models (PrLMs) are vulnerable to adversarial attacks.
We propose a novel metric (Degree of Anomaly) to enable current adversarial attack approaches to generate more natural and imperceptible adversarial examples.
We show that our universal defense framework achieves comparable or even higher after-attack accuracy with other specific defenses.
arXiv Detail & Related papers (2022-07-21T07:51:45Z) - A Unified Evaluation of Textual Backdoor Learning: Frameworks and
Benchmarks [72.7373468905418]
We develop an open-source toolkit OpenBackdoor to foster the implementations and evaluations of textual backdoor learning.
We also propose CUBE, a simple yet strong clustering-based defense baseline.
arXiv Detail & Related papers (2022-06-17T02:29:23Z) - Reliable evaluation of adversarial robustness with an ensemble of
diverse parameter-free attacks [65.20660287833537]
In this paper we propose two extensions of the PGD-attack overcoming failures due to suboptimal step size and problems of the objective function.
We then combine our novel attacks with two complementary existing ones to form a parameter-free, computationally affordable and user-independent ensemble of attacks to test adversarial robustness.
arXiv Detail & Related papers (2020-03-03T18:15:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.