Confusing and Detecting ML Adversarial Attacks with Injected Attractors
- URL: http://arxiv.org/abs/2003.02732v4
- Date: Mon, 8 Mar 2021 07:56:30 GMT
- Title: Confusing and Detecting ML Adversarial Attacks with Injected Attractors
- Authors: Jiyi Zhang, Ee-Chien Chang, Hwee Kuan Lee
- Abstract summary: A machine learning adversarial attack finds adversarial samples of a victim model ${\mathcal M}$ by following the gradient of some attack objective function.
We take the proactive approach of modifying those functions with the goal of misleading the attacks to some local minima.
We observe that decoders of watermarking schemes exhibit properties of attractors and give a generic method that injects attractors into the victim model.
- Score: 13.939695351344538
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many machine learning adversarial attacks find adversarial samples of a
victim model ${\mathcal M}$ by following the gradient of some attack objective
functions, either explicitly or implicitly. To confuse and detect such attacks,
we take a proactive approach that modifies those functions with the goal of
misleading the attacks to some local minima, or to some designated regions
that can be easily picked up by an analyzer. To achieve this goal, we propose
adding a large number of artifacts, which we call $attractors$, onto the
otherwise smooth function. An attractor is a point in the input space, where
samples in its neighborhood have gradient pointing toward it. We observe that
decoders of watermarking schemes exhibit properties of attractors and give a
generic method that injects attractors from a watermark decoder into the victim
model ${\mathcal M}$. This principled approach allows us to leverage known
watermarking schemes for scalability and robustness, and provides explainability
of the outcomes. Experimental studies show that our method has competitive
performance. For instance, for un-targeted attacks on the CIFAR-10 dataset, we
can reduce the overall attack success rate of DeepFool to 1.9%, whereas the
known defenses LID, FS, and MagNet reduce the rate to 90.8%, 98.5%, and 78.5%,
respectively.
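To make the attractor idea concrete, here is a minimal toy sketch. It is an illustration only, not the authors' construction: the paper derives attractors from a watermark decoder, whereas this sketch injects a hypothetical grid of attractor points into an otherwise smooth attack objective as narrow Gaussian wells, so that a gradient-following attack is pulled into a nearby attractor instead of reaching the objective's true minimum. All names, constants, and the Gaussian-well construction below are illustrative assumptions.

```python
import numpy as np

# Toy 2-D illustration of injected attractors (a hypothetical sketch; the paper
# builds attractors from a watermark decoder, not from Gaussian wells).
#
# The attacker follows the gradient of an attack objective whose minimum at
# TARGET corresponds to a successful adversarial sample.  Injecting attractors
# adds narrow wells whose gradients, in each well's neighborhood, point toward
# the well centre, so a gradient-following attack gets trapped.

TARGET = np.array([3.4, 2.6])          # minimum of the smooth attack objective

def attack_grad(x):
    """Gradient of the smooth attack objective 0.5 * ||x - TARGET||^2."""
    return x - TARGET

# Injected attractors on an integer grid; DEPTH and SIGMA shape the wells.
ATTRACTORS = np.array([(i, j) for i in range(-4, 5) for j in range(-4, 5)], float)
DEPTH, SIGMA = 10.0, 0.3

def injected_grad(x):
    """Gradient of the attack objective after subtracting a Gaussian well at each attractor."""
    diff = x - ATTRACTORS                                    # shape (N, 2)
    w = DEPTH * np.exp(-np.sum(diff**2, 1) / (2 * SIGMA**2)) / SIGMA**2
    return attack_grad(x) + w @ diff                         # wells pull descent toward their centres

def gradient_descent(grad, x0, lr=0.01, steps=2000):
    x = np.array(x0, float)
    for _ in range(steps):
        x -= lr * grad(x)
    return x

start = (-2.3, -1.7)
print("without attractors:", gradient_descent(attack_grad, start))    # converges to TARGET
print("with attractors:   ", gradient_descent(injected_grad, start))  # pulled into a nearby attractor instead
```

In the paper the injected structure is not a hand-placed grid of wells: it is inherited from a watermark decoder, which is what makes the construction scalable and lets an analyzer pick up samples that have been misled into the designated regions.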
Related papers
- AdvQDet: Detecting Query-Based Adversarial Attacks with Adversarial Contrastive Prompt Tuning [93.77763753231338]
Adversarial Contrastive Prompt Tuning (ACPT) is proposed to fine-tune the CLIP image encoder to extract similar embeddings for any two intermediate adversarial queries.
We show that ACPT can detect 7 state-of-the-art query-based attacks with a $>99\%$ detection rate within 5 shots.
We also show that ACPT is robust to 3 types of adaptive attacks.
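As a rough illustration of the detection side of this idea, the sketch below flags a client whose successive queries embed very close together, which is the signature of an iterative query-based attack refining a single image. The `embed` argument stands in for a contrastively tuned image encoder such as the one ACPT trains; the class name, threshold, and window size are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

class QuerySimilarityDetector:
    """Flag a client whose recent queries embed too close to each other,
    the signature of an iterative query-based attack refining one image."""

    def __init__(self, embed, threshold=0.9, window=5):
        self.embed = embed            # e.g. a contrastively tuned image encoder (assumed)
        self.threshold = threshold    # similarity above which two queries look related
        self.window = window          # number of recent queries kept per client
        self.history = {}             # client id -> list of recent embeddings

    def is_attack(self, client_id, image):
        z = self.embed(image)
        recent = self.history.setdefault(client_id, [])
        flagged = any(cosine_sim(z, past) > self.threshold for past in recent)
        recent.append(z)
        del recent[:-self.window]     # keep only the last `window` embeddings
        return flagged

# Hypothetical usage, with `clip_encode` standing in for the tuned image encoder:
# detector = QuerySimilarityDetector(embed=clip_encode)
# if detector.is_attack(client_id, query_image): reject_or_log(query_image)
```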
arXiv Detail & Related papers (2024-08-04T09:53:50Z) - DALA: A Distribution-Aware LoRA-Based Adversarial Attack against
Language Models [64.79319733514266]
Adversarial attacks can introduce subtle perturbations to input data.
Recent attack methods can achieve a relatively high attack success rate (ASR).
We propose a Distribution-Aware LoRA-based Adversarial Attack (DALA) method.
arXiv Detail & Related papers (2023-11-14T23:43:47Z) - DiffAttack: Evasion Attacks Against Diffusion-Based Adversarial
Purification [63.65630243675792]
Diffusion-based purification defenses leverage diffusion models to remove crafted perturbations of adversarial examples.
Recent studies show that even advanced attacks cannot break such defenses effectively.
We propose a unified framework DiffAttack to perform effective and efficient attacks against diffusion-based purification defenses.
arXiv Detail & Related papers (2023-10-27T15:17:50Z) - PRAT: PRofiling Adversarial aTtacks [52.693011665938734]
We introduce the novel problem of PRofiling Adversarial aTtacks (PRAT).
Given an adversarial example, the objective of PRAT is to identify the attack used to generate it.
We use AID to devise a novel framework for the PRAT objective.
arXiv Detail & Related papers (2023-09-20T07:42:51Z) - Object-fabrication Targeted Attack for Object Detection [54.10697546734503]
Adversarial attacks for object detection include targeted and untargeted attacks.
A new object-fabrication targeted attack mode can mislead detectors to fabricate extra false objects with specific target labels.
arXiv Detail & Related papers (2022-12-13T08:42:39Z) - Unreasonable Effectiveness of Last Hidden Layer Activations [0.5156484100374058]
We show that using some widely known activation functions with high temperature values in the output layer of the model has the effect of zeroing out the gradients for both targeted and untargeted attack cases.
We experimentally verify the efficacy of our approach on the MNIST (Digit) and CIFAR-10 datasets.
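A small, hedged demonstration of the saturation effect described above: multiplying the logits by a large temperature before the output softmax drives the output to a one-hot vector, and the input gradient that attacks rely on becomes numerically negligible. The fixed weights and the convention that the temperature multiplies the logits are assumptions made for illustration; the paper's exact formulation may differ.

```python
import torch

# A tiny fixed linear "model" so the numbers are deterministic.  With a large
# temperature multiplying the logits, the softmax saturates to a one-hot vector
# and the gradient of the prediction with respect to the input (the signal every
# gradient-based attack relies on) collapses to numerically zero.
W = torch.tensor([[ 1.0, -0.5,  0.2],
                  [ 0.3,  0.8, -0.4],
                  [-0.6,  0.1,  0.9],
                  [ 0.2,  0.2,  0.2]])

def confidence_of_prediction(x, temperature):
    # Saturating output layer: softmax over temperature-scaled logits.
    p = torch.softmax((x @ W) * temperature, dim=-1)
    return p.max()                      # confidence in the predicted class

for temperature in (1.0, 100.0):
    x = torch.tensor([0.5, -1.0, 0.8, 0.3], requires_grad=True)
    confidence_of_prediction(x, temperature).backward()
    print(f"temperature={temperature:5.1f}  "
          f"max |d(confidence)/dx| = {x.grad.abs().max().item():.2e}")
```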
arXiv Detail & Related papers (2022-02-15T12:02:59Z) - Constrained Gradient Descent: A Powerful and Principled Evasion Attack
Against Neural Networks [19.443306494201334]
We introduce several innovations that make white-box targeted attacks follow the intuition of the attacker's goal.
First, we propose a new loss function that explicitly captures the goal of targeted attacks.
Second, we propose a new attack method that uses a further developed version of our loss function capturing both the misclassification objective and the $L_\infty$ distance limit.
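One plausible way to fold both terms into a single loss that plain gradient descent can optimize is sketched below: a CW-style targeted margin plus a penalty on the part of the perturbation that exceeds the $L_\infty$ budget. The function name, the margin form, and the penalty weight are illustrative assumptions, not necessarily the loss proposed in the paper.

```python
import torch
import torch.nn.functional as F

def cgd_style_loss(model, x_adv, x_orig, target_class, eps, penalty_weight=10.0):
    """Illustrative single loss combining a targeted-misclassification margin with a
    penalty on violating the L_inf budget, so plain gradient descent can optimise both.
    (A sketch of the idea only; the exact loss in the paper may differ.)"""
    logits = model(x_adv)
    # Targeted objective: push the target class logit above the runner-up.
    target_logit = logits.gather(1, target_class.unsqueeze(1)).squeeze(1)
    other_logit = logits.masked_fill(
        F.one_hot(target_class, logits.size(1)).bool(), float("-inf")
    ).max(dim=1).values
    misclassification = F.relu(other_logit - target_logit)
    # Constraint term: penalise only the part of the perturbation outside the L_inf ball.
    violation = F.relu((x_adv - x_orig).abs() - eps)
    return (misclassification + penalty_weight * violation.flatten(1).sum(dim=1)).mean()

# Hypothetical usage with any differentiable classifier `model`:
# x_adv = x_orig.clone().requires_grad_(True)
# loss = cgd_style_loss(model, x_adv, x_orig, target_class, eps=8/255)
# loss.backward(); x_adv.data -= step_size * x_adv.grad.sign()
```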
arXiv Detail & Related papers (2021-12-28T17:36:58Z) - RamBoAttack: A Robust Query Efficient Deep Neural Network Decision
Exploit [9.93052896330371]
We develop a robust, query-efficient attack capable of avoiding entrapment in a local minimum and misdirection from noisy gradients.
RamBoAttack is more robust to the different sample inputs available to an adversary and to the targeted class.
arXiv Detail & Related papers (2021-12-10T01:25:24Z) - Detection of Adversarial Supports in Few-shot Classifiers Using Feature
Preserving Autoencoders and Self-Similarity [89.26308254637702]
We propose a detection strategy to highlight adversarial support sets.
We make use of feature-preserving autoencoder filtering and the concept of self-similarity of a support set to perform this detection.
Our method is attack-agnostic and, to the best of our knowledge, the first to explore detection for few-shot classifiers.
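A rough sketch of the self-similarity signal, assuming pre-trained `autoencoder` and `embed` components are available: the support images are first passed through the feature-preserving autoencoder, and the mean pairwise cosine similarity of their features is compared against a threshold calibrated on clean support sets. The function name and the thresholding scheme are illustrative assumptions.

```python
import torch

def support_set_self_similarity(embed, autoencoder, support_images):
    """Mean pairwise cosine similarity of a (filtered) support set's features.
    A low value suggests the supports of this class are not mutually consistent,
    which is the self-similarity signal used to flag adversarial supports.
    `embed` and `autoencoder` are assumed, pre-trained components."""
    filtered = autoencoder(support_images)          # feature-preserving reconstruction
    feats = torch.nn.functional.normalize(embed(filtered), dim=1)
    sim = feats @ feats.t()                         # pairwise cosine similarities
    n = sim.size(0)
    off_diag = sim[~torch.eye(n, dtype=torch.bool)]
    return off_diag.mean()

# Hypothetical usage: flag a support class whose self-similarity drops below a
# threshold chosen on clean data.
# if support_set_self_similarity(embed, autoencoder, supports) < clean_threshold:
#     print("support set looks adversarial")
```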
arXiv Detail & Related papers (2020-12-09T14:13:41Z) - Detection of Iterative Adversarial Attacks via Counter Attack [4.549831511476249]
Deep neural networks (DNNs) have proven to be powerful tools for processing unstructured data.
For high-dimensional data, like images, they are inherently vulnerable to adversarial attacks.
In this work, we outline a mathematical proof that the Carlini-Wagner (CW) attack can be used as a detector itself.
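The detection idea can be sketched as follows, with the caveat that the snippet uses a simple PGD-style counter-attack rather than the CW attack the paper analyses: an input that is already adversarial typically sits close to a decision boundary, so the perturbation a counter-attack needs to flip its prediction is unusually small, and thresholding that perturbation size yields a detector. Names, step sizes, and the thresholding are illustrative assumptions.

```python
import torch

def counter_attack_score(model, x, steps=50, step_size=1e-2):
    """Size of the smallest perturbation found by a simple gradient counter-attack
    that flips the model's prediction on x.  Adversarial inputs tend to sit close
    to a decision boundary, so this score is small for them; thresholding it gives
    a detector.  (Illustrative PGD-style counter-attack, not the CW attack.)"""
    x = x.detach()
    label = model(x).argmax(dim=1)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = torch.nn.functional.cross_entropy(model(x + delta), label)
        loss.backward()
        with torch.no_grad():
            delta += step_size * delta.grad.sign()   # move away from the current prediction
            if (model(x + delta).argmax(dim=1) != label).all():
                break
        delta.grad.zero_()
    return delta.detach().flatten(1).norm(dim=1)     # per-sample perturbation size

# Hypothetical usage: flag inputs whose counter-attack score falls below a
# threshold calibrated on clean data.
# is_adversarial = counter_attack_score(model, batch) < clean_threshold
```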
arXiv Detail & Related papers (2020-09-23T21:54:36Z) - Minimax Defense against Gradient-based Adversarial Attacks [2.4403071643841243]
We introduce a novel approach that uses minimax optimization to foil gradient-based adversarial attacks.
Our minimax defense achieves 98.07% (MNIST-default 98.93%), 73.90% (CIFAR-10-default 83.14%), and 94.54% (TRAFFIC-default 96.97%).
Our Minimax adversarial approach presents a significant shift in defense strategy for neural network classifiers.
arXiv Detail & Related papers (2020-02-04T12:33:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.