Scoring Black-Box Models for Adversarial Robustness
- URL: http://arxiv.org/abs/2210.17140v1
- Date: Mon, 31 Oct 2022 08:41:44 GMT
- Title: Scoring Black-Box Models for Adversarial Robustness
- Authors: Jian Vora, Pranay Reddy Samala
- Abstract summary: The robustness of models to adversarial attacks is typically analyzed by constructing adversarial inputs and testing the model on them.
We propose a simple scoring method for black-box models which indicates their robustness to adversarial input.
- Score: 4.416484585765028
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep neural networks are susceptible to adversarial inputs, and various
methods have been proposed to defend these models against adversarial attacks
under different perturbation models. The robustness of a model to adversarial
attacks is typically analyzed by first constructing adversarial inputs for the
model and then testing the model's performance on them. Most of these attacks
require white-box access to the model, need access to data labels, and finding
adversarial inputs can be computationally expensive. We propose a simple
scoring method for black-box models which indicates their robustness to
adversarial inputs. We show that adversarially more robust models have a
smaller $l_1$-norm of LIME weights and sharper explanations.
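The abstract does not spell out how the score is aggregated, but the idea maps directly onto the lime package. A minimal sketch, assuming a scikit-learn-style predict_proba on tabular data and averaging the $l_1$-norm of LIME weights over a small probe set (the aggregation, probe-set size, and data modality are our assumptions, not the paper's stated setup):

```python
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

def lime_l1_score(predict_proba, X_background, X_probe, num_features=10):
    """Mean l1-norm of LIME explanation weights over a probe set.
    Only predict_proba is called, so the model stays a black box."""
    explainer = LimeTabularExplainer(X_background, mode="classification")
    norms = []
    for x in X_probe:
        exp = explainer.explain_instance(x, predict_proba,
                                         num_features=num_features)
        # as_list() yields (feature, weight) pairs for the explained
        # label (LIME's default label=1, i.e. binary classification).
        norms.append(sum(abs(w) for _, w in exp.as_list()))
    return float(np.mean(norms))

# Hypothetical usage with two fitted classifiers clf_a and clf_b:
# score_a = lime_l1_score(clf_a.predict_proba, X_train, X_test[:100])
# score_b = lime_l1_score(clf_b.predict_proba, X_train, X_test[:100])
```

Under the paper's claim, when two black-box models are compared on the same probe set, the one with the smaller score should be the more adversarially robust.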
Related papers
- Understanding the Robustness of Randomized Feature Defense Against
Query-Based Adversarial Attacks [23.010308600769545]
Deep neural networks are vulnerable to adversarial examples: samples close to the original image that nevertheless make the model misclassify.
We propose a simple and lightweight defense against black-box attacks by adding random noise to hidden features at intermediate layers of the model at inference time.
Our method effectively enhances the model's resilience against both score-based and decision-based black-box attacks.
arXiv Detail & Related papers (2023-10-01T03:53:23Z)
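The mechanism this entry describes is easy to sketch in PyTorch with a forward hook; the noise scale and the choice of layer below are illustrative, not the paper's settings:

```python
import torch

def add_feature_noise(layer, sigma=0.05):
    """Perturb a layer's output with Gaussian noise at inference time,
    the core idea of a randomized feature defense (sigma is illustrative)."""
    def hook(module, inputs, output):
        return output + sigma * torch.randn_like(output)
    return layer.register_forward_hook(hook)

# Hypothetical usage on a torchvision ResNet:
# handle = add_feature_noise(model.layer3)   # enable the defense
# handle.remove()                            # disable it again
```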
- Are aligned neural networks adversarially aligned? [93.91072860401856]
Adversarial users can construct inputs that circumvent attempts at alignment.
We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models.
We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.
arXiv Detail & Related papers (2023-06-26T17:18:44Z)
- Minimizing Maximum Model Discrepancy for Transferable Black-box Targeted Attacks [30.863450425927613]
We study the black-box targeted attack problem from the model discrepancy perspective.
We present a generalization error bound for black-box targeted attacks, which provides a rigorous theoretical guarantee on the success of the attack.
We derive a new algorithm for black-box targeted attacks based on our theoretical analysis.
arXiv Detail & Related papers (2022-12-18T08:19:08Z)
- Training Meta-Surrogate Model for Transferable Adversarial Attack [98.13178217557193]
We consider adversarial attacks on a black-box model when no queries are allowed.
In this setting, many methods directly attack surrogate models and transfer the obtained adversarial examples to fool the target model.
We show that we can obtain a Meta-Surrogate Model (MSM) such that attacks on this model transfer more easily to other models.
arXiv Detail & Related papers (2021-09-05T03:27:46Z)
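The meta-training of the MSM itself is beyond a short sketch, but the transfer setting it improves on is simple. A minimal sketch of the baseline this entry describes, assuming a white-box surrogate and hard-label access to the target: FGSM examples crafted on the surrogate, evaluated on the target:

```python
import torch
import torch.nn.functional as F

def transfer_fgsm(surrogate, target, x, y, eps=8 / 255):
    """Craft FGSM adversarial examples on a white-box surrogate and
    measure how often they transfer to (fool) the black-box target."""
    x = x.clone().requires_grad_(True)
    F.cross_entropy(surrogate(x), y).backward()
    x_adv = (x + eps * x.grad.sign()).clamp(0, 1).detach()
    with torch.no_grad():
        fooled = (target(x_adv).argmax(dim=1) != y).float().mean().item()
    return x_adv, fooled
```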
- Voting based ensemble improves robustness of defensive models [82.70303474487105]
We study whether it is possible to create an ensemble to further improve robustness.
By ensembling several state-of-the-art pre-trained defense models, our method can achieve a 59.8% robust accuracy.
arXiv Detail & Related papers (2020-11-28T00:08:45Z)
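The abstract leaves the voting rule unspecified; a minimal sketch, assuming hard-label majority voting over the ensemble:

```python
import torch

def majority_vote(models, x):
    """Hard-label majority vote over an ensemble of (defense) models."""
    with torch.no_grad():
        votes = torch.stack([m(x).argmax(dim=1) for m in models])  # (k, batch)
    return votes.mode(dim=0).values  # most common label per example
```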
- Explain2Attack: Text Adversarial Attacks via Cross-Domain Interpretability [18.92690624514601]
Research has shown that downstream models can be easily fooled by adversarial inputs that resemble the training data but are slightly perturbed in a way imperceptible to humans.
In this paper, we propose Explain2Attack, a black-box adversarial attack on text classification.
We show that our framework matches or outperforms the attack rates of state-of-the-art models, with lower query cost and higher efficiency.
arXiv Detail & Related papers (2020-10-14T04:56:41Z)
- Two Sides of the Same Coin: White-box and Black-box Attacks for Transfer Learning [60.784641458579124]
We show that fine-tuning effectively enhances model robustness under white-box FGSM attacks.
We also propose a black-box attack method for transfer learning models which attacks the target model with the adversarial examples produced by its source model.
To systematically measure the effect of both white-box and black-box attacks, we propose a new metric to evaluate how transferable the adversarial examples produced by a source model are to a target model.
arXiv Detail & Related papers (2020-08-25T15:04:32Z)
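The abstract does not define the metric itself; as a stand-in, the sketch below measures conditional transfer success: among adversarial examples that fool the source model, the fraction that also fool the target.

```python
import torch

def transfer_rate(source, target, x_adv, y):
    """Among adversarial examples that fool the source, the fraction
    that also fool the target (a stand-in for the paper's metric)."""
    with torch.no_grad():
        fools_src = source(x_adv).argmax(dim=1) != y
        fools_tgt = target(x_adv).argmax(dim=1) != y
    n_src = int(fools_src.sum())
    return float((fools_src & fools_tgt).sum()) / max(n_src, 1)
```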
- Orthogonal Deep Models As Defense Against Black-Box Attacks [71.23669614195195]
We study the inherent weakness of deep models in black-box settings where the attacker may develop the attack using a model similar to the targeted model.
We introduce a novel gradient regularization scheme that encourages the internal representation of a deep model to be orthogonal to that of another.
We verify the effectiveness of our technique on a variety of large-scale models.
arXiv Detail & Related papers (2020-06-26T08:29:05Z)
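The one-line summary suggests a penalty that decorrelates the representations of two models; a minimal sketch of that reading only (the paper's actual regularizer acts on gradients and differs in detail):

```python
import torch.nn.functional as F

def orthogonality_penalty(feats_a, feats_b):
    """Drive per-example representations of two models toward
    orthogonality by penalizing squared cosine similarity."""
    cos = F.cosine_similarity(feats_a.flatten(1), feats_b.flatten(1), dim=1)
    return (cos ** 2).mean()

# Hypothetical training loss: task_loss + lam * orthogonality_penalty(fa, fb)
```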
- Luring of transferable adversarial perturbations in the black-box paradigm [0.0]
We present a new approach to improve the robustness of a model against black-box transfer attacks.
A removable additional neural network is included in the target model, and is designed to induce the luring effect.
Our deception-based method only needs to have access to the predictions of the target model and does not require a labeled data set.
arXiv Detail & Related papers (2020-04-10T06:48:36Z)
- DaST: Data-free Substitute Training for Adversarial Attacks [55.76371274622313]
We propose a data-free substitute training method (DaST) to obtain substitute models for adversarial black-box attacks.
To achieve this, DaST utilizes specially designed generative adversarial networks (GANs) to train the substitute models.
Experiments demonstrate that the substitute models achieve competitive performance compared with the baseline models.
arXiv Detail & Related papers (2020-03-28T04:28:13Z)
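A compressed sketch of the DaST-style loop as the entry describes it: a generator synthesizes queries, the black-box target labels them, the substitute imitates those labels, and the generator seeks points of disagreement (the loss shapes here are illustrative, not the paper's exact objectives):

```python
import torch
import torch.nn.functional as F

def dast_step(generator, substitute, target_query, g_opt, s_opt,
              z_dim=100, batch=64):
    """One simplified DaST-style step (data-free substitute training)."""
    z = torch.randn(batch, z_dim)
    x = generator(z)
    with torch.no_grad():
        y = target_query(x).argmax(dim=1)  # hard labels from the black box

    # Substitute imitates the target on the synthetic queries.
    s_opt.zero_grad()
    F.cross_entropy(substitute(x.detach()), y).backward()
    s_opt.step()

    # Generator hunts for inputs where the substitute still disagrees.
    g_opt.zero_grad()
    (-F.cross_entropy(substitute(x), y)).backward()
    g_opt.step()
```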
This list is automatically generated from the titles and abstracts of the papers on this site.