Related papers: TextDecepter: Hard Label Black Box Attack on Text Classifiers

TextDecepter: Hard Label Black Box Attack on Text Classifiers

URL: http://arxiv.org/abs/2008.06860v6
Date: Mon, 28 Dec 2020 00:23:08 GMT
Title: TextDecepter: Hard Label Black Box Attack on Text Classifiers
Authors: Sachin Saxena
Abstract summary: We present a novel approach for hard-label black-box attacks against Natural Language Processing (NLP) classifiers. Such an attack scenario applies to real-world black-box models being used for security-sensitive applications such as sentiment analysis and toxic content detection.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Machine learning has been proven to be susceptible to carefully crafted samples, known as adversarial examples. The generation of these adversarial examples helps to make the models more robust and gives us an insight into the underlying decision-making of these models. Over the years, researchers have successfully attacked image classifiers in both, white and black-box settings. However, these methods are not directly applicable to texts as text data is discrete. In recent years, research on crafting adversarial examples against textual applications has been on the rise. In this paper, we present a novel approach for hard-label black-box attacks against Natural Language Processing (NLP) classifiers, where no model information is disclosed, and an attacker can only query the model to get a final decision of the classifier, without confidence scores of the classes involved. Such an attack scenario applies to real-world black-box models being used for security-sensitive applications such as sentiment analysis and toxic content detection.

Related papers

The Surprising Effectiveness of Membership Inference with Simple N-Gram Coverage [71.8564105095189]
We introduce N-Gram Coverage Attack, a membership inference attack that relies solely on text outputs from the target model.<n>We first demonstrate on a diverse set of existing benchmarks that N-Gram Coverage Attack outperforms other black-box methods.<n>We find that more recent models, such as GPT-4o, exhibit increased robustness to membership inference.
arXiv Detail & Related papers (2025-08-13T08:35:16Z)
On Transfer-based Universal Attacks in Pure Black-box Setting [94.92884394009288]
We study the role of prior knowledge of the target model data and number of classes in attack performance. We also provide several interesting insights based on our analysis, and demonstrate that priors cause overestimation in transferability scores.
arXiv Detail & Related papers (2025-04-11T10:41:20Z)
Web Artifact Attacks Disrupt Vision Language Models [61.59021920232986]
Vision-language models (VLMs) are trained on large-scale, lightly curated web datasets. They learn unintended correlations between semantic concepts and unrelated visual signals. Prior work has weaponized these correlations as an attack vector to manipulate model predictions. We introduce artifact-based attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements.
arXiv Detail & Related papers (2025-03-17T18:59:29Z)
On Adversarial Examples for Text Classification by Perturbing Latent Representations [0.0]
We show that deep learning is vulnerable to adversarial examples in text classification. This weakness indicates that deep learning is not very robust. We create a framework that measures the robustness of a text classifier by using the gradients of the classifier.
arXiv Detail & Related papers (2024-05-06T18:45:18Z)
Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else. It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z)
Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models. We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks. Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
Query Efficient Cross-Dataset Transferable Black-Box Attack on Action Recognition [99.29804193431823]
Black-box adversarial attacks present a realistic threat to action recognition systems. We propose a new attack on action recognition that addresses these shortcomings by generating perturbations. Our method achieves 8% and higher 12% deception rates compared to state-of-the-art query-based and transfer-based attacks.
arXiv Detail & Related papers (2022-11-23T17:47:49Z)
Towards A Conceptually Simple Defensive Approach for Few-shot classifiers Against Adversarial Support Samples [107.38834819682315]
We study a conceptually simple approach to defend few-shot classifiers against adversarial attacks. We propose a simple attack-agnostic detection method, using the concept of self-similarity and filtering. Our evaluation on the miniImagenet (MI) and CUB datasets exhibit good attack detection performance.
arXiv Detail & Related papers (2021-10-24T05:46:03Z)
A Differentiable Language Model Adversarial Attack on Text Classifiers [10.658675415759697]
We propose a new black-box sentence-level attack for natural language processing. Our method fine-tunes a pre-trained language model to generate adversarial examples. We show that the proposed attack outperforms competitors on a diverse set of NLP problems for both computed metrics and human evaluation.
arXiv Detail & Related papers (2021-07-23T14:43:13Z)
ExAD: An Ensemble Approach for Explanation-based Adversarial Detection [17.455233006559734]
We propose ExAD, a framework to detect adversarial examples using an ensemble of explanation techniques. We evaluate our approach using six state-of-the-art adversarial attacks on three image datasets.
arXiv Detail & Related papers (2021-03-22T00:53:07Z)
Explain2Attack: Text Adversarial Attacks via Cross-Domain Interpretability [18.92690624514601]
Research has shown that down-stream models can be easily fooled with adversarial inputs that look like the training data, but slightly perturbed, in a way imperceptible to humans. In this paper, we propose Explain2Attack, a black-box adversarial attack on text classification task. We show that our framework either achieves or out-performs attack rates of the state-of-the-art models, yet with lower queries cost and higher efficiency.
arXiv Detail & Related papers (2020-10-14T04:56:41Z)
Two Sides of the Same Coin: White-box and Black-box Attacks for Transfer Learning [60.784641458579124]
We show that fine-tuning effectively enhances model robustness under white-box FGSM attacks. We also propose a black-box attack method for transfer learning models which attacks the target model with the adversarial examples produced by its source model. To systematically measure the effect of both white-box and black-box attacks, we propose a new metric to evaluate how transferable are the adversarial examples produced by a source model to a target model.
arXiv Detail & Related papers (2020-08-25T15:04:32Z)
Spanning Attack: Reinforce Black-box Attacks with Unlabeled Data [96.92837098305898]
Black-box attacks aim to craft adversarial perturbations by querying input-output pairs of machine learning models. Black-box attacks often suffer from the issue of query inefficiency due to the high dimensionality of the input space. We propose a novel technique called the spanning attack, which constrains adversarial perturbations in a low-dimensional subspace via spanning an auxiliary unlabeled dataset.
arXiv Detail & Related papers (2020-05-11T05:57:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.