Exploiting Explainability to Design Adversarial Attacks and Evaluate
Attack Resilience in Hate-Speech Detection Models
- URL: http://arxiv.org/abs/2305.18585v1
- Date: Mon, 29 May 2023 19:59:40 GMT
- Title: Exploiting Explainability to Design Adversarial Attacks and Evaluate
Attack Resilience in Hate-Speech Detection Models
- Authors: Pranath Reddy Kumbam, Sohaib Uddin Syed, Prashanth Thamminedi, Suhas
Harish, Ian Perera, and Bonnie J. Dorr
- Abstract summary: We present an analysis of adversarial robustness exhibited by various hate-speech detection models.
We devise and execute targeted attacks on the text by leveraging the TextAttack tool.
This work paves the way for creating more robust and reliable hate-speech detection systems.
- Score: 0.47334880432883714
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advent of social media has given rise to numerous ethical challenges,
with hate speech among the most significant concerns. Researchers are
attempting to tackle this problem by leveraging hate-speech detection and
employing language models to automatically moderate content and promote civil
discourse. Unfortunately, recent studies have revealed that hate-speech
detection systems can be misled by adversarial attacks, raising concerns about
their resilience. While previous research has separately addressed the
robustness of these models under adversarial attacks and their
interpretability, there has been no comprehensive study exploring their
intersection. The novelty of our work lies in combining these two critical
aspects, leveraging interpretability to identify potential vulnerabilities and
enabling the design of targeted adversarial attacks. We present a comprehensive
and comparative analysis of adversarial robustness exhibited by various
hate-speech detection models. Our study evaluates the resilience of these
models against adversarial attacks using explainability techniques. To gain
insights into the models' decision-making processes, we employ the Local
Interpretable Model-agnostic Explanations (LIME) framework. Based on the
explainability results obtained by LIME, we devise and execute targeted attacks
on the text by leveraging the TextAttack tool. Our findings enhance the
understanding of the vulnerabilities and strengths exhibited by
state-of-the-art hate-speech detection models. This work underscores the
importance of incorporating explainability in the development and evaluation of
such models to enhance their resilience against adversarial attacks.
Ultimately, this work paves the way for creating more robust and reliable
hate-speech detection systems, fostering safer online environments and
promoting ethical discourse on social media platforms.
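The abstract describes the pipeline only in prose; below is a minimal sketch of the LIME-then-TextAttack workflow it outlines. The model checkpoint name is a placeholder, and the binary non-hate/hate label set and the TextFooler recipe are illustrative assumptions; the authors' exact attack configuration may differ.

```python
# Minimal sketch of the explainability-guided attack workflow described above.
# Assumptions (not from the paper): a Hugging Face sequence-classification
# checkpoint (placeholder name), labels {0: non-hate, 1: hate}, and the
# TextFooler recipe as the word-substitution attack.
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from lime.lime_text import LimeTextExplainer
from textattack.models.wrappers import HuggingFaceModelWrapper
from textattack.attack_recipes import TextFoolerJin2019

MODEL_NAME = "your-org/hate-speech-classifier"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def predict_proba(texts):
    """Return an (n_texts, n_classes) probability matrix, as LIME expects."""
    enc = tokenizer(list(texts), return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1).numpy()

# Step 1: use LIME to find the tokens that most strongly drive the prediction.
explainer = LimeTextExplainer(class_names=["non-hate", "hate"])
text = "an input sentence the classifier flags as hate speech"
explanation = explainer.explain_instance(text, predict_proba,
                                         num_features=6, num_samples=500)
influential_tokens = [tok for tok, weight in explanation.as_list() if weight > 0]
print("Tokens pushing the prediction toward 'hate':", influential_tokens)

# Step 2: run a word-substitution attack with TextAttack; the LIME output
# indicates which tokens a targeted perturbation should focus on.
wrapper = HuggingFaceModelWrapper(model, tokenizer)
attack = TextFoolerJin2019.build(wrapper)
ground_truth_label = int(np.argmax(predict_proba([text])[0]))
result = attack.attack(text, ground_truth_label)
print(result)  # original vs. perturbed text and whether the label flipped
```

If the attack succeeds, comparing the perturbed words against the LIME weights shows whether the attack exploited the tokens the explanation identified, which is the kind of vulnerability analysis the abstract describes.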
Related papers
- MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z)
- A Systematic Evaluation of Adversarial Attacks against Speech Emotion Recognition Models [6.854732863866882]
Speech emotion recognition (SER) has been gaining attention in recent years due to its potential applications in diverse fields.
Recent studies have shown that deep learning models can be vulnerable to adversarial attacks.
arXiv Detail & Related papers (2024-04-29T09:00:32Z)
- Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack [24.954755569786396]
We propose a framework for a broader class of adversarial attacks, designed to perform minor perturbations in machine-generated content to evade detection.
We consider two attack settings, white-box and black-box, and employ adversarial learning in dynamic scenarios to assess whether the robustness of current detection models can be enhanced.
The empirical results reveal that the current detection models can be compromised in as little as 10 seconds, leading to the misclassification of machine-generated text as human-written content.
arXiv Detail & Related papers (2024-04-02T12:49:22Z)
- SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z)
- Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield [7.5520641322945785]
Large Language Models' safety remains a critical concern due to their vulnerability to adversarial attacks.
We introduce the Adversarial Prompt Shield (APS), a lightweight model that excels in detection accuracy and demonstrates resilience against adversarial prompts.
We also propose novel strategies for autonomously generating adversarial training datasets.
arXiv Detail & Related papers (2023-10-31T22:22:10Z)
- Investigating Human-Identifiable Features Hidden in Adversarial Perturbations [54.39726653562144]
Our study explores up to five attack algorithms across three datasets.
We identify human-identifiable features in adversarial perturbations.
Using pixel-level annotations, we extract such features and demonstrate their ability to compromise target models.
arXiv Detail & Related papers (2023-09-28T22:31:29Z)
- Disentangled Text Representation Learning with Information-Theoretic Perspective for Adversarial Robustness [17.5771010094384]
Adversarial vulnerability remains a major obstacle to constructing reliable NLP systems.
Recent work argues that a model's adversarial vulnerability is caused by non-robust features learned during supervised training.
In this paper, we tackle the adversarial challenge from the view of disentangled representation learning.
arXiv Detail & Related papers (2022-10-26T18:14:39Z)
- Characterizing the adversarial vulnerability of speech self-supervised learning [95.03389072594243]
We make the first attempt to investigate the adversarial vulnerability of this paradigm under attacks from both zero-knowledge and limited-knowledge adversaries.
The experimental results illustrate that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries.
arXiv Detail & Related papers (2021-11-08T08:44:04Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
- Evaluating Deception Detection Model Robustness To Linguistic Variation [10.131671217810581]
We propose an analysis of model robustness against linguistic variation in the setting of deceptive news detection.
We consider two prediction tasks and compare three state-of-the-art embeddings to highlight consistent trends in model performance.
We find that character or mixed ensemble models are the most effective defenses and that character perturbation-based attack tactics are more successful.
arXiv Detail & Related papers (2021-04-23T17:25:38Z)
- Adversarial Attack and Defense of Structured Prediction Models [58.49290114755019]
In this paper, we investigate attacks and defenses for structured prediction tasks in NLP.
The output of structured prediction models is sensitive to small perturbations in the input.
We propose a novel and unified framework that learns to attack a structured prediction model using a sequence-to-sequence model.
arXiv Detail & Related papers (2020-10-04T15:54:03Z)