On The Robustness of Offensive Language Classifiers
- URL: http://arxiv.org/abs/2203.11331v1
- Date: Mon, 21 Mar 2022 20:44:30 GMT
- Title: On The Robustness of Offensive Language Classifiers
- Authors: Jonathan Rusert, Zubair Shafiq, Padmini Srinivasan
- Abstract summary: Social media platforms are deploying machine learning based offensive language classification systems to combat hateful, racist, and other forms of offensive speech at scale.
We study the robustness of state-of-the-art offensive language classifiers against more crafty adversarial attacks.
Our results show that these crafty adversarial attacks can degrade the accuracy of offensive language classifiers by more than 50% while also being able to preserve the readability and meaning of the modified text.
- Score: 10.742675209112623
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Social media platforms are deploying machine learning based offensive
language classification systems to combat hateful, racist, and other forms of
offensive speech at scale. However, despite their real-world deployment, we do
not yet comprehensively understand the extent to which offensive language
classifiers are robust against adversarial attacks. Prior work in this space is
limited to studying robustness of offensive language classifiers against
primitive attacks such as misspellings and extraneous spaces. To address this
gap, we systematically analyze the robustness of state-of-the-art offensive
language classifiers against more crafty adversarial attacks that leverage
greedy- and attention-based word selection and context-aware embeddings for
word replacement. Our results on multiple datasets show that these crafty
adversarial attacks can degrade the accuracy of offensive language classifiers
by more than 50% while also being able to preserve the readability and meaning
of the modified text.
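The attack strategy the abstract describes (rank words by importance, then substitute the most influential ones) can be sketched at a high level. This is a minimal illustrative sketch only: the toy classifier and the hand-written candidate table below are assumptions for demonstration, standing in for the trained classifiers and context-aware embedding models (e.g. a masked language model) the paper actually uses.

```python
# Toy stand-in for a black-box offensive-language classifier: returns
# a pseudo-probability that the text is offensive. Illustrative only.
OFFENSIVE_WEIGHTS = {"stupid": 0.6, "idiot": 0.7, "awful": 0.4}

def toy_classifier(tokens):
    """Stand-in scorer: pseudo-probability the text is offensive."""
    return min(1.0, sum(OFFENSIVE_WEIGHTS.get(t, 0.0) for t in tokens))

# Hypothetical replacement candidates; a real attack would generate
# these from context-aware embeddings rather than a fixed table.
CANDIDATES = {"stupid": ["st0pid", "silly"], "idiot": ["1diot", "goof"]}

def greedy_attack(tokens, threshold=0.5):
    """Greedy word-selection attack: rank positions by how much
    deleting each word lowers the offensive score, then replace the
    most influential words until the classifier's decision flips."""
    tokens = list(tokens)
    base = toy_classifier(tokens)
    # Importance of position i = score drop when word i is removed.
    importance = []
    for i, _ in enumerate(tokens):
        masked = tokens[:i] + tokens[i + 1:]
        importance.append((base - toy_classifier(masked), i))
    for _, i in sorted(importance, reverse=True):
        for cand in CANDIDATES.get(tokens[i], []):
            trial = tokens[:i] + [cand] + tokens[i + 1:]
            if toy_classifier(trial) < toy_classifier(tokens):
                tokens[i] = cand  # keep the first improving swap
                break
        if toy_classifier(tokens) < threshold:
            break  # classifier no longer flags the text
    return tokens

adversarial = greedy_attack(["you", "are", "stupid"])
```

Attention-based variants differ only in the ranking step, scoring word importance with the model's attention weights instead of deletion-based score drops.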
Related papers
- Developing Linguistic Patterns to Mitigate Inherent Human Bias in
Offensive Language Detection [1.6574413179773761]
We propose a linguistic data augmentation approach to reduce bias in labeling processes.
This approach has the potential to improve offensive language classification tasks across multiple languages.
arXiv Detail & Related papers (2023-12-04T10:20:36Z)
- A New Generation of Perspective API: Efficient Multilingual
Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- Identifying Adversarial Attacks on Text Classifiers [32.958568467774704]
In this paper, we analyze adversarial text to determine which methods were used to create it.
Our first contribution is an extensive dataset for attack detection and labeling.
As our second contribution, we use this dataset to develop and benchmark a number of classifiers for attack identification.
arXiv Detail & Related papers (2022-01-21T06:16:04Z)
- Learning-based Hybrid Local Search for the Hard-label Textual Attack [53.92227690452377]
We consider a rarely investigated but more rigorous setting, namely the hard-label attack, in which the attacker can only access the predicted label.
We propose a novel hard-label attack, the Learning-based Hybrid Local Search (LHLS) algorithm.
Our LHLS significantly outperforms existing hard-label attacks in both attack performance and adversarial example quality.
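The hard-label setting above is stricter than the score-based setting: the attacker sees only a discrete label per query, never confidence scores or gradients. The toy oracle and naive swap loop below are illustrative assumptions sketching that constraint, not the LHLS algorithm itself.

```python
def hard_label_oracle(text, _flagged=("idiot", "stupid")):
    """Toy black-box model exposing only a 0/1 label (1 = offensive).
    Stand-in for a deployed classifier; illustrative only."""
    return 1 if any(w in text.split() for w in _flagged) else 0

def naive_hard_label_attack(text, swaps, max_queries=50):
    """Hypothetical baseline: try word swaps until the label flips.
    The attacker can only count label queries, so there is no score
    signal to guide the search -- the core difficulty of this setting."""
    queries = 0
    tokens = text.split()
    for i, tok in enumerate(tokens):
        for cand in swaps.get(tok, []):
            trial = tokens[:i] + [cand] + tokens[i + 1:]
            queries += 1
            if queries > max_queries:
                return None, queries  # query budget exhausted
            if hard_label_oracle(" ".join(trial)) == 0:
                return " ".join(trial), queries  # label flipped
    return None, queries
```

A learned local search improves on this baseline by spending the query budget on promising swaps rather than enumerating them blindly.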
arXiv Detail & Related papers (2022-01-20T14:16:07Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- Automatic Expansion and Retargeting of Arabic Offensive Language
Training [12.111859709582617]
We employ two key insights: replies on Twitter often imply opposition, and some accounts are persistently offensive toward specific targets.
We show the efficacy of the approach on Arabic tweets, with 13% and 79% relative F1-measure improvements in entity-specific offensive language detection.
arXiv Detail & Related papers (2021-11-18T08:25:09Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of
Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text
Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
- It's Morphin' Time! Combating Linguistic Discrimination with
Inflectional Perturbations [68.16751625956243]
Training only on perfect Standard English corpora predisposes neural networks to discriminate against minorities from non-standard linguistic backgrounds.
We perturb the inflectional morphology of words to craft plausible and semantically similar adversarial examples.
arXiv Detail & Related papers (2020-05-09T04:01:43Z)
- Universal Adversarial Attacks with Natural Triggers for Text
Classification [30.74579821832117]
We develop adversarial attacks that appear closer to natural English phrases and yet confuse classification systems.
Our attacks effectively reduce model accuracy on classification tasks while being harder to identify than prior attacks.
arXiv Detail & Related papers (2020-05-01T01:58:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.