Cyberbullying Classifiers are Sensitive to Model-Agnostic Perturbations
- URL: http://arxiv.org/abs/2201.06384v1
- Date: Mon, 17 Jan 2022 12:48:27 GMT
- Title: Cyberbullying Classifiers are Sensitive to Model-Agnostic Perturbations
- Authors: Chris Emmery, Ákos Kádár, Grzegorz Chrupała, Walter Daelemans
- Abstract summary: This study is the first to investigate the effect of adversarial behavior and augmentation for cyberbullying detection.
We demonstrate that model-agnostic lexical substitutions significantly hurt performance.
Augmentations proposed in prior work on toxicity prove to be less effective.
- Score: 15.152559543181523
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: A limited number of studies investigate the role of model-agnostic
adversarial behavior in toxic content classification. As toxicity classifiers
predominantly rely on lexical cues, (deliberately) creative and evolving
language use can be detrimental to the utility of current corpora and
state-of-the-art models when they are deployed for content moderation. The less
training data is available, the more vulnerable models might become. This study
is, to our knowledge, the first to investigate the effect of adversarial
behavior and augmentation for cyberbullying detection. We demonstrate that
model-agnostic lexical substitutions significantly hurt classifier performance.
Moreover, when these perturbed samples are used for augmentation, we show
models become robust against word-level perturbations at a slight trade-off in
overall task performance. Augmentations proposed in prior work on toxicity
prove to be less effective. Our results underline the need for such evaluations
in online harm areas with small corpora. The perturbed data, models, and code
are available for reproduction at https://github.com/cmry/augtox
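To make the word-level attack and augmentation described above concrete, the sketch below shows how a model-agnostic lexical substitution might perturb messages and how perturbed positive samples can be folded back into the training data. This is a minimal illustration only: the substitution lexicon and the `perturb`/`augment` helpers are hypothetical and do not reproduce the authors' implementation (available at https://github.com/cmry/augtox).

```python
# Minimal sketch of model-agnostic lexical substitution and augmentation.
# The lexicon and helper names are illustrative assumptions, not the paper's code.
import random

# Hypothetical lexicon mapping toxic cue words to substitutes an adversary might use.
SUBSTITUTIONS = {
    "stupid": ["stoopid", "dense"],
    "loser": ["l0ser", "failure"],
}

def perturb(text: str, rate: float = 0.5, seed: int = 0) -> str:
    """Replace a fraction of known cue words, independent of any target model."""
    rng = random.Random(seed)
    out = []
    for tok in text.split():
        subs = SUBSTITUTIONS.get(tok.lower())
        out.append(rng.choice(subs) if subs and rng.random() < rate else tok)
    return " ".join(out)

def augment(corpus: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Add a perturbed copy of each positive (bullying) sample to the training data."""
    augmented = list(corpus)
    for text, label in corpus:
        if label == 1:
            augmented.append((perturb(text), label))
    return augmented

if __name__ == "__main__":
    data = [("you are such a loser", 1), ("see you at practice", 0)]
    print(augment(data))
```

Training on the augmented set is what the abstract refers to as trading a small amount of overall task performance for robustness against word-level perturbations.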
Related papers
- Dissecting Fine-Tuning Unlearning in Large Language Models [12.749301272512222]
Fine-tuning-based unlearning methods are the prevailing approach for removing harmful, sensitive, or copyrighted information from large language models.
However, the true effectiveness of these methods is unclear.
In this work, we delve into the limitations of fine-tuning-based unlearning through activation patching and restoration experiments.
arXiv Detail & Related papers (2024-10-09T06:58:09Z) - Unlearnable Examples Detection via Iterative Filtering [84.59070204221366]
Deep neural networks are proven to be vulnerable to data poisoning attacks.
Detecting poisoned samples within a mixed dataset is both valuable and challenging.
We propose an Iterative Filtering approach for identifying unlearnable examples (UEs).
arXiv Detail & Related papers (2024-08-15T13:26:13Z) - Mitigating annotation shift in cancer classification using single image generative models [1.1864334278373239]
This study simulates, analyses and mitigates annotation shifts in cancer classification in the breast mammography domain.
We propose a training data augmentation approach based on single-image generative models for the affected class.
Our study offers key insights into annotation shift in deep learning breast cancer classification and explores the potential of single-image generative models to overcome domain shift challenges.
arXiv Detail & Related papers (2024-05-30T07:02:50Z) - Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z) - Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, Memorization Discrepancy, to explore defenses using model-level information.
By implicitly transferring changes in the data manipulation to changes in the model outputs, Memorization Discrepancy can discover imperceptible poison samples.
We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
arXiv Detail & Related papers (2023-06-06T14:45:24Z) - Investigating Bias In Automatic Toxic Comment Detection: An Empirical Study [1.5609988622100528]
With the surge in online platforms, user engagement through comments and reactions has risen sharply.
A large portion of these textual comments is abusive, rude, and offensive to the audience.
When machine learning systems are put in place to screen such comments, biases present in the training data get passed on to the classifier, leading to discrimination against certain classes, religions, and genders.
arXiv Detail & Related papers (2021-08-14T08:24:13Z) - Explainable Adversarial Attacks in Deep Neural Networks Using Activation Profiles [69.9674326582747]
This paper presents a visual framework to investigate neural network models subjected to adversarial examples.
We show how observing activation profiles can quickly pinpoint exploited areas in a model.
arXiv Detail & Related papers (2021-03-18T13:04:21Z) - ToxCCIn: Toxic Content Classification with Interpretability [16.153683223016973]
Explanations are important for tasks like offensive language or toxicity detection on social media.
We propose a technique to improve the interpretability of transformer models, based on a simple and powerful assumption.
We find this approach effective and able to produce explanations that exceed the quality of those provided by logistic regression analysis.
arXiv Detail & Related papers (2021-03-01T22:17:10Z) - Firearm Detection via Convolutional Neural Networks: Comparing a Semantic Segmentation Model Against End-to-End Solutions [68.8204255655161]
Threat detection of weapons and aggressive behavior from live video can be used for rapid detection and prevention of potentially deadly incidents.
One way for achieving this is through the use of artificial intelligence and, in particular, machine learning for image analysis.
We compare a traditional monolithic end-to-end deep learning model with a previously proposed model based on an ensemble of simpler neural networks that detects firearms via semantic segmentation.
arXiv Detail & Related papers (2020-12-17T15:19:29Z) - On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)