Improving Generalizability in Implicitly Abusive Language Detection with Concept Activation Vectors
- URL: http://arxiv.org/abs/2204.02261v1
- Date: Tue, 5 Apr 2022 14:52:18 GMT
- Title: Improving Generalizability in Implicitly Abusive Language Detection with Concept Activation Vectors
- Authors: Isar Nejadgholi, Kathleen C. Fraser, Svetlana Kiritchenko
- Abstract summary: We show that general abusive language classifiers tend to be fairly reliable in detecting out-of-domain explicitly abusive utterances but fail to detect new types of more subtle, implicit abuse.
We propose an interpretability technique, based on the Testing Concept Activation Vector (TCAV) method from computer vision, to quantify the sensitivity of a trained model to the human-defined concepts of explicit and implicit abusive language.
- Score: 8.525950031069687
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robustness of machine learning models on ever-changing real-world data is
critical, especially for applications affecting human well-being such as
content moderation. New kinds of abusive language continually emerge in online
discussions in response to current events (e.g., COVID-19), and the deployed
abuse detection systems should be updated regularly to remain accurate. In this
paper, we show that general abusive language classifiers tend to be fairly
reliable in detecting out-of-domain explicitly abusive utterances but fail to
detect new types of more subtle, implicit abuse. Next, we propose an
interpretability technique, based on the Testing Concept Activation Vector
(TCAV) method from computer vision, to quantify the sensitivity of a trained
model to the human-defined concepts of explicit and implicit abusive language,
and use that to explain the generalizability of the model on new data, in this
case, COVID-related anti-Asian hate speech. Extending this technique, we
introduce a novel metric, Degree of Explicitness, for a single instance and
show that the new metric is beneficial in suggesting out-of-domain unlabeled
examples to effectively enrich the training data with informative, implicitly
abusive texts.
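To make the approach concrete, the following minimal Python sketch (NumPy + scikit-learn) illustrates the TCAV machinery the abstract describes: fit a linear probe that separates concept activations from random activations, take its normal vector as the concept activation vector (CAV), and measure how often the class-logit gradient points along it. This is not the authors' released code; the helper names (concept_acts, random_acts, class_grads) are illustrative, and degree_of_explicitness is one plausible reading of the paper's metric, not its exact formula.

```python
# Hedged sketch of TCAV-style concept sensitivity, assuming activations and
# class-logit gradients for a chosen layer have already been extracted.
import numpy as np
from sklearn.linear_model import LogisticRegression

def compute_cav(concept_acts: np.ndarray, random_acts: np.ndarray) -> np.ndarray:
    """Fit a linear probe separating concept vs. random activations;
    the unit-normalized normal of its decision boundary is the CAV."""
    X = np.vstack([concept_acts, random_acts])
    y = np.array([1] * len(concept_acts) + [0] * len(random_acts))
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    cav = probe.coef_[0]
    return cav / np.linalg.norm(cav)

def tcav_score(class_grads: np.ndarray, cav: np.ndarray) -> float:
    """Fraction of examples whose class-logit gradient has a positive
    directional derivative along the CAV, i.e. the model's sensitivity
    to the concept (here, explicit or implicit abuse)."""
    return float(np.mean(class_grads @ cav > 0))

def degree_of_explicitness(instance_grad: np.ndarray, explicit_cavs: list) -> float:
    """Illustrative per-instance score (an assumption, not the paper's exact
    definition): average directional derivative of the abusive-class logit
    along several 'explicit abuse' CAVs, e.g. trained against different
    random counterexample sets."""
    return float(np.mean([instance_grad @ cav for cav in explicit_cavs]))
```

Under this reading, unlabeled out-of-domain texts could be ranked by degree_of_explicitness, with the lowest-scoring (most implicit) candidates sent for annotation, matching the data-enrichment use case described in the abstract.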
Related papers
- Enhancing AI-based Generation of Software Exploits with Contextual Information [9.327315119028809]
The study employs a dataset comprising real shellcodes to evaluate the models across various scenarios.
The experiments are designed to assess the models' resilience against incomplete descriptions, their proficiency in leveraging context for enhanced accuracy, and their ability to discern irrelevant information.
The models demonstrate an ability to filter out unnecessary context, maintaining high levels of accuracy in the generation of offensive security code.
arXiv Detail & Related papers (2024-08-05T11:52:34Z) - Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack [24.954755569786396]
We propose a framework for a broader class of adversarial attacks, designed to perform minor perturbations in machine-generated content to evade detection.
We consider two attack settings: white-box and black-box, and employ adversarial learning in dynamic scenarios to assess the potential enhancement of the current detection model's robustness.
The empirical results reveal that the current detection models can be compromised in as little as 10 seconds, leading to the misclassification of machine-generated text as human-written content.
arXiv Detail & Related papers (2024-04-02T12:49:22Z) - The Gift of Feedback: Improving ASR Model Quality by Learning from User
Corrections through Federated Learning [20.643270151774182]
We seek to continually learn from on-device user corrections through Federated Learning (FL).
We explore techniques to target fresh terms that the model has not previously encountered, learn long-tail words, and mitigate catastrophic forgetting.
In experimental evaluations, we find that the proposed techniques improve model recognition of fresh terms, while preserving quality on the overall language distribution.
arXiv Detail & Related papers (2023-09-29T21:04:10Z) - Learning Symbolic Rules over Abstract Meaning Representations for
Textual Reinforcement Learning [63.148199057487226]
We propose a modular, NEuroSymbolic Textual Agent (NESTA) that combines a generic semantic generalization with a rule induction system to learn interpretable rules as policies.
Our experiments show that the proposed NESTA method outperforms deep reinforcement learning-based techniques by achieving better generalization to unseen test games and learning from fewer training interactions.
arXiv Detail & Related papers (2023-07-05T23:21:05Z) - Commonsense Knowledge Transfer for Pre-trained Language Models [83.01121484432801]
We introduce commonsense knowledge transfer, a framework to transfer the commonsense knowledge stored in a neural commonsense knowledge model to a general-purpose pre-trained language model.
It first exploits general texts to form queries for extracting commonsense knowledge from the neural commonsense knowledge model.
It then refines the language model with two self-supervised objectives: commonsense mask infilling and commonsense relation prediction.
arXiv Detail & Related papers (2023-06-04T15:44:51Z) - Verifying the Robustness of Automatic Credibility Assessment [79.08422736721764]
Text classification methods have been widely investigated as a way to detect content of low credibility.
In some cases, even seemingly insignificant changes in the input text can mislead the models.
We introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
arXiv Detail & Related papers (2023-03-14T16:11:47Z) - Estimating the Adversarial Robustness of Attributions in Text with
Transformers [44.745873282080346]
We establish a novel definition of attribution robustness (AR) in text classification, based on Lipschitz continuity.
We then propose our novel TransformerExplanationAttack (TEA), a strong adversary that provides a tight estimation for attribution in text classification.
arXiv Detail & Related papers (2022-12-18T20:18:59Z) - No Time Like the Present: Effects of Language Change on Automated
Comment Moderation [0.0]
The spread of online hate has become a significant problem for newspapers that host comment sections.
There is growing interest in using machine learning and natural language processing for automated abusive language detection.
Using a new dataset of German newspaper comments, we show that classifiers trained with naive ML techniques will underperform on future data.
arXiv Detail & Related papers (2022-07-08T16:39:21Z) - Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of
Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z) - When Does Translation Require Context? A Data-driven, Multilingual
Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify discourse phenomena and evaluate model performance on them.
arXiv Detail & Related papers (2021-09-15T17:29:30Z) - A Controllable Model of Grounded Response Generation [122.7121624884747]
Current end-to-end neural conversation models inherently lack the flexibility to impose semantic control in the response generation process.
We propose a framework that we call controllable grounded response generation (CGRG).
We show that, using this framework, a transformer-based model with a novel inductive attention mechanism, trained on a conversation-like Reddit dataset, outperforms strong generation baselines.
arXiv Detail & Related papers (2020-05-01T21:22:08Z)