ToxCCIn: Toxic Content Classification with Interpretability
- URL: http://arxiv.org/abs/2103.01328v1
- Date: Mon, 1 Mar 2021 22:17:10 GMT
- Title: ToxCCIn: Toxic Content Classification with Interpretability
- Authors: Tong Xiang, Sean MacAvaney, Eugene Yang, Nazli Goharian
- Abstract summary: Explanations are important for tasks like offensive language or toxicity detection on social media.
We propose a technique to improve the interpretability of transformer models, based on a simple and powerful assumption.
We find this approach effective: it can produce explanations that exceed the quality of those provided by Logistic Regression analysis.
- Score: 16.153683223016973
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the recent successes of transformer-based models in terms of
effectiveness on a variety of tasks, their decisions often remain opaque to
humans. Explanations are particularly important for tasks like offensive
language or toxicity detection on social media because a manual appeal process
is often in place to dispute automatically flagged content. In this work, we
propose a technique to improve the interpretability of these models, based on a
simple and powerful assumption: a post is at least as toxic as its most toxic
span. We incorporate this assumption into transformer models by scoring a post
based on the maximum toxicity of its spans and augmenting the training process
to identify correct spans. We find this approach effective: it can produce
explanations that exceed the quality of those provided by Logistic Regression
analysis (often regarded as a highly-interpretable model), according to a human
study.
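A minimal sketch of the core scoring rule may help: the assumption "a post is at least as toxic as its most toxic span," approximated here at the token level with max-pooling, so that the per-token scores double as the explanation. The module names, the dummy encoder, and the omission of the span-supervision term used during training are all assumptions of this sketch, not the authors' implementation.
```python
import torch
import torch.nn as nn

class MaxSpanToxicityScorer(nn.Module):
    """Scores a post by the maximum toxicity over its tokens/spans."""
    def __init__(self, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder                    # any transformer: (B, T) -> (B, T, H)
        self.token_scorer = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask)      # (B, T, H)
        token_logits = self.token_scorer(hidden).squeeze(-1)  # (B, T)
        # Padding must never become the "most toxic" position.
        token_logits = token_logits.masked_fill(attention_mask == 0, float("-inf"))
        post_logit, _ = token_logits.max(dim=-1)  # post score = max over tokens
        return post_logit, token_logits           # score + per-token explanation

# Toy usage with a dummy encoder standing in for, e.g., a BERT encoder:
class DummyEncoder(nn.Module):
    def __init__(self, vocab=1000, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
    def forward(self, input_ids, attention_mask):
        return self.emb(input_ids)

model = MaxSpanToxicityScorer(DummyEncoder(), hidden_size=64)
ids = torch.randint(0, 1000, (2, 8))
mask = torch.ones(2, 8, dtype=torch.long)
post_logit, token_logits = model(ids, mask)
```
Because the post-level logit is the maximum over token-level logits, the tokens that attain that maximum directly identify the span offered as the explanation.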
Related papers
- Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates how well these methods capture the shift of fine-tuning datasets away from the initial pre-training data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
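For context, here is a minimal sketch of the low-rank fine-tuning technique whose side effects the paper studies (a LoRA-style adapter); class and parameter names are illustrative, not taken from the paper.
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable rank-r update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no-op at start
        self.scale = alpha / rank

    def forward(self, x):
        # Full-rank frozen path plus scaled low-rank trainable update.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```
Gradients flow only through A and B, so fine-tuning can move the weights only within a rank-r subspace; the paper's finding is that this restriction can leave pre-trained biases and toxic behaviors largely intact.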
arXiv Detail & Related papers (2024-05-28T20:43:53Z)
- DPP-Based Adversarial Prompt Searching for Language Models [56.73828162194457]
Auto-regressive Selective Replacement Ascent (ASRA) is a discrete optimization algorithm that selects prompts based on both quality and similarity using a determinantal point process (DPP).
Experimental results on six different pre-trained language models demonstrate the efficacy of ASRA for eliciting toxic content.
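A rough sketch of the DPP ingredient: choose a subset of candidate prompts that balances per-prompt quality against mutual similarity via greedy MAP inference on a quality-weighted kernel. This is the standard DPP recipe, assumed here for illustration rather than taken from ASRA.
```python
import numpy as np

def greedy_dpp_select(quality, similarity, k):
    """quality: (n,) scores; similarity: (n, n) PSD matrix (e.g. E @ E.T for
    normalized prompt embeddings E); returns indices of k selected prompts."""
    q = np.asarray(quality, dtype=float)
    L = np.outer(q, q) * similarity            # quality-weighted DPP kernel
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(q)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            gain = logdet if sign > 0 else -np.inf
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)                  # item with the largest det gain
    return selected
```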
arXiv Detail & Related papers (2024-03-01T05:28:06Z)
- Decoding Susceptibility: Modeling Misbelief to Misinformation Through a Computational Approach [61.04606493712002]
Susceptibility to misinformation describes the degree of belief in unverifiable claims, a latent quantity that is not directly observable.
Existing susceptibility studies heavily rely on self-reported beliefs.
We propose a computational approach to model users' latent susceptibility levels.
arXiv Detail & Related papers (2023-11-16T07:22:56Z)
- Goodtriever: Adaptive Toxicity Mitigation with Retrieval-augmented Models [11.805944680474823]
Goodtriever is a flexible methodology that matches the current state of the art in toxicity mitigation.
By incorporating a retrieval-based approach at decoding time, Goodtriever enables toxicity-controlled text generation.
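A hedged sketch of retrieval-based control at decoding time: the language model's next-token distribution is blended with distributions induced by neighbors retrieved from non-toxic and toxic datastores. The specific ensemble below is an illustrative assumption, not Goodtriever's exact formulation.
```python
import torch

def controlled_next_token_probs(lm_logits, p_nontoxic, p_toxic,
                                alpha=0.3, beta=0.3):
    """lm_logits: (V,) raw logits; p_nontoxic / p_toxic: (V,) next-token
    distributions estimated from retrieved datastore neighbors."""
    p_lm = torch.softmax(lm_logits, dim=-1)
    # Boost mass supported by non-toxic neighbors, suppress mass supported
    # by toxic neighbors, then renormalize to a proper distribution.
    p = p_lm * (1 + alpha * p_nontoxic) / (1 + beta * p_toxic)
    return p / p.sum()
```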
arXiv Detail & Related papers (2023-10-11T15:30:35Z)
- Semantic Image Attack for Visual Model Diagnosis [80.36063332820568]
In practice, metric analysis on a specific train and test dataset does not guarantee reliable or fair ML models.
This paper proposes Semantic Image Attack (SIA), an adversarial-attack-based method that generates semantic adversarial images.
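SIA's semantic losses are not reproduced here, but the gradient-based attack loop it builds on is standard. A minimal PGD-style sketch, assuming a differentiable model and a loss to ascend (e.g. cross-entropy, or a semantic attribute loss):
```python
import torch

def pgd_attack(model, x, y, loss_fn, eps=8/255, step=2/255, iters=10):
    """Projected gradient ascent on loss_fn within an L-inf ball around x."""
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + step * grad.sign()        # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project into eps-ball
            x_adv = x_adv.clamp(0, 1)                 # keep a valid image
    return x_adv.detach()
```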
arXiv Detail & Related papers (2023-03-23T03:13:04Z)
- Unified Detoxifying and Debiasing in Language Generation via Inference-time Adaptive Optimization [32.50246008433889]
Pre-trained language models (PLMs) have prospered in various natural language generation (NLG) tasks due to their ability to generate fairly fluent text.
These models are observed to capture and reproduce harmful content from their training corpora, typically toxic language and social biases, raising severe ethical concerns.
We propose UDDIA, the first unified detoxifying and debiasing framework, which jointly formalizes the two problems as rectifying the output space.
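As a toy illustration of "rectifying the output space" at inference time: tilt the next-token logits against a critic that scores each token for toxicity or bias. Both the critic and the tilting are assumptions of this sketch, not UDDIA's adaptive optimization.
```python
import torch

def rectified_logits(lm_logits, critic_scores, strength=2.0):
    """lm_logits: (V,) next-token logits; critic_scores: (V,) with higher
    values meaning more toxic/biased. Equivalent to multiplying the token
    probabilities by exp(-strength * score) before renormalization."""
    return lm_logits - strength * critic_scores
```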
arXiv Detail & Related papers (2022-10-10T08:45:25Z)
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss-function method in which the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Reward Modeling for Mitigating Toxicity in Transformer-based Language Models [0.0]
Transformer-based language models generate fluent text and can be efficiently adapted to various natural language generation tasks.
Language models pretrained on large unlabeled web text corpora have been shown to produce toxic content and to exhibit social biases.
We propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models.
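A minimal sketch of the reinforcement-learning recipe such a method builds on: sample continuations from the language model, score them with a toxicity reward model, and apply REINFORCE. The mean baseline and the omission of any KL penalty against the original model are simplifications of this sketch.
```python
import torch

def reinforce_step(logprobs, toxicity, optimizer):
    """logprobs: (B,) differentiable sums of token log-probs for sampled
    continuations; toxicity: (B,) reward-model scores in [0, 1]."""
    reward = 1.0 - toxicity                    # less toxic -> higher reward
    baseline = reward.mean()                   # simple variance reduction
    loss = -((reward - baseline).detach() * logprobs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```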
arXiv Detail & Related papers (2022-02-19T19:26:22Z)
- Cyberbullying Classifiers are Sensitive to Model-Agnostic Perturbations [15.152559543181523]
This study is the first to investigate the effect of adversarial behavior and augmentation for cyberbullying detection.
We demonstrate that model-agnostic lexical substitutions significantly hurt performance.
Augmentations proposed in prior work on toxicity prove to be less effective.
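Model-agnostic lexical substitutions of the kind tested can be as simple as obfuscating words without ever querying the classifier; a toy sketch (the substitution table is invented for illustration):
```python
import random

SUBS = {"you": "u", "are": "r", "hate": "h8"}  # toy obfuscation table

def perturb(text, p=0.5, seed=0):
    """Randomly replace words with obfuscated variants, independent of any model."""
    rng = random.Random(seed)
    out = []
    for tok in text.split():
        key = tok.lower()
        out.append(SUBS[key] if key in SUBS and rng.random() < p else tok)
    return " ".join(out)
```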
arXiv Detail & Related papers (2022-01-17T12:48:27Z)
- UoB at SemEval-2021 Task 5: Extending Pre-Trained Language Models to Include Task and Domain-Specific Information for Toxic Span Prediction [0.8376091455761259]
Toxicity is pervasive in social media and poses a major threat to the health of online communities.
The recent introduction of pre-trained language models, which have achieved state-of-the-art results on many NLP tasks, has transformed how we approach natural language processing.
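Task 5 asks for character-level toxic spans, commonly framed as token classification followed by decoding the per-token tags back into character offsets. A sketch of that decoding step (the binary 0/1 tagging scheme, rather than BIO, is an assumption):
```python
def tags_to_spans(offsets, tags):
    """offsets: per-token (start, end) character offsets; tags: per-token
    0/1 toxicity labels. Returns merged (start, end) character spans."""
    spans, start, prev_end = [], None, None
    for (s, e), t in zip(offsets, tags):
        if t == 1 and start is None:
            start = s                        # open a new toxic span
        if t == 0 and start is not None:
            spans.append((start, prev_end))  # close span at previous token
            start = None
        prev_end = e
    if start is not None:
        spans.append((start, prev_end))      # span runs to the last token
    return spans

# e.g. tags_to_spans([(0, 3), (4, 9), (10, 14)], [0, 1, 1]) -> [(4, 14)]
```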
arXiv Detail & Related papers (2021-10-07T18:29:06Z)
- Rectified Meta-Learning from Noisy Labels for Robust Image-based Plant Disease Diagnosis [64.82680813427054]
Plant diseases are one of the main threats to food security and crop production.
One popular approach is to cast this problem as a leaf image classification task, which can be addressed by powerful convolutional neural networks (CNNs).
We propose a novel framework that incorporates a rectified meta-learning module into the common CNN paradigm to train a noise-robust deep network without extra supervision.
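One generic noise-robust ingredient related to what the rectified meta-learning module targets: down-weighting high-loss, likely mislabeled samples in each batch. A toy sketch, not the paper's method:
```python
import torch

def small_loss_selection(per_sample_loss, keep_ratio=0.7):
    """Keep only the keep_ratio fraction of samples with the smallest loss
    (likely clean labels) and average their loss for the backward pass."""
    k = max(1, int(keep_ratio * per_sample_loss.numel()))
    kept, _ = torch.topk(-per_sample_loss, k)   # k smallest losses, negated
    return (-kept).mean()
```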
arXiv Detail & Related papers (2020-03-17T09:51:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.