RECAST: Enabling User Recourse and Interpretability of Toxicity
Detection Models with Interactive Visualization
- URL: http://arxiv.org/abs/2102.04427v2
- Date: Wed, 10 Feb 2021 14:42:17 GMT
- Title: RECAST: Enabling User Recourse and Interpretability of Toxicity
Detection Models with Interactive Visualization
- Authors: Austin P Wright, Omar Shaikh, Haekyu Park, Will Epperson, Muhammed
Ahmed, Stephane Pinel, Duen Horng Chau, Diyi Yang
- Abstract summary: We present our work, RECAST, an interactive, open-sourced web tool for visualizing toxicity detection models' predictions.
We found that RECAST was highly effective at helping users reduce toxicity as detected through the model.
This opens a discussion for how toxicity detection models work and should work, and their effect on the future of online discourse.
- Score: 16.35961310670002
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the widespread use of toxic language online, platforms are increasingly
using automated systems that leverage advances in natural language processing
to automatically flag and remove toxic comments. However, most automated
systems -- when detecting and moderating toxic language -- do not provide
feedback to their users, let alone provide an avenue of recourse for these
users to make actionable changes. We present our work, RECAST, an interactive,
open-sourced web tool for visualizing these models' toxic predictions, while
providing alternative suggestions for flagged toxic language. Our work also
provides users with a new path of recourse when using these automated
moderation tools. RECAST highlights text responsible for classifying toxicity,
and allows users to interactively substitute potentially toxic phrases with
neutral alternatives. We examined the effect of RECAST via two large-scale user
evaluations, and found that RECAST was highly effective at helping users reduce
toxicity as detected through the model. Users also gained a stronger
understanding of the underlying toxicity criterion used by black-box models,
enabling transparency and recourse. In addition, we found that when users focus
on optimizing language for these models instead of their own judgement (which
is the implied incentive and goal of deploying automated models), these models
cease to be effective classifiers of toxicity compared to human annotations.
This opens a discussion for how toxicity detection models work and should work,
and their effect on the future of online discourse.
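To make the highlighting and substitution workflow concrete, here is a minimal Python sketch of the general idea: score a comment with a black-box toxicity model, attribute the score to individual tokens by leave-one-token-out ablation, and offer neutral replacements for the most influential tokens. The keyword-weight stub, the attribution scheme, and the substitution table are illustrative assumptions only; RECAST's open-sourced implementation uses its own model, explanations, and suggestion mechanism.

```python
# Minimal sketch (not RECAST's actual implementation) of the two interactions the
# abstract describes: highlighting the tokens most responsible for a toxicity
# prediction, and suggesting neutral substitutes for them. `score_toxicity` is a
# keyword-based stand-in for a black-box toxicity classifier returning a value in [0, 1].

from typing import Dict, List, Tuple

# Hypothetical keyword weights; in practice this would be a trained model's prediction.
TOXIC_WEIGHTS: Dict[str, float] = {"idiot": 0.8, "stupid": 0.6, "dumb": 0.5}

def score_toxicity(text: str) -> float:
    """Stand-in for a black-box toxicity model: returns a score in [0, 1]."""
    return min(1.0, sum(TOXIC_WEIGHTS.get(w, 0.0) for w in text.lower().split()))

def token_attributions(text: str) -> List[Tuple[str, float]]:
    """Leave-one-token-out attribution: how much the score drops when a token is removed."""
    tokens = text.split()
    base = score_toxicity(text)
    out = []
    for i, tok in enumerate(tokens):
        ablated = " ".join(tokens[:i] + tokens[i + 1:])
        out.append((tok, base - score_toxicity(ablated)))
    return out

# Hypothetical substitution table; RECAST derives its suggestions differently.
NEUTRAL_ALTERNATIVES: Dict[str, List[str]] = {
    "idiot": ["person"], "stupid": ["misguided"], "dumb": ["unhelpful"],
}

def suggest_rewrites(text: str, threshold: float = 0.3) -> List[str]:
    """For each highly attributed token, propose a rewrite and report its new score."""
    suggestions = []
    for tok, attr in token_attributions(text):
        if attr < threshold:
            continue
        for alt in NEUTRAL_ALTERNATIVES.get(tok.lower(), []):
            candidate = text.replace(tok, alt)
            suggestions.append(f"{candidate}  (new score: {score_toxicity(candidate):.2f})")
    return suggestions

if __name__ == "__main__":
    comment = "that is a stupid idea"
    print("original score:", score_toxicity(comment))  # 0.60 with the stub weights
    for s in suggest_rewrites(comment):
        print(s)  # "that is a misguided idea  (new score: 0.00)"
```

In the real tool these pieces are interactive: the per-token attributions drive the highlighting in the web interface, and the substitutions are offered inline so users can see the model's score change as they edit.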
Related papers
- Modulating Language Model Experiences through Frictions [56.17593192325438]
Over-consumption of language model outputs risks propagating unchecked errors in the short term and damaging human capabilities in the long term.
We propose selective frictions for language model experiences, inspired by behavioral science interventions, to dampen misuse.
arXiv Detail & Related papers (2024-06-24T16:31:11Z)
- Recourse for reclamation: Chatting with generative language models [2.877217169371665]
We extend the concept of algorithmic recourse to generative language models.
We provide users with a novel mechanism to achieve their desired prediction by dynamically setting thresholds for toxicity filtering.
A pilot study supports the potential of our proposed recourse mechanism (a minimal threshold sketch appears after this list).
arXiv Detail & Related papers (2024-03-21T15:14:25Z)
- Unveiling the Implicit Toxicity in Large Language Models [77.90933074675543]
The open-endedness of large language models (LLMs), combined with their impressive capabilities, may lead to new safety issues when they are exploited for malicious use.
We show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via zero-shot prompting alone.
We propose a reinforcement learning (RL) based attack method to further induce implicit toxicity in LLMs.
arXiv Detail & Related papers (2023-11-29T06:42:36Z)
- ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation [43.356758428820626]
We introduce ToxicChat, a novel benchmark based on real user queries from an open-source chatbot.
Our systematic evaluation of models trained on existing toxicity datasets has shown their shortcomings when applied to this unique domain of ToxicChat.
In the future, ToxicChat can be a valuable resource to drive further advancements toward building a safe and healthy environment for user-AI interactions.
arXiv Detail & Related papers (2023-10-26T13:35:41Z)
- Reward Modeling for Mitigating Toxicity in Transformer-based Language Models [0.0]
Transformer-based language models can generate fluent text and be efficiently adapted to a wide range of natural language generation tasks.
Language models pretrained on large unlabeled web-text corpora have been shown to degenerate into toxic content and to exhibit social biases.
We propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models.
arXiv Detail & Related papers (2022-02-19T19:26:22Z)
- Mitigating Biases in Toxic Language Detection through Invariant Rationalization [70.36701068616367]
Biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection.
We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns.
Our method yields a lower false positive rate for both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
- Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language.
We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection.
Our focus is on lexical markers (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
- RECAST: Interactive Auditing of Automatic Toxicity Detection Models [39.621867230707814]
We present our ongoing work, RECAST, an interactive tool for examining toxicity detection models by visualizing explanations for predictions and providing alternative wordings for detected toxic speech.
arXiv Detail & Related papers (2020-01-07T00:17:52Z)
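The threshold-based recourse mechanism summarized in the "Recourse for reclamation" entry above can be pictured with a short sketch. The function name, return shape, and default cutoff below are illustrative assumptions, not the paper's actual interface.

```python
# Illustrative sketch (assumed interface, not the paper's implementation) of recourse via a
# user-adjustable toxicity threshold: a message is filtered only when the model's score
# exceeds the cutoff the user has chosen, so the user can relax the threshold or rephrase.

def filter_message(text: str, toxicity_score: float, user_threshold: float = 0.5) -> dict:
    """Deliver the message if it passes the user's threshold, else return a recourse hint."""
    if toxicity_score <= user_threshold:
        return {"delivered": True, "text": text}
    return {
        "delivered": False,
        "hint": (f"Scored {toxicity_score:.2f} against your threshold of "
                 f"{user_threshold:.2f}; raise the threshold or rephrase and resubmit."),
    }

# The same message is blocked under a strict threshold and delivered under a laxer one.
print(filter_message("that was a boneheaded move", 0.62, user_threshold=0.5))
print(filter_message("that was a boneheaded move", 0.62, user_threshold=0.7))
```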