Related papers: Challenges in Automated Debiasing for Toxic Language Detection

Challenges in Automated Debiasing for Toxic Language Detection

URL: http://arxiv.org/abs/2102.00086v1
Date: Fri, 29 Jan 2021 22:03:17 GMT
Title: Challenges in Automated Debiasing for Toxic Language Detection
Authors: Xuhui Zhou, Maarten Sap, Swabha Swayamdipta, Noah A. Smith, Yejin Choi
Abstract summary: Biased associations have been a challenge in the development of classifiers for detecting toxic language. We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection. Our focus is on lexical (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English)
Score: 81.04406231100323
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Biased associations have been a challenge in the development of classifiers for detecting toxic language, hindering both fairness and accuracy. As potential solutions, we investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection. Our focus is on lexical (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English). Our comprehensive experiments establish that existing methods are limited in their ability to prevent biased behavior in current toxicity detectors. We then propose an automatic, dialect-aware data correction method, as a proof-of-concept. Despite the use of synthetic labels, this method reduces dialectal associations with toxicity. Overall, our findings show that debiasing a model trained on biased toxic language data is not as effective as simply relabeling the data to remove existing biases.

Related papers

Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation [6.781972039785424]
Recent generative large language models (LLMs) show remarkable performance in non-English languages. When prompted in those languages they tend to express higher harmful social biases and toxicity levels. We investigate the impact of different finetuning methods on the model's bias and toxicity, but also on its ability to produce fluent and diverse text.
arXiv Detail & Related papers (2024-12-18T17:05:08Z)
On the Role of Speech Data in Reducing Toxicity Detection Bias [22.44133159647888]
We produce a set of high-quality group annotations for the multilingual MuTox dataset. We then leverage these annotations to systematically compare speech- and text-based toxicity classifiers. Our findings indicate that access to speech data during inference supports reduced bias against group mentions.
arXiv Detail & Related papers (2024-11-12T19:26:43Z)
Unlearnable Examples Detection via Iterative Filtering [84.59070204221366]
Deep neural networks are proven to be vulnerable to data poisoning attacks. It is quite beneficial and challenging to detect poisoned samples from a mixed dataset. We propose an Iterative Filtering approach for UEs identification.
arXiv Detail & Related papers (2024-08-15T13:26:13Z)
Comparing Hallucination Detection Metrics for Multilingual Generation [62.97224994631494]
This paper assesses how well various factual hallucination detection metrics identify hallucinations in generated biographical summaries across languages. We compare how well automatic metrics correlate to each other and whether they agree with human judgments of factuality. Our analysis reveals that while the lexical metrics are ineffective, NLI-based metrics perform well, correlating with human annotations in many settings and often outperforming supervised models.
arXiv Detail & Related papers (2024-02-16T08:10:34Z)
Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination [54.865941973768905]
We propose a novel and practical bias mitigation method, CRISPR, to eliminate bias neurons of language models in instruction-following settings. CRISPR automatically determines biased outputs and categorizes neurons that affect the biased outputs as bias neurons using an explainability method. Experimental results demonstrate the effectiveness of our method in mitigating biases under zero-shot instruction-following settings without losing the model's task performance and existing knowledge.
arXiv Detail & Related papers (2023-11-16T07:16:55Z)
Toxicity Detection with Generative Prompt-based Inference [3.9741109244650823]
It is a long-known risk that language models (LMs), once trained on corpus containing undesirable content, have the power to manifest biases and toxicity. In this work, we explore the generative variant of zero-shot prompt-based toxicity detection with comprehensive trials on prompt engineering.
arXiv Detail & Related papers (2022-05-24T22:44:43Z)
Detoxifying Language Models with a Toxic Corpus [16.7345472998388]
We propose to use toxic corpus as an additional resource to reduce the toxicity. Our result shows that toxic corpus can indeed help to reduce the toxicity of the language generation process substantially.
arXiv Detail & Related papers (2022-04-30T18:25:18Z)
Mitigating Biases in Toxic Language Detection through Invariant Rationalization [70.36701068616367]
biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection. We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns. Our method yields lower false positive rate in both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
RECAST: Enabling User Recourse and Interpretability of Toxicity Detection Models with Interactive Visualization [16.35961310670002]
We present our work, RECAST, an interactive, open-sourced web tool for visualizing toxic models' predictions. We found that RECAST was highly effective at helping users reduce toxicity as detected through the model. This opens a discussion for how toxicity detection models work and should work, and their effect on the future of online discourse.
arXiv Detail & Related papers (2021-02-08T18:37:50Z)
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment. We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.