Reading Between the Demographic Lines: Resolving Sources of Bias in
Toxicity Classifiers
- URL: http://arxiv.org/abs/2006.16402v1
- Date: Mon, 29 Jun 2020 21:40:55 GMT
- Title: Reading Between the Demographic Lines: Resolving Sources of Bias in
Toxicity Classifiers
- Authors: Elizabeth Reichert, Helen Qiu, Jasmine Bayrooti
- Abstract summary: Perspective API is perhaps the most widely used toxicity classifier in industry.
Google's model tends to unfairly assign higher toxicity scores to comments containing words referring to the identities of commonly targeted groups.
We have constructed several toxicity classifiers with the intention of reducing unintended bias while maintaining strong classification performance.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The censorship of toxic comments is often left to the judgment of imperfect
models. Perspective API, a creation of Google technology incubator Jigsaw, is
perhaps the most widely used toxicity classifier in industry; the model is
employed by several online communities including The New York Times to identify
and filter out toxic comments with the goal of preserving online safety.
Unfortunately, Google's model tends to unfairly assign higher toxicity scores
to comments containing words referring to the identities of commonly targeted
groups (e.g., "woman,'' "gay,'' etc.) because these identities are frequently
referenced in a disrespectful manner in the training data. As a result,
comments generated by marginalized groups referencing their identities are
often mistakenly censored. It is important to be cognizant of this unintended
bias and strive to mitigate its effects. To address this issue, we have
constructed several toxicity classifiers with the intention of reducing
unintended bias while maintaining strong classification performance.
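The unintended bias described above is usually quantified by comparing classifier behaviour on comments that mention identity terms against the rest of the data, for example via subgroup AUC or the false positive rate on non-toxic identity-mentioning comments. The following is a minimal sketch of that kind of audit, not the authors' implementation; the identity-term list, the toy thresholding, and the input format are illustrative assumptions.

```python
# Minimal sketch of an identity-term bias audit for a toxicity classifier.
# Assumptions: `comments`, binary `labels`, and model `scores` come from any
# labelled corpus and any toxicity model; the term list is illustrative only.
from sklearn.metrics import roc_auc_score

IDENTITY_TERMS = {"woman", "gay", "muslim", "black"}  # illustrative subset

def mentions_identity(comment: str) -> bool:
    """Crude lexical check for identity-term mentions."""
    tokens = {t.strip('.,!?"\'').lower() for t in comment.split()}
    return bool(tokens & IDENTITY_TERMS)

def subgroup_report(comments, labels, scores, threshold=0.5):
    """Compare AUC and false positive rate inside vs. outside the identity subgroup."""
    in_group = [mentions_identity(c) for c in comments]

    def metrics(mask):
        y = [l for l, m in zip(labels, mask) if m]
        s = [p for p, m in zip(scores, mask) if m]
        negative_scores = [p for l, p in zip(y, s) if l == 0]
        return {
            "n": len(y),
            "auc": roc_auc_score(y, s) if len(set(y)) > 1 else float("nan"),
            # share of non-toxic comments the model would censor
            "fpr": (sum(p >= threshold for p in negative_scores) / len(negative_scores)
                    if negative_scores else float("nan")),
        }

    return {"identity_subgroup": metrics(in_group),
            "background": metrics([not m for m in in_group])}
```

A markedly higher false positive rate in the identity subgroup than in the background is the pattern the abstract describes: non-toxic comments are censored simply because they mention an identity.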
Related papers
- Classification of social media Toxic comments using Machine learning models [0.0]
The abstract outlines the problem of toxic comments on social media platforms, where individuals use disrespectful, abusive, and unreasonable language.
This is referred to as anti-social behavior, and it occurs during online debates, comments, and fights.
The comments containing explicit language can be classified into various categories, such as toxic, severe toxic, obscene, threat, insult, and identity hate.
To protect users from offensive language, companies have started flagging comments and blocking users.
arXiv Detail & Related papers (2023-04-14T05:40:11Z)
- Beyond Plain Toxic: Detection of Inappropriate Statements on Flammable Topics for the Russian Language [76.58220021791955]
We present two text collections labelled according to a binary notion of inappropriateness and a multinomial notion of sensitive topics.
To objectivise the notion of inappropriateness, we define it in a data-driven way through crowdsourcing.
arXiv Detail & Related papers (2022-03-04T15:59:06Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection [75.54119209776894]
We investigate the effect of annotator identities (who) and beliefs (why) on toxic language annotations.
We consider posts with three characteristics: anti-Black language, African American English dialect, and vulgarity.
Our results show strong associations between annotator identity and beliefs and their ratings of toxicity.
arXiv Detail & Related papers (2021-11-15T18:58:20Z)
- SS-BERT: Mitigating Identity Terms Bias in Toxic Comment Classification by Utilising the Notion of "Subjectivity" and "Identity Terms" [6.2384249607204]
We propose a novel approach to tackle such bias in toxic comment classification.
We hypothesize that when a comment is made about a group of people that is characterized by an identity term, the likelihood of that comment being toxic is associated with the subjectivity level of the comment.
arXiv Detail & Related papers (2021-09-06T18:40:06Z)
- Mitigating Biases in Toxic Language Detection through Invariant Rationalization [70.36701068616367]
Biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection.
We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns.
Our method yields a lower false positive rate for both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
- Designing Toxic Content Classification for a Diversity of Perspectives [15.466547856660803]
We survey 17,280 participants to understand how user expectations for what constitutes toxic content differ across demographics, beliefs, and personal experiences.
We find that groups historically at risk of harassment are more likely to flag a random comment drawn from Reddit, Twitter, or 4chan as toxic.
We show how current one-size-fits-all toxicity classification algorithms, like the Perspective API from Jigsaw, can improve in accuracy by 86% on average through personalized model tuning.
arXiv Detail & Related papers (2021-06-04T16:45:15Z)
- Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language.
We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection.
Our focus is on lexical markers (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
- Reducing Unintended Identity Bias in Russian Hate Speech Detection [0.21485350418225244]
This paper describes our efforts towards classifying hate speech in Russian.
We propose simple techniques for reducing unintended bias, such as generating training data with language models using terms and words related to protected identities as context (a simplified sketch of this augmentation idea follows the list).
arXiv Detail & Related papers (2020-10-22T12:54:14Z)
- Poisoned classifiers are not only backdoored, they are fundamentally broken [84.67778403778442]
Under a commonly-studied backdoor poisoning attack against classification models, an attacker adds a small trigger to a subset of the training data.
It is often assumed that the poisoned classifier is vulnerable exclusively to the adversary who possesses the trigger.
In this paper, we show empirically that this view of backdoored classifiers is incorrect.
arXiv Detail & Related papers (2020-10-18T19:42:44Z)
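Several of the entries above reduce identity-term bias by rebalancing training data so that identity mentions also appear in clearly non-toxic contexts; the Russian hate speech paper generates such examples with language models. The sketch below imitates that idea with hand-written neutral templates instead of a language model; the term list, templates, and function name are illustrative assumptions, not taken from any of the papers.

```python
# Minimal sketch of identity-term data augmentation: add benign, non-toxic
# examples that mention protected identities so a classifier stops treating
# the terms themselves as evidence of toxicity.
# Assumption: fixed templates stand in for the language-model generation
# used in the cited paper, which also targets Russian rather than English.
import random

IDENTITY_TERMS = ["women", "gay people", "muslims", "black people"]  # illustrative
NEUTRAL_TEMPLATES = [
    "I had lunch with a group of {term} today.",
    "The article interviews several {term} about the new policy.",
    "Many {term} attended the conference this year.",
]

def augment_non_toxic(n_examples: int, seed: int = 0):
    """Generate (comment, label) pairs labelled non-toxic (0) that mention identities."""
    rng = random.Random(seed)
    return [
        (rng.choice(NEUTRAL_TEMPLATES).format(term=rng.choice(IDENTITY_TERMS)), 0)
        for _ in range(n_examples)
    ]

# Example: rebalance a training set before fitting any toxicity classifier.
train_data = [("you people are all terrible", 1), ("have a nice day", 0)]
train_data += augment_non_toxic(100)
```

Adding enough of these benign mentions weakens the spurious association between an identity term and the toxic label, which is the unintended bias the main paper sets out to reduce.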