SS-BERT: Mitigating Identity Terms Bias in Toxic Comment Classification by Utilising the Notion of "Subjectivity" and "Identity Terms"
- URL: http://arxiv.org/abs/2109.02691v1
- Date: Mon, 6 Sep 2021 18:40:06 GMT
- Title: SS-BERT: Mitigating Identity Terms Bias in Toxic Comment Classification by Utilising the Notion of "Subjectivity" and "Identity Terms"
- Authors: Zhixue Zhao, Ziqi Zhang, Frank Hopfgartner
- Abstract summary: We propose a novel approach to tackle such bias in toxic comment classification.
We hypothesize that when a comment is made about a group of people that is characterized by an identity term, the likelihood of that comment being toxic is associated with the subjectivity level of the comment.
- Score: 6.2384249607204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Toxic comment classification models are often found biased toward identity
terms which are terms characterizing a specific group of people such as
"Muslim" and "black". Such bias is commonly reflected in false-positive
predictions, i.e. non-toxic comments with identity terms. In this work, we
propose a novel approach to tackle such bias in toxic comment classification,
leveraging the notion of subjectivity level of a comment and the presence of
identity terms. We hypothesize that when a comment is made about a group of
people that is characterized by an identity term, the likelihood of that
comment being toxic is associated with the subjectivity level of the comment,
i.e. the extent to which the comment conveys personal feelings and opinions.
Building upon the BERT model, we propose a new structure that is able to
leverage these features, and thoroughly evaluate our model on 4 datasets of
varying sizes and representing different social media platforms. The results
show that our model can consistently outperform BERT and a SOTA model devised
to address identity term bias in a different way, with a maximum improvement in
F1 of 2.43% and 1.91% respectively.
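The abstract describes fusing two signals, the comment's subjectivity level and the presence of identity terms, with a BERT encoder. The snippet below is a minimal, hypothetical sketch rather than the authors' released code: it concatenates those two scalars with BERT's pooled [CLS] representation before a toxic/non-toxic classification head. The toy identity-term lexicon, the fusion point, and the use of an off-the-shelf subjectivity scorer are all assumptions, since the exact SS-BERT architecture is not spelled out in this summary.

```python
# Hypothetical sketch of a subjectivity- and identity-aware BERT classifier.
# NOT the official SS-BERT implementation; feature fusion details are assumed.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

IDENTITY_TERMS = {"muslim", "black", "gay", "jewish"}  # illustrative subset only

def identity_term_flag(text: str) -> float:
    """Return 1.0 if the comment mentions any term from the (toy) identity lexicon."""
    tokens = (t.strip(".,!?") for t in text.lower().split())
    return 1.0 if any(t in IDENTITY_TERMS for t in tokens) else 0.0

class SubjectivityAwareBert(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased", n_extra: int = 2):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        # The classifier sees BERT's pooled output plus the two scalar features.
        self.classifier = nn.Sequential(
            nn.Dropout(0.1),
            nn.Linear(hidden + n_extra, 2),  # toxic vs. non-toxic
        )

    def forward(self, input_ids, attention_mask, extra_features):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        fused = torch.cat([out.pooler_output, extra_features], dim=-1)
        return self.classifier(fused)

# Usage sketch: the subjectivity score could come from a lexicon-based tool
# such as TextBlob; a placeholder value is used here to avoid the dependency.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
comment = "Muslim people in my city organised a charity run last weekend."
enc = tokenizer(comment, return_tensors="pt", truncation=True, padding=True)
subjectivity = 0.1  # placeholder, e.g. TextBlob(comment).sentiment.subjectivity
extra = torch.tensor([[subjectivity, identity_term_flag(comment)]])
model = SubjectivityAwareBert()
logits = model(enc["input_ids"], enc["attention_mask"], extra)
```

In this sketch the two features are simply concatenated with the pooled representation; the paper may combine them differently, but the example illustrates the hypothesis that a low-subjectivity comment mentioning an identity term should not, by itself, push the prediction toward "toxic".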
Related papers
- Quantifying Bias in Text-to-Image Generative Models [49.60774626839712]
Bias in text-to-image (T2I) models can propagate unfair social representations and may be used to aggressively market ideas or push controversial agendas.
Existing T2I model bias evaluation methods only focus on social biases.
We propose an evaluation methodology to quantify general biases in T2I generative models, without any preconceived notions.
arXiv Detail & Related papers (2023-12-20T14:26:54Z)
- Social Bias Probing: Fairness Benchmarking for Language Models [38.180696489079985]
This paper proposes a novel framework for probing language models for social biases by assessing disparate treatment.
We curate SoFa, a large-scale benchmark designed to address the limitations of existing fairness collections.
We show that biases within language models are more nuanced than acknowledged, indicating a broader scope of encoded biases than previously recognized.
arXiv Detail & Related papers (2023-11-15T16:35:59Z)
- Modeling subjectivity (by Mimicking Annotator Annotation) in toxic comment identification across diverse communities [3.0284081180864675]
This study aims to identify intuitive variances from annotator disagreement using quantitative analysis.
We also evaluate the model's ability to mimic diverse viewpoints on toxicity by varying the size of the training data.
We conclude that subjectivity is evident across all annotator groups, demonstrating the shortcomings of majority-rule voting.
arXiv Detail & Related papers (2023-11-01T00:17:11Z)
- Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems [103.416202777731]
We study "persona biases", which we define to be the sensitivity of dialogue models' harmful behaviors contingent upon the personas they adopt.
We categorize persona biases into biases in harmful expression and harmful agreement, and establish a comprehensive evaluation framework to measure persona biases in five aspects: Offensiveness, Toxic Continuation, Regard, Stereotype Agreement, and Toxic Agreement.
arXiv Detail & Related papers (2023-10-08T21:03:18Z)
- The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks [75.58692290694452]
We compare social biases with non-social biases stemming from choices made during dataset construction that might not even be discernible to the human eye.
We observe that these shallow modifications have a surprising effect on the resulting degree of bias across various models.
arXiv Detail & Related papers (2022-10-18T17:58:39Z)
- COFFEE: Counterfactual Fairness for Personalized Text Generation in Explainable Recommendation [56.520470678876656]
Bias inherent in user-written text can associate different levels of linguistic quality with users' protected attributes.
We introduce a general framework to achieve measure-specific counterfactual fairness in explanation generation.
arXiv Detail & Related papers (2022-10-14T02:29:10Z)
- Exploring Hate Speech Detection with HateXplain and BERT [2.673732496490253]
Hate speech takes many forms, targeting communities with derogatory comments and setting back societal progress.
HateXplain is a recently published dataset and the first to use annotated spans in the form of rationales, along with speech classification categories and targeted communities.
We tune BERT to perform this task in the form of rationales and class prediction, and compare our performance on different metrics spanning across accuracy, explainability and bias.
arXiv Detail & Related papers (2022-08-09T01:32:44Z)
- Is Your Toxicity My Toxicity? Exploring the Impact of Rater Identity on Toxicity Annotation [1.1699472346137738]
We study how raters' self-described identities impact how they annotate toxicity in online comments.
We found that rater identity is a statistically significant factor in how raters annotate toxicity in identity-related comments.
We trained models on the annotations from each of the different rater pools, and compared the scores of these models on comments from several test sets.
arXiv Detail & Related papers (2022-05-01T16:08:48Z)
- Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection [75.54119209776894]
We investigate the effect of annotator identities (who) and beliefs (why) on toxic language annotations.
We consider posts with three characteristics: anti-Black language, African American English dialect, and vulgarity.
Our results show strong associations between annotator identity and beliefs and their ratings of toxicity.
arXiv Detail & Related papers (2021-11-15T18:58:20Z)
- Mitigating Biases in Toxic Language Detection through Invariant Rationalization [70.36701068616367]
Biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection.
We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns.
Our method yields a lower false positive rate on both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
- Reading Between the Demographic Lines: Resolving Sources of Bias in Toxicity Classifiers [0.0]
Perspective API is perhaps the most widely used toxicity classifier in industry.
Google's model tends to unfairly assign higher toxicity scores to comments containing words referring to the identities of commonly targeted groups.
We have constructed several toxicity classifiers with the intention of reducing unintended bias while maintaining strong classification performance.
arXiv Detail & Related papers (2020-06-29T21:40:55Z)