Hate Speech Detection and Racial Bias Mitigation in Social Media based
on BERT model
- URL: http://arxiv.org/abs/2008.06460v2
- Date: Fri, 28 Aug 2020 10:06:40 GMT
- Title: Hate Speech Detection and Racial Bias Mitigation in Social Media based
on BERT model
- Authors: Marzieh Mozafari, Reza Farahbakhsh, Noel Crespi
- Abstract summary: We introduce a transfer learning approach for hate speech detection based on an existing pre-trained language model called BERT.
We evaluate the proposed model on two publicly available datasets annotated for racism, sexism, hate or offensive content on Twitter.
- Score: 1.9336815376402716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Disparate biases associated with datasets and trained classifiers in hateful
and abusive content identification tasks have raised many concerns recently.
Although the problem of biased datasets in abusive language detection has been
addressed more frequently, biases arising from trained classifiers have received
far less attention. Here, we first introduce a transfer learning approach
for hate speech detection based on an existing pre-trained language model
called BERT and evaluate the proposed model on two publicly available datasets
annotated for racism, sexism, hate or offensive content on Twitter. Next, we
introduce a bias alleviation mechanism for the hate speech detection task to
mitigate the effect of bias in the training set during the fine-tuning of our
pre-trained BERT-based model. Toward that end, we use an existing
regularization method to reweight input samples, thereby decreasing the effect
of the training set's n-grams that are highly correlated with class labels, and
then fine-tune our pre-trained BERT-based model with the new re-weighted
samples.
To evaluate our bias alleviation mechanism, we employ a cross-domain approach
in which we use the classifiers trained on the aforementioned datasets to
predict the labels of two new Twitter datasets, the AAE-aligned and
White-aligned groups, which contain tweets written in African-American English
(AAE) and Standard American English (SAE) respectively. The results show the
existence of systematic racial bias in the trained classifiers, as they tend to
assign tweets written in AAE from the AAE-aligned group to negative classes
such as racism, sexism, hate, and offensive more often than tweets written in
SAE from the White-aligned group. However, the racial bias in our classifiers
is reduced significantly after our bias alleviation mechanism is incorporated.
This work could constitute the first step towards debiasing hate speech and
abusive language detection systems.
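A minimal sketch of the pipeline described above: reweight training samples according to how strongly their n-grams correlate with class labels, fine-tune BERT with a per-sample weighted loss, and compare how often the resulting classifier assigns AAE-aligned versus White-aligned tweets to negative classes. This is an illustration under stated assumptions rather than the authors' released code: the correlation statistic, the weight cap, the `bert-base-uncased` checkpoint, and the helper names (`ngram_sample_weights`, `finetune_weighted`, `negative_class_rate`) are assumptions, and the paper relies on an existing regularization method whose exact formulation may differ.

```python
# Hedged sketch of the abstract's pipeline, not the authors' implementation.
# Assumes torch, transformers and scikit-learn are installed; `texts` is a list
# of tweets and `labels` a list of integer class labels (e.g. 0=neither, 1=hate).
import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.feature_extraction.text import CountVectorizer
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def ngram_sample_weights(texts, labels, max_n=2, cap=5.0):
    """Down-weight samples whose n-grams are strongly correlated with their label.

    For each n-gram g and class c we estimate p(c | g) from co-occurrence counts;
    a sample's bias score is the largest p(label | g) over its own n-grams, and
    its weight is the (capped) inverse of that score, so give-away n-grams count less.
    """
    vec = CountVectorizer(ngram_range=(1, max_n), binary=True)
    X = vec.fit_transform(texts)                       # (num_samples, num_ngrams)
    y = np.asarray(labels)
    counts = np.vstack([X[y == c].sum(axis=0).A1 for c in range(y.max() + 1)])
    p_class_given_ngram = counts / np.clip(counts.sum(axis=0), 1, None)
    weights = np.ones(len(texts))
    for i in range(len(texts)):
        cols = X[i].indices                            # n-grams present in sample i
        if len(cols):
            bias = p_class_given_ngram[y[i], cols].max()
            weights[i] = min(1.0 / max(bias, 1e-6), cap)
    return weights / weights.mean()                    # keep the average weight at 1


def finetune_weighted(texts, labels, weights, epochs=3, lr=2e-5, batch_size=16):
    """Fine-tune a BERT classifier with a per-sample weighted cross-entropy loss."""
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=int(max(labels)) + 1)
    enc = tok(texts, truncation=True, padding=True, return_tensors="pt")
    data = TensorDataset(enc["input_ids"], enc["attention_mask"],
                         torch.tensor(labels),
                         torch.tensor(weights, dtype=torch.float))
    loss_fn = nn.CrossEntropyLoss(reduction="none")    # per-sample loss, then reweight
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for input_ids, attention_mask, yb, wb in DataLoader(data, batch_size, shuffle=True):
            logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
            loss = (loss_fn(logits, yb) * wb).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model, tok


def negative_class_rate(model, tok, texts, negative_ids):
    """Fraction of texts assigned to 'negative' classes (racism/sexism/hate/offensive).

    Comparing this rate on AAE-aligned vs. White-aligned tweets is the
    cross-domain bias check described in the abstract.
    """
    enc = tok(texts, truncation=True, padding=True, return_tensors="pt")
    model.eval()
    with torch.no_grad():
        preds = model(**enc).logits.argmax(dim=-1).tolist()
    return sum(p in negative_ids for p in preds) / len(texts)
```

In the paper's setting one would compute the weights on the racism/sexism or hate/offensive training set, fine-tune with them, and then compare `negative_class_rate` on the AAE-aligned and White-aligned tweet sets before and after reweighting.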
Related papers
- A Study of Nationality Bias in Names and Perplexity using Off-the-Shelf Affect-related Tweet Classifiers [0.0]
We create counterfactual examples with small perturbations on target-domain data instead of relying on templates or specific datasets for bias detection.
On widely used classifiers for subjectivity analysis, including sentiment, emotion, and hate speech, our results demonstrate positive biases related to the language spoken in a country.
arXiv Detail & Related papers (2024-07-01T22:17:17Z) - The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models [78.69526166193236]
Pre-trained Language models (PLMs) have been acknowledged to contain harmful information, such as social biases.
We propose Social Bias Neurons to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias.
As measured by prior metrics from StereoSet, our model achieves a higher degree of fairness while maintaining language modeling ability at low cost.
arXiv Detail & Related papers (2024-06-14T15:41:06Z) - Detecting Bias in Large Language Models: Fine-tuned KcBERT [0.0]
We define such harm as societal bias and assess ethnic, gender, and racial biases in a model fine-tuned with Korean comments.
Our contribution lies in demonstrating that societal bias exists in Korean language models due to language-dependent characteristics.
arXiv Detail & Related papers (2024-03-16T02:27:19Z) - Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding.
We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models (a rough sketch of this projection idea appears after the related-papers list).
arXiv Detail & Related papers (2023-01-31T20:09:33Z) - Detecting Unintended Social Bias in Toxic Language Datasets [32.724030288421474]
This paper introduces a new dataset, ToxicBias, curated from the existing Kaggle competition dataset "Jigsaw Unintended Bias in Toxicity Classification".
The dataset contains instances annotated for five different bias categories, viz., gender, race/ethnicity, religion, political, and LGBTQ.
We train transformer-based models using our curated datasets and report baseline performance for bias identification, target generation, and bias implications.
arXiv Detail & Related papers (2022-10-21T06:50:12Z) - Few-shot Instruction Prompts for Pretrained Language Models to Detect
Social Biases [55.45617404586874]
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs).
We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
arXiv Detail & Related papers (2021-12-15T04:19:52Z) - Prototypical Classifier for Robust Class-Imbalanced Learning [64.96088324684683]
We propose Prototypical, which does not require fitting additional parameters given the embedding network.
Prototypical produces balanced and comparable predictions for all classes even though the training set is class-imbalanced.
We test our method on CIFAR-10LT, CIFAR-100LT and Webvision datasets, observing that Prototypical obtains substantial improvements compared with the state of the art.
arXiv Detail & Related papers (2021-10-22T01:55:01Z) - Mitigating Racial Biases in Toxic Language Detection with an
Equity-Based Ensemble Framework [9.84413545378636]
Recent research has demonstrated how racial biases against users who write African American English exist in popular toxic language datasets.
We propose additional descriptive fairness metrics to better understand the source of these biases.
We show that our proposed framework substantially reduces the racial biases that the model learns from these datasets.
arXiv Detail & Related papers (2021-09-27T15:54:05Z) - Balancing out Bias: Achieving Fairness Through Training Reweighting [58.201275105195485]
Bias in natural language processing arises from models learning characteristics of the author such as gender and race.
Existing methods for mitigating and measuring bias do not directly account for correlations between author demographics and linguistic variables.
This paper introduces a very simple but highly effective method for countering bias using instance reweighting.
arXiv Detail & Related papers (2021-09-16T23:40:28Z) - Improving Robustness by Augmenting Training Sentences with
Predicate-Argument Structures [62.562760228942054]
Existing approaches to improve robustness against dataset biases mostly focus on changing the training objective.
We propose to augment the input sentences in the training data with their corresponding predicate-argument structures.
We show that without targeting a specific bias, our sentence augmentation improves the robustness of transformer models against multiple biases.
arXiv Detail & Related papers (2020-10-23T16:22:05Z) - Demoting Racial Bias in Hate Speech Detection [39.376886409461775]
In current hate speech datasets, there exists a correlation between annotators' perceptions of toxicity and signals of African American English (AAE).
In this paper, we use adversarial training to mitigate this bias, introducing a hate speech classifier that learns to detect toxic sentences while demoting confounds corresponding to AAE texts.
Experimental results on a hate speech dataset and an AAE dataset suggest that our method is able to substantially reduce the false positive rate for AAE text while only minimally affecting the performance of hate speech classification.
arXiv Detail & Related papers (2020-05-25T17:43:22Z)
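As flagged in the "Debiasing Vision-Language Models via Biased Prompts" entry above, the core idea there is to remove biased directions from the text embedding with a projection. The sketch below is a simplified, hedged illustration: it uses a plain orthogonal projection rather than the paper's calibrated projection matrix, and the function name project_out and the way bias directions are obtained (e.g. as differences between embeddings of contrasting prompts) are assumptions, not the paper's API.

```python
# Hedged sketch of projecting bias directions out of text embeddings; a plain
# orthogonal projection stands in for the paper's calibrated projection matrix,
# so treat this as an illustration of the idea only.
import numpy as np


def project_out(embeddings, bias_directions):
    """Remove the span of `bias_directions` from each embedding.

    embeddings:      (n, d) array of text embeddings
    bias_directions: (k, d) array, e.g. differences between embeddings of
                     contrasting biased prompts
    """
    q, _ = np.linalg.qr(bias_directions.T)      # (d, k) orthonormal basis of the bias span
    projection = np.eye(q.shape[0]) - q @ q.T   # P = I - Q Q^T
    return embeddings @ projection.T            # debiased embeddings, same shape as input
```

One would typically feed text embeddings from a vision-language model's text encoder here and then use the debiased embeddings as classifier weights in the usual zero-shot fashion.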