Contextualizing Hate Speech Classifiers with Post-hoc Explanation
- URL: http://arxiv.org/abs/2005.02439v3
- Date: Mon, 6 Jul 2020 18:54:09 GMT
- Title: Contextualizing Hate Speech Classifiers with Post-hoc Explanation
- Authors: Brendan Kennedy and Xisen Jin and Aida Mostafazadeh Davani and Morteza
Dehghani and Xiang Ren
- Abstract summary: Hate speech classifiers struggle to determine if group identifiers like "gay" or "black" are used in offensive or prejudiced ways.
We propose a novel regularization technique based on these explanations that encourages models to learn from the context.
Our approach improved over baselines in limiting false positives on out-of-domain data.
- Score: 26.044033793878683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hate speech classifiers trained on imbalanced datasets struggle to determine
if group identifiers like "gay" or "black" are used in offensive or prejudiced
ways. Such biases manifest in false positives when these identifiers are
present, due to models' inability to learn the contexts which constitute a
hateful usage of identifiers. We extract Sampling and Occlusion (SOC) post-hoc
explanations from fine-tuned BERT classifiers to efficiently detect bias
towards identity terms.
Then, we propose a novel regularization technique based on these explanations
that encourages models to learn from the context of group identifiers in
addition to the identifiers themselves. Our approach improved over baselines in
limiting false positives on out-of-domain data while maintaining or improving
in-domain performance. Project page:
https://inklab.usc.edu/contextualize-hate-speech/.
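The proposed regularization can be pictured with a short sketch. The snippet below is not the authors' released implementation: it approximates each group identifier's SOC explanation with plain occlusion (the full algorithm additionally samples replacement contexts around the masked phrase), and the identifier list, the penalty weight ALPHA, and the base model are illustrative assumptions.
```python
# A minimal sketch (not the authors' released code) of explanation-based
# regularization, assuming PyTorch + Hugging Face transformers. The SOC
# explanation is approximated by plain occlusion of the identifier token;
# GROUP_IDENTIFIERS, ALPHA, and the base model are illustrative placeholders.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.train()

GROUP_IDENTIFIERS = {"gay", "black", "muslim", "jewish"}  # illustrative subset
ALPHA = 0.1        # regularization strength (hypothetical value)
HATE_CLASS = 1     # index of the "hateful" label


def occlusion_importance(input_ids, attention_mask, token_index):
    """Change in the hate-class logit when one token is masked out
    (a simplified stand-in for the SOC explanation of that token)."""
    masked_ids = input_ids.clone()
    masked_ids[0, token_index] = tokenizer.mask_token_id
    full = model(input_ids=input_ids, attention_mask=attention_mask).logits[0, HATE_CLASS]
    occluded = model(input_ids=masked_ids, attention_mask=attention_mask).logits[0, HATE_CLASS]
    return full - occluded  # differentiable, so it can be regularized


def regularized_loss(text, label):
    """Classification loss plus a penalty on the importance of identifiers."""
    enc = tokenizer(text, return_tensors="pt")
    loss = model(**enc, labels=torch.tensor([label])).loss
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    for i, tok in enumerate(tokens):
        if tok in GROUP_IDENTIFIERS:  # assumes identifiers map to single wordpieces
            phi = occlusion_importance(enc["input_ids"], enc["attention_mask"], i)
            loss = loss + ALPHA * phi ** 2  # push identifier importance toward zero
    return loss


# One toy training step on a non-hateful sentence containing an identifier.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
optimizer.zero_grad()
regularized_loss("being gay is not a crime", label=0).backward()
optimizer.step()
```
Driving the explained importance of identifier tokens toward zero is what, per the abstract, encourages the classifier to rely on the surrounding context rather than on the identifiers themselves.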
Related papers
- Unveiling Social Media Comments with a Novel Named Entity Recognition System for Identity Groups [2.5849042763002426]
We develop a Named Entity Recognition (NER) System for Identity Groups.
Our tool not only detects whether a sentence contains an attack but also tags the sentence tokens corresponding to the mentioned group.
We tested the utility of our tool in a case study on social media, annotating and comparing comments from Facebook related to news mentioning identity groups.
arXiv Detail & Related papers (2024-05-13T19:33:18Z) - Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic
Segmentation [59.37587762543934]
This paper studies the problem of weakly open-vocabulary semantic segmentation (WOVSS).
Existing methods suffer from a granularity inconsistency regarding the usage of group tokens.
We propose the prototypical guidance network (PGSeg) that incorporates multi-modal regularization.
arXiv Detail & Related papers (2023-10-29T13:18:00Z) - Towards Realistic Zero-Shot Classification via Self Structural Semantic
Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z) - Understanding and Mitigating Spurious Correlations in Text
Classification with Neighborhood Analysis [69.07674653828565]
Machine learning models have a tendency to leverage spurious correlations that exist in the training set but may not hold true in general circumstances.
In this paper, we examine the implications of spurious correlations through a novel perspective called neighborhood analysis.
We propose a family of regularization methods, NFL (doN't Forget your Language) to mitigate spurious correlations in text classification.
arXiv Detail & Related papers (2023-05-23T03:55:50Z) - Reusing the Task-specific Classifier as a Discriminator:
Discriminator-free Adversarial Domain Adaptation [55.27563366506407]
We introduce a discriminator-free adversarial learning network (DALN) for unsupervised domain adaptation (UDA)
DALN achieves explicit domain alignment and category distinguishment through a unified objective.
DALN compares favorably against the existing state-of-the-art (SOTA) methods on a variety of public datasets.
arXiv Detail & Related papers (2022-04-08T04:40:18Z) - Dynamically Refined Regularization for Improving Cross-corpora Hate
Speech Detection [30.462596705180534]
Hate speech classifiers exhibit substantial performance degradation when evaluated on datasets different from the source.
Previous work has attempted to mitigate this problem by regularizing specific terms from pre-defined static dictionaries.
We propose to automatically identify and reduce spurious correlations using attribution methods with dynamic refinement of the list of terms.
arXiv Detail & Related papers (2022-03-23T16:58:10Z) - ORDisCo: Effective and Efficient Usage of Incremental Unlabeled Data for
Semi-supervised Continual Learning [52.831894583501395]
Continual learning assumes the incoming data are fully labeled, which might not be applicable in real applications.
We propose deep Online Replay with Discriminator Consistency (ORDisCo) to interdependently learn a classifier with a conditional generative adversarial network (GAN).
We show ORDisCo achieves significant performance improvement on various semi-supervised learning benchmark datasets for SSCL.
arXiv Detail & Related papers (2021-01-02T09:04:14Z) - Hate Speech Detection and Racial Bias Mitigation in Social Media based
on BERT model [1.9336815376402716]
We introduce a transfer learning approach for hate speech detection based on an existing pre-trained language model called BERT.
We evaluate the proposed model on two publicly available datasets annotated for racism, sexism, hate or offensive content on Twitter.
arXiv Detail & Related papers (2020-08-14T16:47:25Z) - Demoting Racial Bias in Hate Speech Detection [39.376886409461775]
In current hate speech datasets, there exists a correlation between annotators' perceptions of toxicity and signals of African American English (AAE)
In this paper, we use adversarial training to mitigate this bias, introducing a hate speech classifier that learns to detect toxic sentences while demoting confounds corresponding to AAE texts.
Experimental results on a hate speech dataset and an AAE dataset suggest that our method is able to substantially reduce the false positive rate for AAE text while only minimally affecting the performance of hate speech classification.
arXiv Detail & Related papers (2020-05-25T17:43:22Z) - Towards classification parity across cohorts [16.21248370949611]
This research work aims to achieve classification parity across explicit as well as implicit sensitive features.
We obtain implicit cohorts by clustering embeddings of each individual trained on the language generated by them using a language model.
We improve classification parity by introducing a modification to the loss function aimed at minimizing the range of model performance across cohorts.
arXiv Detail & Related papers (2020-05-16T16:31:08Z) - Active Learning for Coreference Resolution using Discrete Annotation [76.36423696634584]
We improve upon pairwise annotation for active learning in coreference resolution.
We ask annotators to identify mention antecedents if a presented mention pair is deemed not coreferent.
In experiments with existing benchmark coreference datasets, we show that the signal from this additional question leads to significant performance gains per human-annotation hour.
arXiv Detail & Related papers (2020-04-28T17:17:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.