A Keyword Based Approach to Understanding the Overpenalization of
Marginalized Groups by English Marginal Abuse Models on Twitter
- URL: http://arxiv.org/abs/2210.06351v1
- Date: Fri, 7 Oct 2022 20:28:00 GMT
- Title: A Keyword Based Approach to Understanding the Overpenalization of
Marginalized Groups by English Marginal Abuse Models on Twitter
- Authors: Kyra Yee, Alice Schoenauer Sebag, Olivia Redfield, Emily Sheng,
Matthias Eck, Luca Belli
- Abstract summary: Harmful content detection models tend to have higher false positive rates for content from marginalized groups.
We propose a principled approach to detecting and measuring the severity of potential harms associated with a text-based model.
We apply our methodology to audit Twitter's English marginal abuse model, which is used for removing amplification eligibility of marginally abusive content.
- Score: 2.9604738405097333
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Harmful content detection models tend to have higher false positive rates for
content from marginalized groups. In the context of marginal abuse modeling on
Twitter, such disproportionate penalization poses the risk of reduced
visibility, where marginalized communities lose the opportunity to voice their
opinion on the platform. Current approaches to algorithmic harm mitigation and
bias detection for NLP models are often very ad hoc and subject to human bias.
We make two main contributions in this paper. First, we design a novel
methodology, which provides a principled approach to detecting and measuring
the severity of potential harms associated with a text-based model. Second, we
apply our methodology to audit Twitter's English marginal abuse model, which is
used for removing amplification eligibility of marginally abusive content.
Without utilizing demographic labels or dialect classifiers, we are still able
to detect and measure the severity of issues related to the over-penalization
of the speech of marginalized communities, such as the use of reclaimed speech,
counterspeech, and identity-related terms. In order to mitigate the associated
harms, we experiment with adding additional true negative examples and find
that doing so provides improvements to our fairness metrics without large
degradations in model performance.
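As a rough illustration of the keyword-level audit the abstract describes, the sketch below compares the model's false positive rate on benign tweets containing a given keyword against the overall false positive rate on benign tweets. It assumes a scored dataset with hypothetical columns text, label (1 = abusive per human annotation), and pred (1 = flagged by the model); the column names, keyword matching, and the ratio used as a severity proxy are illustrative assumptions, not the authors' exact procedure.

```python
import re
import pandas as pd

def keyword_false_positive_rates(df: pd.DataFrame, keywords: list[str]) -> pd.DataFrame:
    """For each keyword, compare the false positive rate on benign tweets
    containing the keyword against the overall false positive rate."""
    benign = df[df["label"] == 0]            # human-labelled non-abusive tweets
    overall_fpr = benign["pred"].mean()      # baseline false positive rate
    rows = []
    for kw in keywords:
        pattern = re.compile(rf"\b{re.escape(kw)}\b", re.IGNORECASE)
        subset = benign[benign["text"].str.contains(pattern, na=False)]
        if subset.empty:
            continue
        kw_fpr = subset["pred"].mean()
        rows.append({
            "keyword": kw,
            "n_benign": len(subset),
            "fpr": kw_fpr,
            "fpr_ratio_vs_overall": kw_fpr / max(overall_fpr, 1e-9),
        })
    return pd.DataFrame(rows).sort_values("fpr_ratio_vs_overall", ascending=False)

# Hypothetical usage: keywords surfacing at the top of this ranking (e.g.
# reclaimed or identity-related terms) would be candidates for closer review
# and for mitigation such as adding true negative training examples.
# audit = keyword_false_positive_rates(scored_tweets, identity_terms)
```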
Related papers
- The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models [78.69526166193236]
Pre-trained Language models (PLMs) have been acknowledged to contain harmful information, such as social biases.
We propose Social Bias Neurons to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias.
As measured by prior metrics from StereoSet, our model achieves a higher degree of fairness while maintaining language modeling ability with low cost.
arXiv Detail & Related papers (2024-06-14T15:41:06Z) - Modes of Analyzing Disinformation Narratives With AI/ML/Text Mining to Assist in Mitigating the Weaponization of Social Media [0.8287206589886879]
This paper highlights the developing need for quantitative modes for capturing and monitoring malicious communication in social media.
There has been a deliberate "weaponization" of messaging through the use of social networks, including by politically oriented entities, both state-sponsored and privately run.
Despite attempts to introduce moderation on major platforms like Facebook and X/Twitter, there are now established alternative social networks that offer completely unmoderated spaces.
arXiv Detail & Related papers (2024-05-25T00:02:14Z) - Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information [50.29934517930506]
DAFair is a novel approach to address social bias in language models.
We leverage prototypical demographic texts and incorporate a regularization term during the fine-tuning process to mitigate bias.
arXiv Detail & Related papers (2024-03-14T15:58:36Z) - Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation.
We propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the decision process.
We find that the observed disparate treatment can, at least in part, be attributed to confounding and mediating attributes and model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z) - Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs).
We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing.
We then unify the literature by proposing three intuitive taxonomies: two for bias evaluation and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z) - Enriching Abusive Language Detection with Community Context [0.3708656266586145]
The use of pejorative expressions can be benign or actively empowering.
When models for abuse detection misclassify these expressions as derogatory, they inadvertently censor productive conversations held by marginalized groups.
Our paper highlights how community context can improve classification outcomes in abusive language detection.
arXiv Detail & Related papers (2022-06-16T20:54:02Z) - Improving Generalizability in Implicitly Abusive Language Detection with
Concept Activation Vectors [8.525950031069687]
We show that general abusive language classifiers tend to be fairly reliable in detecting out-of-domain explicitly abusive utterances but fail to detect new types of more subtle, implicit abuse.
We propose an interpretability technique, based on the Testing Concept Activation Vector (TCAV) method from computer vision, to quantify the sensitivity of a trained model to the human-defined concepts of explicit and implicit abusive language; a minimal sketch of this idea appears after this list.
arXiv Detail & Related papers (2022-04-05T14:52:18Z) - Towards Understanding and Mitigating Social Biases in Language Models [107.82654101403264]
Large-scale pretrained language models (LMs) can be potentially dangerous in manifesting undesirable representational biases.
We propose steps towards mitigating social biases during text generation.
Our empirical results and human evaluation demonstrate effectiveness in mitigating bias while retaining crucial contextual information.
arXiv Detail & Related papers (2021-06-24T17:52:43Z) - Mitigating Biases in Toxic Language Detection through Invariant
Rationalization [70.36701068616367]
Biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection.
We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns.
Our method yields a lower false positive rate for both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z) - Cross-geographic Bias Detection in Toxicity Modeling [9.128264779870538]
We introduce a weakly supervised method to robustly detect lexical biases in broader geocultural contexts.
We demonstrate that our method identifies salient groups of errors and, in a follow-up, show that these groupings reflect human judgments of offensive and inoffensive language in those geographic contexts.
arXiv Detail & Related papers (2021-04-14T17:32:05Z) - Hate Speech Detection and Racial Bias Mitigation in Social Media based
on BERT model [1.9336815376402716]
We introduce a transfer learning approach for hate speech detection based on an existing pre-trained language model called BERT.
We evaluate the proposed model on two publicly available datasets annotated for racism, sexism, hate or offensive content on Twitter.
arXiv Detail & Related papers (2020-08-14T16:47:25Z)
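The Concept Activation Vectors entry above quantifies how sensitive a trained abuse classifier is to human-defined concepts such as explicit or implicit abuse. A minimal TCAV-style sketch is given below, assuming access to penultimate-layer activations of a PyTorch classifier; the classifier_head attribute and the linear-probe setup are illustrative assumptions, not the original authors' implementation.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def concept_activation_vector(concept_acts: np.ndarray, random_acts: np.ndarray) -> np.ndarray:
    """Train a linear probe separating concept examples from random examples;
    the CAV is the probe's normalized weight vector."""
    X = np.vstack([concept_acts, random_acts])
    y = np.array([1] * len(concept_acts) + [0] * len(random_acts))
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    cav = probe.coef_[0]
    return cav / np.linalg.norm(cav)

def tcav_score(model, layer_acts: np.ndarray, cav: np.ndarray, target_class: int = 1) -> float:
    """Fraction of inputs whose target-class logit increases along the CAV
    direction (positive directional derivative), i.e. the concept sensitivity."""
    acts = torch.tensor(layer_acts, dtype=torch.float32, requires_grad=True)
    logits = model.classifier_head(acts)   # hypothetical head mapping activations to logits
    grads = torch.autograd.grad(logits[:, target_class].sum(), acts)[0]
    directional = grads.detach().numpy() @ cav
    return float((directional > 0).mean())
```

A score near 1.0 would indicate that the abuse logit almost always moves with the concept direction, which is how such a probe can reveal a classifier's reliance on, for example, explicit-abuse cues that do not transfer to implicit abuse.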
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.