Bridging Fairness and Explainability: Can Input-Based Explanations Promote Fairness in Hate Speech Detection?
- URL: http://arxiv.org/abs/2509.22291v1
- Date: Fri, 26 Sep 2025 12:53:20 GMT
- Title: Bridging Fairness and Explainability: Can Input-Based Explanations Promote Fairness in Hate Speech Detection?
- Authors: Yifan Wang, Mayank Jobanputra, Ji-Ung Lee, Soyoung Oh, Isabel Valera, Vera Demberg
- Abstract summary: We conduct the first systematic study of the relationship between explainability and fairness in hate speech detection. We examine three key dimensions: (1) identifying biased predictions, (2) selecting fair models, and (3) mitigating bias during model training. Our findings show that input-based explanations can effectively detect biased predictions, but they are unreliable for selecting fair models among candidates.
- Score: 22.18673021255833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural language processing (NLP) models often replicate or amplify social bias from training data, raising concerns about fairness. At the same time, their black-box nature makes it difficult for users to recognize biased predictions and for developers to effectively mitigate them. While some studies suggest that input-based explanations can help detect and mitigate bias, others question their reliability in ensuring fairness. Existing research on explainability in fair NLP has been predominantly qualitative, with limited large-scale quantitative analysis. In this work, we conduct the first systematic study of the relationship between explainability and fairness in hate speech detection, focusing on both encoder- and decoder-only models. We examine three key dimensions: (1) identifying biased predictions, (2) selecting fair models, and (3) mitigating bias during model training. Our findings show that input-based explanations can effectively detect biased predictions and serve as useful supervision for reducing bias during training, but they are unreliable for selecting fair models among candidates.
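As a rough illustration of the first dimension (flagging potentially biased predictions from input-based explanations), the sketch below uses a simple occlusion-based attribution and checks whether identity terms dominate the attribution mass behind a hate prediction. The checkpoint name, identity-term list, class index, and threshold are illustrative assumptions, not the paper's actual setup.

```python
# Illustrative sketch (not the paper's implementation): occlusion-based attribution
# over identity terms to flag potentially biased hate-speech predictions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "cardiffnlp/twitter-roberta-base-hate"   # assumed hate-speech classifier
IDENTITY_TERMS = {"muslim", "jewish", "black", "gay", "women"}  # toy term list

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def hate_prob(text: str) -> float:
    """Probability assigned to the 'hate' class (assumed to be index 1)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def occlusion_attributions(text: str) -> dict[str, float]:
    """Attribution of each word = drop in hate probability when it is removed."""
    words = text.split()
    base = hate_prob(text)
    return {
        w: base - hate_prob(" ".join(words[:i] + words[i + 1:]))
        for i, w in enumerate(words)
    }

def flag_identity_bias(text: str, threshold: float = 0.5) -> bool:
    """Flag a hate prediction when identity terms carry most of the attribution mass."""
    attr = occlusion_attributions(text)
    total = sum(abs(v) for v in attr.values()) or 1e-9
    identity_mass = sum(
        abs(v) for w, v in attr.items() if w.lower().strip(".,!?") in IDENTITY_TERMS
    )
    return hate_prob(text) > 0.5 and identity_mass / total > threshold

print(flag_identity_bias("I went to the market with my muslim neighbour."))
```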
Related papers
- Whence Is A Model Fair? Fixing Fairness Bugs via Propensity Score Matching [0.49157446832511503]
We investigate whether the way training and testing data are sampled affects the reliability of fairness metrics. Since training and test sets are often randomly sampled from the same population, bias present in the training data may still exist in the test data. We propose FairMatch, a post-processing method that applies propensity score matching to evaluate and mitigate bias.
arXiv Detail & Related papers (2025-04-23T19:28:30Z)
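The propensity-score-matching idea behind FairMatch can be sketched roughly as follows; this is a generic illustration on synthetic data with greedy nearest-neighbour matching, not the paper's actual FairMatch procedure.

```python
# Generic propensity-score-matching sketch (not the FairMatch implementation):
# estimate the propensity of belonging to the protected group from the features,
# match each protected instance to the closest unprotected one, and compare
# positive-prediction rates on the matched pairs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # features
a = X[:, 0] + rng.normal(size=1000) > 0        # protected attribute (synthetic)
y_hat = (X[:, 0] + 0.5 * a > 0).astype(int)    # predictions under audit (synthetic)

# 1) Propensity of protected-group membership given the features.
propensity = LogisticRegression().fit(X, a).predict_proba(X)[:, 1]

# 2) Greedy nearest-neighbour matching on the propensity score.
prot, unprot = np.where(a)[0], np.where(~a)[0]
matches = unprot[
    np.abs(propensity[prot][:, None] - propensity[unprot][None, :]).argmin(axis=1)
]

# 3) Compare positive-prediction rates on matched pairs; a large gap suggests bias.
gap = y_hat[prot].mean() - y_hat[matches].mean()
print(f"matched positive-rate gap: {gap:.3f}")
```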
- Identifying and Mitigating Social Bias Knowledge in Language Models [52.52955281662332]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases. FAST surpasses state-of-the-art baselines with superior debiasing performance. This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z)
- Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction [56.17020601803071]
Recent research shows that pre-trained language models (PLMs) suffer from "prompt bias" in factual knowledge extraction.
This paper aims to improve the reliability of existing benchmarks by thoroughly investigating and mitigating prompt bias.
arXiv Detail & Related papers (2024-03-15T02:04:35Z)
- Learning for Counterfactual Fairness from Observational Data [62.43249746968616]
Fairness-aware machine learning aims to eliminate biases of learning models against certain subgroups described by certain protected (sensitive) attributes such as race, gender, and age.
A prerequisite for existing methods to achieve counterfactual fairness is the prior human knowledge of the causal model for the data.
In this work, we address the problem of counterfactually fair prediction from observational data without given causal models by proposing a novel framework CLAIRE.
arXiv Detail & Related papers (2023-07-17T04:08:29Z)
- On Comparing Fair Classifiers under Data Bias [42.43344286660331]
We study the effect of varying data biases on the accuracy and fairness of fair classifiers.
Our experiments show how to integrate a measure of data bias risk in the existing fairness dashboards for real-world deployments.
arXiv Detail & Related papers (2023-02-12T13:04:46Z)
- Fair Enough: Standardizing Evaluation and Model Selection for Fairness Research in NLP [64.45845091719002]
Modern NLP systems exhibit a range of biases, which a growing literature on model debiasing attempts to correct.
This paper seeks to clarify the current situation and plot a course for meaningful progress in fair learning.
arXiv Detail & Related papers (2023-02-11T14:54:00Z)
- Fairness and Explainability: Bridging the Gap Towards Fair Model Explanations [12.248793742165278]
We bridge the gap between fairness and explainability by presenting a novel perspective of procedure-oriented fairness based on explanations.
We propose a Comprehensive Fairness Algorithm (CFA), which simultaneously fulfills multiple objectives - improving traditional fairness, satisfying explanation fairness, and maintaining the utility performance.
arXiv Detail & Related papers (2022-12-07T18:35:54Z)
- InterFair: Debiasing with Natural Language Feedback for Fair Interpretable Predictions [30.246111670391375]
We argue that a favorable debiasing method should use sensitive information 'fairly,' with explanations, rather than blindly eliminating it.
We explore two interactive setups with a frozen predictive model and show that users able to provide feedback can achieve a better and fairer balance between task performance and bias mitigation.
arXiv Detail & Related papers (2022-10-14T00:54:12Z)
- Prisoners of Their Own Devices: How Models Induce Data Bias in Performative Prediction [4.874780144224057]
A biased model can make decisions that disproportionately harm certain groups in society.
Much work has been devoted to measuring unfairness in static ML environments, but not in dynamic, performative prediction ones.
We propose a taxonomy to characterize bias in the data, and study cases where it is shaped by model behaviour.
arXiv Detail & Related papers (2022-06-27T10:56:04Z)
- Towards Equal Opportunity Fairness through Adversarial Learning [64.45845091719002]
Adversarial training is a common approach for bias mitigation in natural language processing.
We propose an augmented discriminator for adversarial training, which takes the target class as input to create richer features.
arXiv Detail & Related papers (2022-03-12T02:22:58Z)
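The general idea of a class-augmented adversary can be sketched as below: the adversary tries to recover the protected attribute from the encoder representation concatenated with a one-hot encoding of the target class, and a gradient-reversal layer pushes the encoder to discard that information per class. Dimensions, optimiser, and data are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of class-augmented adversarial debiasing (illustrative, not the
# paper's exact model): the adversary sees the hidden representation together with
# a one-hot target class, and gradient reversal removes protected information.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the encoder.
        return -ctx.lambd * grad_output, None

class DebiasedClassifier(nn.Module):
    def __init__(self, in_dim=768, hid=256, n_classes=2, n_protected=2, lambd=1.0):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())
        self.classifier = nn.Linear(hid, n_classes)
        # Augmented discriminator: hidden state concatenated with one-hot class.
        self.adversary = nn.Linear(hid + n_classes, n_protected)
        self.lambd, self.n_classes = lambd, n_classes

    def forward(self, x, y):
        h = self.encoder(x)
        task_logits = self.classifier(h)
        rev = GradReverse.apply(h, self.lambd)
        adv_in = torch.cat([rev, F.one_hot(y, self.n_classes).float()], dim=-1)
        return task_logits, self.adversary(adv_in)

# Toy training step: minimise the task loss while, through gradient reversal,
# making the protected attribute hard to recover from the representation.
model = DebiasedClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 768)
y = torch.randint(0, 2, (8,))   # task labels (e.g. hate / not hate)
z = torch.randint(0, 2, (8,))   # protected attribute (synthetic)
opt.zero_grad()
task_logits, adv_logits = model(x, y)
loss = F.cross_entropy(task_logits, y) + F.cross_entropy(adv_logits, z)
loss.backward()
opt.step()
```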
- Balancing out Bias: Achieving Fairness Through Training Reweighting [58.201275105195485]
Bias in natural language processing arises from models learning characteristics of the author such as gender and race.
Existing methods for mitigating and measuring bias do not directly account for correlations between author demographics and linguistic variables.
This paper introduces a very simple but highly effective method for countering bias using instance reweighting.
arXiv Detail & Related papers (2021-09-16T23:40:28Z)
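A generic version of instance reweighting for this setting can be sketched as follows: weight each training example inversely to the frequency of its (protected attribute, label) cell, so that correlations between author demographics and labels are balanced out. This is an illustrative scheme, not necessarily the paper's exact weighting.

```python
# Generic instance-reweighting sketch (not the paper's exact scheme): weight each
# example so that every (protected attribute, label) combination contributes
# equally, breaking the demographic-label correlation in the training signal.
from collections import Counter

def balanced_weights(labels, protected):
    """One weight per instance, inversely proportional to its (z, y) cell frequency."""
    counts = Counter(zip(protected, labels))
    n, n_cells = len(labels), len(counts)
    return [n / (n_cells * counts[(z, y)]) for z, y in zip(protected, labels)]

# Toy example: label 1 is over-represented for group "A".
labels    = [1, 1, 1, 0, 0, 1, 0, 0]
protected = ["A", "A", "A", "A", "B", "B", "B", "B"]
weights = balanced_weights(labels, protected)
print([round(w, 2) for w in weights])
# The frequent (A, 1) cell is down-weighted; these weights can then be plugged
# into a weighted loss (e.g. per-example cross-entropy multiplied by the weight).
```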
This list is automatically generated from the titles and abstracts of the papers in this site.