Towards Procedural Fairness: Uncovering Biases in How a Toxic Language
Classifier Uses Sentiment Information
- URL: http://arxiv.org/abs/2210.10689v1
- Date: Wed, 19 Oct 2022 16:03:25 GMT
- Title: Towards Procedural Fairness: Uncovering Biases in How a Toxic Language
Classifier Uses Sentiment Information
- Authors: Isar Nejadgholi, Esma Balkır, Kathleen C. Fraser, and Svetlana Kiritchenko
- Abstract summary: This work is a step towards evaluating procedural fairness, where unfair processes lead to unfair outcomes.
The produced knowledge can guide debiasing techniques to ensure that important concepts besides identity terms are well-represented in training datasets.
- Score: 7.022948483613112
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous work on the fairness of toxic language classifiers compares model outputs for inputs containing different identity terms, but does not consider the impact of other important concepts present in the context. Here,
besides identity terms, we take into account high-level latent features learned
by the classifier and investigate the interaction between these features and
identity terms. For a multi-class toxic language classifier, we leverage a
concept-based explanation framework to calculate the sensitivity of the model
to the concept of sentiment, which has been used before as a salient feature
for toxic language detection. Our results show that, although for some classes the classifier has learned the sentiment information as expected, this information is outweighed by the influence of identity terms as input features.
This work is a step towards evaluating procedural fairness, where unfair
processes lead to unfair outcomes. The produced knowledge can guide debiasing
techniques to ensure that important concepts besides identity terms are
well-represented in training datasets.
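The concept-based explanation framework mentioned in the abstract measures how sensitive a class prediction is to a human-defined concept such as sentiment. Below is a minimal, self-contained sketch of a TCAV-style sensitivity score, assuming synthetic activations and gradients (hypothetical stand-ins, not the authors' code or data): a linear probe separates concept examples from random ones, and the score is the fraction of examples whose class logit increases in the direction of the resulting concept activation vector.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for hidden-layer activations of sentiment-concept examples vs. random text.
concept_acts = rng.normal(loc=0.5, scale=1.0, size=(200, 64))
random_acts = rng.normal(loc=0.0, scale=1.0, size=(200, 64))

# 1) Concept activation vector (CAV): the normal of a linear probe separating the two sets.
probe = LogisticRegression(max_iter=1000).fit(
    np.vstack([concept_acts, random_acts]),
    np.array([1] * len(concept_acts) + [0] * len(random_acts)),
)
cav = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# 2) Sensitivity score: fraction of class examples whose class logit would increase if the
#    activation moved in the CAV direction (i.e. the directional derivative is positive).
class_grads = rng.normal(size=(500, 64))  # stand-in for d(logit_class)/d(activation) per example
tcav_score = float(np.mean(class_grads @ cav > 0))
print(f"Sensitivity of the class to the sentiment concept: {tcav_score:.2f}")
```

Roughly, a score far from 0.5 would suggest that the class systematically relies on the sentiment concept; the paper contrasts such concept sensitivity with the pull of identity terms.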
Related papers
- Concept-Based Explanations to Test for False Causal Relationships
Learned by Abusive Language Classifiers [7.022948483613113]
We consider three well-known abusive language classifiers trained on large English datasets.
We first examine the unwanted dependencies learned by the classifiers by assessing their accuracy on a challenge set across all decision thresholds.
We then introduce concept-based explanation metrics to assess the influence of a given concept on the predicted labels.
arXiv Detail & Related papers (2023-07-04T19:57:54Z)
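The entry above probes unwanted dependencies by checking challenge-set accuracy across all decision thresholds. A small sketch of that kind of sweep, with synthetic scores and an all-negative challenge set standing in for real model outputs (illustrative assumptions, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins: model scores P(abusive) on a challenge set whose gold labels are all non-abusive,
# e.g. innocuous sentences that merely mention identity terms.
scores = rng.beta(2, 5, size=1000)
labels = np.zeros(1000, dtype=int)

for threshold in np.linspace(0.1, 0.9, 9):
    preds = (scores >= threshold).astype(int)
    accuracy = float(np.mean(preds == labels))
    print(f"threshold={threshold:.1f}  challenge-set accuracy={accuracy:.3f}")
```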
- Human-Guided Fair Classification for Natural Language Processing [9.652938946631735]
We show how to leverage unsupervised style transfer and GPT-3's zero-shot capabilities to generate semantically similar sentences that differ along sensitive attributes.
We validate the generated pairs via an extensive crowdsourcing study, which confirms that many of these pairs align with human intuition about fairness in the context of toxicity classification.
arXiv Detail & Related papers (2022-12-20T10:46:40Z)
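The paper above generates counterfactual pairs with unsupervised style transfer and GPT-3; a far simpler stand-in for the same idea is a rule-based identity-term swap, sketched below (the swap table and sentence are purely illustrative and this is not the authors' generation method):

```python
# Hypothetical one-directional swap table; real work would need a curated, bidirectional lexicon.
IDENTITY_SWAPS = {"women": "men", "muslim": "christian", "gay": "straight"}

def make_counterfactual(sentence: str) -> str:
    """Return a minimally different sentence whose identity terms are swapped."""
    return " ".join(IDENTITY_SWAPS.get(word, word) for word in sentence.lower().split())

original = "the speech targeted muslim women in the crowd"
print(original, "->", make_counterfactual(original))
```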
- Towards Intrinsic Common Discriminative Features Learning for Face Forgery Detection using Adversarial Learning [59.548960057358435]
We propose a novel method that utilizes adversarial learning to eliminate the negative effect of different forgery methods and facial identities.
Our face forgery detection model learns to extract common discriminative features by eliminating the effect of forgery methods and facial identities.
arXiv Detail & Related papers (2022-07-08T09:23:59Z)
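The forgery-detection entry above removes identity- and method-specific information with adversarial learning. One common way to realize such an objective (a sketch under the assumption of a gradient-reversal setup; the paper may use a different formulation) is to feed the shared features to an adversarial head through a gradient reversal layer:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
forgery_head = nn.Linear(64, 2)    # real vs. fake, trained normally
identity_head = nn.Linear(64, 10)  # adversary trying to recover identity / forgery method

x = torch.randn(8, 128)            # toy input features
feats = encoder(x)
forgery_logits = forgery_head(feats)
identity_logits = identity_head(GradReverse.apply(feats, 1.0))
# Minimizing both heads' losses now pushes the encoder to keep forgery cues while
# discarding whatever the adversarial identity head can exploit.
```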
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
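The knowledge-graph entry above studies how sampled triples can be turned into synthetic training data for language models. A toy sketch of such sampling, with made-up triples and templates (the actual work draws from large commonsense knowledge graphs and compares several sampling strategies):

```python
import random

# Illustrative triples and verbalization templates (hypothetical, not the paper's resources).
TRIPLES = [
    ("knife", "UsedFor", "cutting"),
    ("oven", "AtLocation", "kitchen"),
    ("dog", "CapableOf", "barking"),
]
TEMPLATES = {
    "UsedFor": "A {h} is used for {t}.",
    "AtLocation": "You are likely to find a {h} in a {t}.",
    "CapableOf": "A {h} is capable of {t}.",
}

def sample_synthetic_examples(n: int, seed: int = 0) -> list[str]:
    """Uniformly sample triples and verbalize them into synthetic training sentences."""
    rng = random.Random(seed)
    return [TEMPLATES[r].format(h=h, t=t) for h, r, t in (rng.choice(TRIPLES) for _ in range(n))]

print(sample_synthetic_examples(3))
```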
- Necessity and Sufficiency for Explaining Text Classifiers: A Case Study in Hate Speech Detection [7.022948483613112]
We present a novel feature attribution method for explaining text classifiers, and analyze it in the context of hate speech detection.
We provide two complementary and theoretically-grounded scores -- necessity and sufficiency -- resulting in more informative explanations.
We employ our method to explain the predictions of different hate speech detection models on the same set of curated examples from a test suite, and show that different values of necessity and sufficiency for identity terms correspond to different kinds of false positive errors.
arXiv Detail & Related papers (2022-05-06T15:34:48Z)
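The necessity/sufficiency entry above scores features by how much a prediction depends on them. A rough perturbation-based reading of the two quantities for a single token, using a toy keyword classifier (the paper defines the scores more carefully over real models; everything below is a hypothetical stand-in):

```python
def predict(tokens: list[str]) -> int:
    """Toy hate-speech classifier: flags any text containing the placeholder slur token 'XYZ'."""
    return int("XYZ" in tokens)

def necessity(tokens: list[str], feature: str, baseline: str = "[MASK]") -> int:
    """Does masking the feature flip the prediction away from the original label?"""
    masked = [baseline if t == feature else t for t in tokens]
    return int(predict(masked) != predict(tokens))

def sufficiency(tokens: list[str], feature: str, baseline: str = "[MASK]") -> int:
    """Does keeping only the feature (masking everything else) preserve the prediction?"""
    kept_only = [t if t == feature else baseline for t in tokens]
    return int(predict(kept_only) == predict(tokens))

sentence = ["you", "people", "are", "XYZ"]
print("necessity:", necessity(sentence, "XYZ"), "sufficiency:", sufficiency(sentence, "XYZ"))
```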
- Measuring Fairness of Text Classifiers via Prediction Sensitivity [63.56554964580627]
ACCUMULATED PREDICTION SENSITIVITY measures fairness in machine learning models based on the model's prediction sensitivity to perturbations in input features.
We show that the metric can be theoretically linked with a specific notion of group fairness (statistical parity) and individual fairness.
arXiv Detail & Related papers (2022-03-16T15:00:33Z)
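The prediction-sensitivity entry above measures fairness via how strongly a model's output reacts to perturbations of protected input features. A toy illustration of the core quantity for a fixed logistic model, with a hand-picked protected-feature weighting (the paper's metric accumulates this over a dataset with learned weights; the numbers below are synthetic):

```python
import numpy as np

# Toy differentiable classifier: a logistic model with fixed weights; the last feature plays
# the role of a protected attribute.
w = np.array([0.2, -1.5, 0.7, 3.0])

def predict_proba(x: np.ndarray) -> float:
    return float(1.0 / (1.0 + np.exp(-x @ w)))

def prediction_sensitivity(x: np.ndarray, protected_weights: np.ndarray) -> float:
    """Weighted magnitude of d(prediction)/d(x), emphasizing protected input features."""
    p = predict_proba(x)
    grad = p * (1.0 - p) * w  # analytic gradient of sigmoid(w @ x) with respect to x
    return float(np.abs(grad) @ protected_weights)

x = np.array([0.5, 1.0, -0.3, 1.0])
protected = np.array([0.0, 0.0, 0.0, 1.0])  # only the protected attribute contributes
print("prediction sensitivity (toy):", prediction_sensitivity(x, protected))
```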
- Discriminative Attribution from Counterfactuals [64.94009515033984]
We present a method for neural network interpretability by combining feature attribution with counterfactual explanations.
We show that this method can be used to quantitatively evaluate the performance of feature attribution methods in an objective manner.
arXiv Detail & Related papers (2021-09-28T00:53:34Z)
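The counterfactual-attribution entry above combines feature attribution with counterfactual examples so that attribution mass concentrates on class-discriminative features. A very small sketch of that idea for a linear scorer, weighting a gradient attribution by how much each feature changes between an input and its counterfactual (synthetic data; not the paper's method or models):

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.normal(size=64)                 # weights of a toy linear class score w @ x
x_real = rng.normal(size=64)
x_counterfactual = x_real.copy()
x_counterfactual[10:20] += 2.0          # counterfactual differs only in a small feature region

# Gradient attribution weighted by the real-vs-counterfactual difference: only features that
# both matter to the score and actually change between the pair receive attribution mass.
gradient = w                            # for a linear score the input gradient is w itself
attribution = np.abs(gradient) * np.abs(x_real - x_counterfactual)
print("most discriminative features:", np.argsort(attribution)[-5:])
```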
- Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z)
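The referring-expression entry above learns contrastive features over image and object instances. An InfoNCE-style sketch of such an objective for one expression and its candidate objects (random embeddings stand in for the real visual and language encoders):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
expression = torch.randn(1, 128)   # embedding of the referring expression (synthetic)
objects = torch.randn(5, 128)      # embeddings of candidate object instances; index 0 is the target

# Temperature-scaled cosine similarities; the contrastive loss pulls the expression toward its
# object and pushes it away from the other instances in the same image.
similarities = F.cosine_similarity(expression, objects) / 0.07
loss = F.cross_entropy(similarities.unsqueeze(0), torch.tensor([0]))
print("contrastive loss:", float(loss))
```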
- On the Effects of Knowledge-Augmented Data in Word Embeddings [0.6749750044497732]
We propose a novel approach for linguistic knowledge injection through data augmentation to learn word embeddings.
We show our knowledge augmentation approach improves the intrinsic characteristics of the learned embeddings while not significantly altering their results on a downstream text classification task.
arXiv Detail & Related papers (2020-10-05T02:14:13Z)
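The word-embedding entry above injects linguistic knowledge by augmenting the training data. A minimal sketch of such augmentation with a toy synonym lexicon (the corpus and relations are illustrative; the augmented sentences would then be passed to an embedding trainer such as word2vec):

```python
# Toy corpus and synonym lexicon (hypothetical stand-ins for real linguistic resources).
corpus = [
    ["the", "film", "was", "fantastic"],
    ["the", "movie", "felt", "slow"],
]
SYNONYMS = {"film": ["movie", "picture"], "fantastic": ["great", "wonderful"]}

augmented = list(corpus)
for sentence in corpus:
    for i, word in enumerate(sentence):
        for synonym in SYNONYMS.get(word, []):
            # Add a copy of the sentence with the word replaced by its knowledge-derived synonym.
            augmented.append(sentence[:i] + [synonym] + sentence[i + 1:])

print(len(corpus), "original sentences ->", len(augmented), "after knowledge augmentation")
```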
- Fairness by Learning Orthogonal Disentangled Representations [50.82638766862974]
We propose a novel disentanglement approach to the invariant representation problem.
We enforce the meaningful representation to be agnostic to sensitive information via an entropy-based constraint.
The proposed approach is evaluated on five publicly available datasets.
arXiv Detail & Related papers (2020-03-12T11:09:15Z)
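The final entry above disentangles a task representation from sensitive information using orthogonality and an entropy objective. A compact sketch of those two loss terms on toy representations (hypothetical shapes and a stand-in sensitive-attribute classifier, not the paper's architecture):

```python
import torch

torch.manual_seed(0)
# Toy per-example representations from two encoder heads: task-relevant and sensitive.
target_repr = torch.randn(16, 32, requires_grad=True)
sensitive_repr = torch.randn(16, 32, requires_grad=True)

# Orthogonality penalty: drive the per-example inner products of the two representations to zero.
orthogonality_loss = (target_repr * sensitive_repr).sum(dim=1).pow(2).mean()

# Entropy term: make the target representation uninformative about the sensitive attribute by
# maximizing the entropy of a sensitive-attribute classifier's predictions on it.
sensitive_classifier = torch.nn.Linear(32, 2)
probs = torch.softmax(sensitive_classifier(target_repr), dim=1)
entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()

loss = orthogonality_loss - entropy  # minimize the penalty while maximizing entropy
loss.backward()
```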
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.