Towards Procedural Fairness: Uncovering Biases in How a Toxic Language
Classifier Uses Sentiment Information
- URL: http://arxiv.org/abs/2210.10689v1
- Date: Wed, 19 Oct 2022 16:03:25 GMT
- Title: Towards Procedural Fairness: Uncovering Biases in How a Toxic Language
Classifier Uses Sentiment Information
- Authors: Isar Nejadgholi, Esma Balkır, Kathleen C. Fraser, and Svetlana Kiritchenko
- Abstract summary: This work is a step towards evaluating procedural fairness, where unfair processes lead to unfair outcomes.
The produced knowledge can guide debiasing techniques to ensure that important concepts besides identity terms are well-represented in training datasets.
- Score: 7.022948483613112
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous works on the fairness of toxic language classifiers compare the
output of models with different identity terms as input features but do not
consider the impact of other important concepts present in the context. Here,
besides identity terms, we take into account high-level latent features learned
by the classifier and investigate the interaction between these features and
identity terms. For a multi-class toxic language classifier, we leverage a
concept-based explanation framework to calculate the sensitivity of the model
to the concept of sentiment, which has been used before as a salient feature
for toxic language detection. Our results show that although the classifier
has learned the sentiment information as expected for some classes, this
information is outweighed by the influence of identity terms as input features.
This work is a step towards evaluating procedural fairness, where unfair
processes lead to unfair outcomes. The produced knowledge can guide debiasing
techniques to ensure that important concepts besides identity terms are
well-represented in training datasets.
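The concept-sensitivity measurement described in the abstract belongs to the TCAV family of concept-based explanations: a linear probe separates concept examples from random ones in a model's activation space, and the resulting concept activation vector (CAV) is compared against the gradient of a class logit. The sketch below is a minimal illustration on synthetic data, not the paper's actual pipeline; the activations, gradients, and dimensions are hypothetical stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 16  # hypothetical width of the probed layer

# Stand-ins for layer activations of sentiment-concept examples vs. random examples.
concept_acts = rng.normal(1.0, 1.0, (50, dim))
random_acts = rng.normal(0.0, 1.0, (50, dim))

# 1) Train a linear probe separating concept from random activations;
#    its (normalized) weight vector is the concept activation vector (CAV).
X = np.vstack([concept_acts, random_acts])
y = np.array([1] * 50 + [0] * 50)
probe = LogisticRegression().fit(X, y)
cav = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# 2) For each test input, project the gradient of the class logit with respect
#    to the layer activations (random stand-ins here) onto the CAV.
grads = rng.normal(0.0, 1.0, (200, dim))
directional_derivs = grads @ cav

# 3) TCAV score: fraction of inputs whose class logit increases along the CAV.
#    A score far from 0.5 indicates the class is sensitive to the concept.
tcav_score = float(np.mean(directional_derivs > 0))
print(tcav_score)
```

In the paper's setting, the concept examples would be sentiment-bearing sentences and the gradients would come from the toxic language classifier's per-class logits, so each toxicity class receives its own sentiment-sensitivity score.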
Related papers
- ImpScore: A Learnable Metric For Quantifying The Implicitness Level of Language [40.4052848203136]
Implicit language is essential for natural language processing systems to achieve precise text understanding and facilitate natural interactions with users.
This paper develops a scalar metric that quantifies the implicitness level of language without relying on external references.
ImpScore is trained using pairwise contrastive learning on a specially curated dataset comprising 112,580 (implicit sentence, explicit sentence) pairs.
arXiv Detail & Related papers (2024-11-07T20:23:29Z)
- Concept-Based Explanations to Test for False Causal Relationships Learned by Abusive Language Classifiers [7.022948483613113]
We consider three well-known abusive language classifiers trained on large English datasets.
We first examine the unwanted dependencies learned by the classifiers by assessing their accuracy on a challenge set across all decision thresholds.
We then introduce concept-based explanation metrics to assess the influence of the concept on the labels.
arXiv Detail & Related papers (2023-07-04T19:57:54Z)
- Human-Guided Fair Classification for Natural Language Processing [9.652938946631735]
We show how to leverage unsupervised style transfer and GPT-3's zero-shot capabilities to generate semantically similar sentences that differ along sensitive attributes.
We validate the generated pairs via an extensive crowdsourcing study, which confirms that many of these pairs align with human intuition about fairness in the context of toxicity classification.
arXiv Detail & Related papers (2022-12-20T10:46:40Z)
- Towards Intrinsic Common Discriminative Features Learning for Face Forgery Detection using Adversarial Learning [59.548960057358435]
We propose a novel method which utilizes adversarial learning to eliminate the negative effect of different forgery methods and facial identities.
Our face forgery detection model learns to extract common discriminative features through eliminating the effect of forgery methods and facial identities.
arXiv Detail & Related papers (2022-07-08T09:23:59Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- Necessity and Sufficiency for Explaining Text Classifiers: A Case Study in Hate Speech Detection [7.022948483613112]
We present a novel feature attribution method for explaining text classifiers, and analyze it in the context of hate speech detection.
We provide two complementary and theoretically-grounded scores -- necessity and sufficiency -- resulting in more informative explanations.
We employ our method to explain the predictions of different hate speech detection models on the same set of curated examples from a test suite, and show that different values of necessity and sufficiency for identity terms correspond to different kinds of false positive errors.
arXiv Detail & Related papers (2022-05-06T15:34:48Z)
- Measuring Fairness of Text Classifiers via Prediction Sensitivity [63.56554964580627]
ACCUMULATED PREDICTION SENSITIVITY measures fairness in machine learning models based on the model's prediction sensitivity to perturbations in input features.
We show that the metric can be theoretically linked with a specific notion of group fairness (statistical parity) and individual fairness.
arXiv Detail & Related papers (2022-03-16T15:00:33Z)
- Discriminative Attribution from Counterfactuals [64.94009515033984]
We present a method for neural network interpretability by combining feature attribution with counterfactual explanations.
We show that this method can be used to quantitatively evaluate the performance of feature attribution methods in an objective manner.
arXiv Detail & Related papers (2021-09-28T00:53:34Z)
- Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z)
- On the Effects of Knowledge-Augmented Data in Word Embeddings [0.6749750044497732]
We propose a novel approach for linguistic knowledge injection through data augmentation to learn word embeddings.
We show our knowledge augmentation approach improves the intrinsic characteristics of the learned embeddings while not significantly altering their results on a downstream text classification task.
arXiv Detail & Related papers (2020-10-05T02:14:13Z)
- Fairness by Learning Orthogonal Disentangled Representations [50.82638766862974]
We propose a novel disentanglement approach to the invariant representation problem.
We enforce the meaningful representation to be agnostic to sensitive information via an entropy constraint.
The proposed approach is evaluated on five publicly available datasets.
arXiv Detail & Related papers (2020-03-12T11:09:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.