Human-Guided Fair Classification for Natural Language Processing
- URL: http://arxiv.org/abs/2212.10154v1
- Date: Tue, 20 Dec 2022 10:46:40 GMT
- Title: Human-Guided Fair Classification for Natural Language Processing
- Authors: Florian E. Dorner, Momchil Peychev, Nikola Konstantinov, Naman Goel,
Elliott Ash, Martin Vechev
- Abstract summary: We show how to leverage unsupervised style transfer and GPT-3's zero-shot capabilities to generate semantically similar sentences that differ along sensitive attributes.
We validate the generated pairs via an extensive crowdsourcing study, which confirms that many of these pairs align with human intuition about fairness in the context of toxicity classification.
- Score: 9.652938946631735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text classifiers have promising applications in high-stakes tasks such as
resume screening and content moderation. These classifiers must be fair and
avoid discriminatory decisions by being invariant to perturbations of sensitive
attributes such as gender or ethnicity. However, there is a gap between human
intuition about these perturbations and the formal similarity specifications
capturing them. While existing research has started to address this gap,
current methods are based on hardcoded word replacements, resulting in
specifications with limited expressivity or ones that fail to fully align with
human intuition (e.g., in cases of asymmetric counterfactuals). This work
proposes novel methods for bridging this gap by discovering expressive and
intuitive individual fairness specifications. We show how to leverage
unsupervised style transfer and GPT-3's zero-shot capabilities to automatically
generate expressive candidate pairs of semantically similar sentences that
differ along sensitive attributes. We then validate the generated pairs via an
extensive crowdsourcing study, which confirms that many of these pairs align
with human intuition about fairness in the context of toxicity classification.
Finally, we show how limited amounts of human feedback can be leveraged to
learn a similarity specification that can be used to train downstream
fairness-aware models.
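To make the pair-generation step more concrete, below is a minimal sketch of the GPT-3 zero-shot route described in the abstract: prompt a completion model to rewrite a sentence along a sensitive attribute, then check whether a toxicity classifier scores the two sentences nearly the same. The prompt wording, helper names, and tolerance are illustrative assumptions, not the authors' implementation, and the unsupervised style-transfer route is not covered here.

```python
# Sketch: generate a candidate sentence pair differing along a sensitive
# attribute via zero-shot prompting, then test classifier invariance.
# Prompt template, helpers, and threshold are assumptions for illustration.

from typing import Callable, Tuple

PROMPT_TEMPLATE = (
    "Rewrite the following sentence so that it refers to a {target} person, "
    "changing nothing else about its meaning or tone.\n\n"
    "Sentence: {sentence}\n"
    "Rewritten sentence:"
)


def generate_candidate_pair(
    sentence: str,
    target_attribute: str,
    complete: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (original, rewritten); `complete` stands in for a GPT-3 call."""
    prompt = PROMPT_TEMPLATE.format(target=target_attribute, sentence=sentence)
    return sentence, complete(prompt).strip()


def is_invariant(
    pair: Tuple[str, str],
    toxicity_score: Callable[[str], float],
    tolerance: float = 0.05,
) -> bool:
    """Individual-fairness check: the two scores should be within `tolerance`."""
    return abs(toxicity_score(pair[0]) - toxicity_score(pair[1])) <= tolerance


if __name__ == "__main__":
    # Stand-in backends so the sketch runs without any external service.
    def fake_complete(prompt: str) -> str:
        return "She is a brilliant engineer."

    def fake_toxicity(text: str) -> float:
        return 0.02

    pair = generate_candidate_pair(
        "He is a brilliant engineer.", "female", fake_complete
    )
    print(pair, is_invariant(pair, fake_toxicity))
```

In the paper's pipeline, candidate pairs like these are additionally validated through the crowdsourcing study before human feedback is used to fit the similarity specification.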
Related papers
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually grounded text perturbation methods such as typos and word-order shuffling, which resonate with human cognitive patterns and allow perturbations to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z) - Fair Text Classification with Wasserstein Independence [4.887319013701134]
Group fairness is a central research topic in text classification, where achieving fair treatment across sensitive groups remains an open challenge.
This paper presents a novel method for mitigating biases in neural text classification, agnostic to the model architecture.
arXiv Detail & Related papers (2023-11-21T15:51:06Z) - Improving Fairness using Vision-Language Driven Image Augmentation [60.428157003498995]
Fairness is crucial when training a deep-learning discriminative model, especially in the facial domain.
Models tend to correlate specific characteristics (such as age and skin color) with unrelated attributes (downstream tasks).
This paper proposes a method to mitigate these correlations to improve fairness.
arXiv Detail & Related papers (2023-11-02T19:51:10Z) - Counterfactual Reasoning for Bias Evaluation and Detection in a Fairness
under Unawareness setting [6.004889078682389]
Current AI regulations require discarding sensitive features in the algorithm's decision-making process to prevent unfair outcomes.
We propose a way to reveal the potential hidden bias of a machine learning model that can persist even when sensitive features are discarded.
arXiv Detail & Related papers (2023-02-16T10:36:18Z) - Towards Procedural Fairness: Uncovering Biases in How a Toxic Language
Classifier Uses Sentiment Information [7.022948483613112]
This work is a step towards evaluating procedural fairness, where unfair processes lead to unfair outcomes.
The produced knowledge can guide debiasing techniques to ensure that important concepts besides identity terms are well-represented in training datasets.
arXiv Detail & Related papers (2022-10-19T16:03:25Z) - Measuring Fairness of Text Classifiers via Prediction Sensitivity [63.56554964580627]
ACCUMULATED PREDICTION SENSITIVITY measures fairness in machine learning models based on the model's prediction sensitivity to perturbations in input features.
We show that the metric can be theoretically linked with a specific notion of group fairness (statistical parity) and individual fairness.
arXiv Detail & Related papers (2022-03-16T15:00:33Z) - A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z) - Contrasting Human- and Machine-Generated Word-Level Adversarial Examples
for Text Classification [12.750016480098262]
We report on crowdsourcing studies in which we task humans with iteratively modifying words in an input text.
We analyze how human-generated adversarial examples compare to the recently proposed TextFooler, Genetic, BAE and SememePSO attack algorithms.
arXiv Detail & Related papers (2021-09-09T16:16:04Z) - Mitigating Biases in Toxic Language Detection through Invariant
Rationalization [70.36701068616367]
Biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection.
We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns.
Our method yields a lower false positive rate on both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z) - Towards classification parity across cohorts [16.21248370949611]
This work aims to achieve classification parity across explicit as well as implicit sensitive features.
We obtain implicit cohorts by clustering per-individual embeddings learned with a language model from the text each individual generates.
We improve classification parity by introducing a modification to the loss function that minimizes the range of model performance across cohorts (a minimal sketch of such a range penalty appears after this list).
arXiv Detail & Related papers (2020-05-16T16:31:08Z) - Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial
Perturbations [65.05561023880351]
Adversarial examples are malicious inputs crafted to induce misclassification.
This paper studies a complementary failure mode, invariance-based adversarial examples.
We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks.
arXiv Detail & Related papers (2020-02-11T18:50:23Z)
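As referenced in the "Towards classification parity across cohorts" entry above, here is a minimal sketch of a loss with a range penalty over per-cohort losses. The PyTorch framing, the cohort grouping, and the weight `lam` are assumptions for illustration, not that paper's released code.

```python
# Sketch: standard cross-entropy plus a term that shrinks the gap between the
# worst- and best-served cohort in the batch. Cohort ids and `lam` are
# illustrative assumptions.

import torch
import torch.nn.functional as F


def parity_range_loss(logits, labels, cohort_ids, lam=1.0):
    """Cross-entropy plus lam * (max per-cohort loss - min per-cohort loss)."""
    task_loss = F.cross_entropy(logits, labels)
    cohort_losses = torch.stack([
        F.cross_entropy(logits[cohort_ids == c], labels[cohort_ids == c])
        for c in torch.unique(cohort_ids)
    ])
    return task_loss + lam * (cohort_losses.max() - cohort_losses.min())
```

With `lam=0` this reduces to ordinary cross-entropy; larger values trade average accuracy for a smaller performance spread across cohorts.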
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.