Related papers: TaCo: Targeted Concept Removal in Output Embeddings for NLP via Information Theory and Explainability

TaCo: Targeted Concept Removal in Output Embeddings for NLP via Information Theory and Explainability

URL: http://arxiv.org/abs/2312.06499v3
Date: Fri, 12 Apr 2024 15:50:14 GMT
Title: TaCo: Targeted Concept Removal in Output Embeddings for NLP via Information Theory and Explainability
Authors: Fanny Jourdan, Louis Béthune, Agustin Picard, Laurent Risser, Nicholas Asher,
Abstract summary: Information theory indicates that a model should not be able to predict sensitive variables, such as gender, ethnicity, and age. We present a novel approach that operates at the embedding level of an NLP model. We show that the proposed post-hoc approach significantly reduces gender-related associations in NLP models.
Score: 4.2560452339165895
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The fairness of Natural Language Processing (NLP) models has emerged as a crucial concern. Information theory indicates that to achieve fairness, a model should not be able to predict sensitive variables, such as gender, ethnicity, and age. However, information related to these variables often appears implicitly in language, posing a challenge in identifying and mitigating biases effectively. To tackle this issue, we present a novel approach that operates at the embedding level of an NLP model, independent of the specific architecture. Our method leverages insights from recent advances in XAI techniques and employs an embedding transformation to eliminate implicit information from a selected variable. By directly manipulating the embeddings in the final layer, our approach enables a seamless integration into existing models without requiring significant modifications or retraining. In evaluation, we show that the proposed post-hoc approach significantly reduces gender-related associations in NLP models while preserving the overall performance and functionality of the models. An implementation of our method is available: https://github.com/fanny-jourdan/TaCo

Related papers

Nonlinear Concept Erasure: a Density Matching Approach [0.0]
We propose a process that removes information related to a specific concept from distributed representations while preserving as much of the remaining semantic information as possible.<n>Our approach involves learning an projection in the embedding space, designed to make the class-conditional feature distributions of the discrete concept to erase indistinguishable after projection.<n>Our method, termed $overlinemathrmL$EOPARD, achieves state-of-the-art performance in nonlinear erasure of a discrete attribute on classic natural language processing benchmarks.
arXiv Detail & Related papers (2025-07-16T15:36:15Z)
Fairness-Aware Low-Rank Adaptation Under Demographic Privacy Constraints [4.647881572951815]
Pre-trained foundation models can be adapted for specific tasks using Low-Rank Adaptation (LoRA) Existing fairness-aware fine-tuning methods rely on direct access to sensitive attributes or their predictors. We introduce a set of LoRA-based fine-tuning methods that can be trained in a distributed fashion.
arXiv Detail & Related papers (2025-03-07T18:49:57Z)
Adversarial Purification by Consistency-aware Latent Space Optimization on Data Manifolds [48.37843602248313]
Deep neural networks (DNNs) are vulnerable to adversarial samples crafted by adding imperceptible perturbations to clean data, potentially leading to incorrect and dangerous predictions. We propose Consistency Model-based Adversarial Purification (CMAP), which optimize vectors within the latent space of a pre-trained consistency model to generate samples for restoring clean data. CMAP significantly enhances robustness against strong adversarial attacks while preserving high natural accuracy.
arXiv Detail & Related papers (2024-12-11T14:14:02Z)
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models [76.39651111467832]
We introduce Reliable and Efficient Concept Erasure (RECE), a novel approach that modifies the model in 3 seconds without necessitating additional fine-tuning. To mitigate inappropriate content potentially represented by derived embeddings, RECE aligns them with harmless concepts in cross-attention layers. The derivation and erasure of new representation embeddings are conducted iteratively to achieve a thorough erasure of inappropriate concepts.
arXiv Detail & Related papers (2024-07-17T08:04:28Z)
Nonlinear Transformations Against Unlearnable Datasets [4.876873339297269]
Automated scraping stands out as a common method for collecting data in deep learning models without the authorization of data owners. Recent studies have begun to tackle the privacy concerns associated with this data collection method. The data generated by those approaches, called "unlearnable" examples, are prevented "learning" by deep learning models.
arXiv Detail & Related papers (2024-06-05T03:00:47Z)
Latent Enhancing AutoEncoder for Occluded Image Classification [2.6217304977339473]
We introduce LEARN: Latent Enhancing feAture Reconstruction Network. An auto-encoder based network that can be incorporated into the classification model before its head. On the OccludedPASCAL3D+ dataset, the proposed LEARN outperforms standard classification models.
arXiv Detail & Related papers (2024-02-10T12:22:31Z)
Credible Teacher for Semi-Supervised Object Detection in Open Scene [106.25850299007674]
In Open Scene Semi-Supervised Object Detection (O-SSOD), unlabeled data may contain unknown objects not observed in the labeled data. It is detrimental to the current methods that mainly rely on self-training, as more uncertainty leads to the lower localization and classification precision of pseudo labels. We propose Credible Teacher, an end-to-end framework to prevent uncertain pseudo labels from misleading the model.
arXiv Detail & Related papers (2024-01-01T08:19:21Z)
XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners [71.8257151788923]
We propose a novel Explainable Active Learning framework (XAL) for low-resource text classification. XAL encourages classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations. Experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines.
arXiv Detail & Related papers (2023-10-09T08:07:04Z)
Model Debiasing via Gradient-based Explanation on Representation [14.673988027271388]
We propose a novel fairness framework that performs debiasing with regard to sensitive attributes and proxy attributes. Our framework achieves better fairness-accuracy trade-off on unstructured and structured datasets than previous state-of-the-art approaches.
arXiv Detail & Related papers (2023-05-20T11:57:57Z)
Shielded Representations: Protecting Sensitive Attributes Through Iterative Gradient-Based Projection [39.16319169760823]
Iterative Gradient-Based Projection is a novel method for removing non-linear encoded concepts from neural representations. Our results demonstrate that IGBP is effective in mitigating bias through intrinsic and extrinsic evaluations.
arXiv Detail & Related papers (2023-05-17T13:26:57Z)
Linear Adversarial Concept Erasure [108.37226654006153]
We formulate the problem of identifying and erasing a linear subspace that corresponds to a given concept. We show that the method is highly expressive, effectively mitigating bias in deep nonlinear classifiers while maintaining tractability and interpretability.
arXiv Detail & Related papers (2022-01-28T13:00:17Z)
Fairness via Representation Neutralization [60.90373932844308]
We propose a new mitigation technique, namely, Representation Neutralization for Fairness (RNF) RNF achieves that fairness by debiasing only the task-specific classification head of DNN models. Experimental results over several benchmark datasets demonstrate our RNF framework to effectively reduce discrimination of DNN models.
arXiv Detail & Related papers (2021-06-23T22:26:29Z)
Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection [51.041763676948705]
Iterative Null-space Projection (INLP) is a novel method for removing information from neural representations. We show that our method is able to mitigate bias in word embeddings, as well as to increase fairness in a setting of multi-class classification.
arXiv Detail & Related papers (2020-04-16T14:02:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.