TaCo: Targeted Concept Erasure Prevents Non-Linear Classifiers From Detecting Protected Attributes
- URL: http://arxiv.org/abs/2312.06499v4
- Date: Wed, 16 Oct 2024 08:53:23 GMT
- Title: TaCo: Targeted Concept Erasure Prevents Non-Linear Classifiers From Detecting Protected Attributes
- Authors: Fanny Jourdan, Louis Béthune, Agustin Picard, Laurent Risser, Nicholas Asher
- Abstract summary: Targeted Concept Erasure (TaCo) is a novel approach that removes sensitive information from final latent representations.
Our experiments show that TaCo outperforms state-of-the-art methods.
- Score: 4.2560452339165895
- License:
- Abstract: Ensuring fairness in NLP models is crucial, as they often encode sensitive attributes like gender and ethnicity, leading to biased outcomes. Current concept erasure methods attempt to mitigate this by modifying final latent representations to remove sensitive information without retraining the entire model. However, these methods typically rely on linear classifiers, which leaves models vulnerable to non-linear adversaries capable of recovering sensitive information. We introduce Targeted Concept Erasure (TaCo), a novel approach that removes sensitive information from final latent representations, ensuring fairness even against non-linear classifiers. Our experiments show that TaCo outperforms state-of-the-art methods, achieving greater reductions in the accuracy with which non-linear classifiers predict sensitive attributes while preserving overall task performance. Code is available at https://github.com/fanny-jourdan/TaCo.
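The abstract's point about linear erasure can be illustrated with a toy sketch. This is not TaCo and is not taken from the authors' repository. It is a generic mean-difference projection on synthetic data, and every name in it (`erase_direction`, the synthetic features, the leak strengths) is an assumption made for illustration. The sketch removes the one direction that linearly separates two groups, then checks that a simple non-linear statistic (the second moment of a feature) still differs between them.

```python
import numpy as np

def erase_direction(X, z):
    """Hypothetical helper: project out the mean-difference direction
    between the groups z == 1 and z == 0 (a simple linear erasure)."""
    d = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
    d = d / np.linalg.norm(d)
    # Remove each row's component along d.
    return X - np.outer(X @ d, d)

rng = np.random.default_rng(0)
n, dim = 1000, 8
z = rng.integers(0, 2, size=n)        # synthetic binary sensitive attribute
X = rng.normal(size=(n, dim))         # synthetic "latent representations"
X[:, 0] += 2.0 * z                    # linear leak: group means differ
X[:, 1] *= 1.0 + 2.0 * z              # non-linear leak: group variances differ

X_erased = erase_direction(X, z)

# Gap between group means (what a linear probe exploits).
mean_gap_before = np.linalg.norm(X[z == 1].mean(0) - X[z == 0].mean(0))
mean_gap_after = np.linalg.norm(
    X_erased[z == 1].mean(0) - X_erased[z == 0].mean(0)
)

# Gap between group second moments on feature 1 (a non-linear signal).
sq_gap_after = abs(
    (X_erased[z == 1, 1] ** 2).mean() - (X_erased[z == 0, 1] ** 2).mean()
)

print(f"mean gap before: {mean_gap_before:.3f}")
print(f"mean gap after:  {mean_gap_after:.2e}")
print(f"second-moment gap after erasure: {sq_gap_after:.3f}")
```

In this toy setting the mean gap collapses to numerical zero after projection, while the second-moment gap stays large, which is the kind of residual signal a non-linear adversary could exploit.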