Shielded Representations: Protecting Sensitive Attributes Through
Iterative Gradient-Based Projection
- URL: http://arxiv.org/abs/2305.10204v1
- Date: Wed, 17 May 2023 13:26:57 GMT
- Title: Shielded Representations: Protecting Sensitive Attributes Through
Iterative Gradient-Based Projection
- Authors: Shadi Iskander, Kira Radinsky, Yonatan Belinkov
- Abstract summary: Iterative Gradient-Based Projection (IGBP) is a novel method for removing non-linearly encoded concepts from neural representations.
Our results demonstrate that IGBP is effective in mitigating bias through intrinsic and extrinsic evaluations.
- Score: 39.16319169760823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language processing models tend to learn and encode social biases
present in the data. One popular approach for addressing such biases is to
eliminate encoded information from the model's representations. However,
current methods are restricted to removing only linearly encoded information.
In this work, we propose Iterative Gradient-Based Projection (IGBP), a novel
method for removing non-linearly encoded concepts from neural representations.
Our method consists of iteratively training neural classifiers to predict a
particular attribute we seek to eliminate, followed by a projection of the
representation onto a hypersurface, such that the classifiers become oblivious to
the target attribute. We evaluate the effectiveness of our method on the task
of removing gender and race information as sensitive attributes. Our results
demonstrate that IGBP is effective in mitigating bias through intrinsic and
extrinsic evaluations, with minimal impact on downstream task accuracy.
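A minimal sketch of the IGBP loop as described in the abstract, assuming a PyTorch setup; the inputs `reps` (an N x d tensor of representations) and `labels` (an (N,) long tensor of protected-attribute labels) are hypothetical, and the gradient-ascent projection below is a simplified stand-in for the paper's exact hypersurface projection, not the authors' implementation:
```python
# Sketch of Iterative Gradient-Based Projection (IGBP): repeatedly train an
# attribute classifier, then push each representation toward its decision
# hypersurface so that the classifier becomes oblivious to the attribute.
import torch
import torch.nn as nn

def train_attribute_clf(reps, labels, epochs=50):
    """Train a small MLP to predict the protected attribute."""
    clf = nn.Sequential(nn.Linear(reps.size(1), 128), nn.ReLU(),
                        nn.Linear(128, int(labels.max()) + 1))
    opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(clf(reps), labels).backward()
        opt.step()
    return clf

def project_to_hypersurface(reps, labels, clf, steps=10, step_size=0.5):
    """Gradient ascent on the classifier's loss moves each representation
    toward the decision boundary, erasing the signal the classifier uses."""
    x = reps.clone()
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        grad, = torch.autograd.grad(loss_fn(clf(x), labels), x)
        x = x + step_size * grad
    return x.detach()

def igbp(reps, labels, iterations=20):
    """Alternate classifier training and projection until the attribute is
    no longer recoverable, even non-linearly."""
    for _ in range(iterations):
        clf = train_attribute_clf(reps, labels)
        reps = project_to_hypersurface(reps, labels, clf)
    return reps
```
In practice one would stop iterating once a freshly trained probe's accuracy on the protected attribute drops to the majority-class baseline.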
Related papers
- TaCo: Targeted Concept Erasure Prevents Non-Linear Classifiers From Detecting Protected Attributes [4.2560452339165895]
Targeted Concept Erasure (TaCo) is a novel approach that removes sensitive information from final latent representations.
Our experiments show that TaCo outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-12-11T16:22:37Z)
- XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners [71.8257151788923]
We propose a novel Explainable Active Learning framework (XAL) for low-resource text classification.
XAL encourages classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations.
Experiments on six datasets show that XAL achieves consistent improvements over nine strong baselines.
arXiv Detail & Related papers (2023-10-09T08:07:04Z)
- Pixel-wise Gradient Uncertainty for Convolutional Neural Networks applied to Out-of-Distribution Segmentation [0.43512163406552007]
We present a method for obtaining uncertainty scores from pixel-wise loss gradients which can be computed efficiently during inference.
Our experiments show the ability of our method to identify wrong pixel classifications and to estimate prediction quality at negligible computational overhead.
arXiv Detail & Related papers (2023-03-13T08:37:59Z)
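A rough sketch of the gradient-based uncertainty score from the entry above, assuming the gradient is taken with respect to the pixel-wise logits using the predicted class as a self-supervised label (a simplification of the paper's last-layer formulation); `logits` is a hypothetical (C, H, W) segmentation output:
```python
# Per-pixel uncertainty from loss gradients, in closed form: for cross-entropy,
# d CE / d logits = softmax(logits) - onehot(predicted class), so the per-pixel
# gradient norm needs no backward pass. Illustrative sketch only.
import torch
import torch.nn.functional as F

def pixel_gradient_uncertainty(logits):
    """logits: (C, H, W) raw class scores from a segmentation network."""
    probs = F.softmax(logits, dim=0)                      # (C, H, W)
    pred = probs.argmax(dim=0)                            # (H, W)
    onehot = F.one_hot(pred, num_classes=logits.size(0))  # (H, W, C)
    grad = probs - onehot.permute(2, 0, 1).float()        # (C, H, W)
    return grad.norm(dim=0)                               # (H, W) score map

# Example: high scores flag pixels whose prediction is likely wrong.
scores = pixel_gradient_uncertainty(torch.randn(19, 64, 64))
```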
- Linear Adversarial Concept Erasure [108.37226654006153]
We formulate the problem of identifying and erasing a linear subspace that corresponds to a given concept.
We show that the method is highly expressive, effectively mitigating bias in deep nonlinear classifiers while maintaining tractability and interpretability.
arXiv Detail & Related papers (2022-01-28T13:00:17Z)
- Learning Debiased and Disentangled Representations for Semantic Segmentation [52.35766945827972]
We propose a model-agnostic training scheme for semantic segmentation.
By randomly eliminating certain class information in each training iteration, we effectively reduce feature dependencies among classes.
Models trained with our approach demonstrate strong results on multiple semantic segmentation benchmarks.
arXiv Detail & Related papers (2021-10-31T16:15:09Z)
- Fairness via Representation Neutralization [60.90373932844308]
We propose a new mitigation technique, Representation Neutralization for Fairness (RNF).
RNF achieves fairness by debiasing only the task-specific classification head of DNN models.
Experimental results on several benchmark datasets demonstrate that the RNF framework effectively reduces discrimination in DNN models.
arXiv Detail & Related papers (2021-06-23T22:26:29Z)
- Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection [51.041763676948705]
Iterative Null-space Projection (INLP) is a novel method for removing information from neural representations.
We show that our method is able to mitigate bias in word embeddings, as well as to increase fairness in a setting of multi-class classification.
arXiv Detail & Related papers (2020-04-16T14:02:50Z)
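For contrast with IGBP's non-linear erasure, a compact sketch of INLP's linear nullspace projection, assuming numpy and scikit-learn; `X` (N x d representations) and `y` (protected-attribute labels) are hypothetical inputs:
```python
# Sketch of Iterative Nullspace Projection (INLP): repeatedly fit a linear
# probe for the protected attribute and project representations onto the
# nullspace of its weight matrix, removing the linearly encoded signal.
import numpy as np
from sklearn.linear_model import LogisticRegression

def nullspace_projection(W):
    """I - P_rowspace(W): removes the components along the probe directions."""
    _, s, Vt = np.linalg.svd(W, full_matrices=False)
    B = Vt[s > 1e-10]                    # orthonormal basis of the rowspace
    return np.eye(W.shape[1]) - B.T @ B  # (d, d) projection matrix

def inlp(X, y, iterations=20):
    """Returns debiased representations after repeated linear erasure."""
    for _ in range(iterations):
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        X = X @ nullspace_projection(clf.coef_)
    return X
```
IGBP (the main paper above) generalizes this idea from linear probes to non-linear classifiers, replacing the nullspace projection with a gradient-based projection onto the classifier's decision hypersurface.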