Affinity Clustering Framework for Data Debiasing Using Pairwise
Distribution Discrepancy
- URL: http://arxiv.org/abs/2306.01699v1
- Date: Fri, 2 Jun 2023 17:18:20 GMT
- Title: Affinity Clustering Framework for Data Debiasing Using Pairwise
Distribution Discrepancy
- Authors: Siamak Ghodsi and Eirini Ntoutsi
- Abstract summary: Group imbalance, resulting from inadequate or unrepresentative data collection methods, is a primary cause of representation bias in datasets.
This paper presents MASC, a data augmentation approach that leverages affinity clustering to balance the representation of non-protected and protected groups of a target dataset.
- Score: 10.184056098238765
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Group imbalance, resulting from inadequate or unrepresentative data
collection methods, is a primary cause of representation bias in datasets.
Representation bias can exist with respect to different groups of one or more
protected attributes and might lead to prejudicial and discriminatory outcomes
toward certain groups of individuals when a learning model is trained on such
biased data. This paper presents MASC, a data augmentation approach that
leverages affinity clustering to balance the representation of non-protected
and protected groups of a target dataset: it borrows instances of the protected
attribute from similar datasets that are categorized in the same cluster as the
target dataset. The proposed method involves constructing an affinity matrix by
quantifying distribution discrepancies between dataset pairs and transforming
them into a symmetric pairwise similarity matrix. A non-parametric spectral
clustering is then applied to this affinity matrix, automatically categorizing
the datasets into an optimal number of clusters. We perform a step-by-step
experiment to demonstrate the procedure of the proposed data augmentation
method and to evaluate and discuss its performance. We also conduct a
comparison with other data augmentation methods, both pre- and
post-augmentation, along with a model evaluation analysis of each method. Our
method can handle non-binary protected attributes; accordingly, in our
experiments, bias is measured in a non-binary protected-attribute setup w.r.t.
the distribution of racial groups, for two separate minority groups in
comparison with the majority group, before and after debiasing. Empirical
results suggest that augmenting biased datasets with real (genuine) data from
similar contexts can effectively debias the target datasets, comparably to
existing data augmentation strategies.
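As a rough illustration of the pipeline described in the abstract, the sketch below builds a pairwise discrepancy matrix between datasets, converts it into a symmetric affinity matrix, clusters the datasets with spectral clustering, and augments the protected group of a target dataset with instances borrowed from same-cluster datasets. It is a minimal sketch, not the authors' implementation: the per-feature 1-D Wasserstein discrepancy, the RBF conversion, the fixed number of clusters (the paper selects it automatically), and the helper names are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.cluster import SpectralClustering


def discrepancy_matrix(datasets):
    """Pairwise distribution discrepancy between datasets, approximated here by
    the mean per-feature 1-D Wasserstein distance (illustrative choice)."""
    n = len(datasets)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            per_feature = [
                wasserstein_distance(datasets[i][:, f], datasets[j][:, f])
                for f in range(datasets[i].shape[1])
            ]
            D[i, j] = D[j, i] = float(np.mean(per_feature))
    return D


def cluster_datasets(datasets, n_clusters=3, gamma=1.0):
    """Transform discrepancies into a symmetric affinity matrix and apply
    spectral clustering (a fixed n_clusters is used here for brevity)."""
    D = discrepancy_matrix(datasets)
    affinity = np.exp(-gamma * D)  # small discrepancy -> high similarity
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed", random_state=0
    ).fit_predict(affinity)
    return labels


def augment_protected_group(target_X, target_attr, donors, protected_value):
    """Borrow rows with the under-represented protected-attribute value from
    'donor' datasets assigned to the same cluster as the target (hypothetical helper)."""
    borrowed_X = [X[attr == protected_value] for X, attr in donors]
    borrowed_a = [attr[attr == protected_value] for _, attr in donors]
    return (np.vstack([target_X] + borrowed_X),
            np.concatenate([target_attr] + borrowed_a))
```

In this sketch the discrepancy measure and the affinity transform are the main design choices; any symmetric, non-negative similarity matrix can be passed to spectral clustering via affinity="precomputed".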
Related papers
- A structured regression approach for evaluating model performance across intersectional subgroups [53.91682617836498]
Disaggregated evaluation is a central task in AI fairness assessment, where the goal is to measure an AI system's performance across different subgroups.
We introduce a structured regression approach to disaggregated evaluation that we demonstrate can yield reliable system performance estimates even for very small subgroups.
arXiv Detail & Related papers (2024-01-26T14:21:45Z)
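To make the idea of a structured, strength-borrowing estimate concrete, here is a simplified sketch that contrasts naive per-subgroup accuracy with estimates from a main-effects logistic regression of per-example correctness on the protected attributes; the paper's structured regression approach is substantially more refined, and the data layout below is assumed.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression


def subgroup_accuracy_estimates(correct, attrs):
    """Estimate per-subgroup accuracy two ways:
    (1) naive disaggregated means, which are noisy for tiny subgroups;
    (2) predictions of a main-effects logistic regression on the attributes,
        which shares information across intersectional subgroups.
    `correct` is a 0/1 array of per-example correctness; `attrs` is a DataFrame
    of protected attributes (hypothetical columns, e.g. race and gender)."""
    X = pd.get_dummies(attrs.astype(str))  # one-hot main effects only
    model = LogisticRegression(max_iter=1000).fit(X, correct)

    df = attrs.copy()
    df["correct"] = correct
    df["smoothed"] = model.predict_proba(X)[:, 1]
    return df.groupby(list(attrs.columns)).agg(
        naive=("correct", "mean"),
        structured=("smoothed", "mean"),
        n=("correct", "size"),
    )
```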
- GroupMixNorm Layer for Learning Fair Models [4.324785083027206]
This research proposes a novel in-processing-based GroupMixNorm layer for mitigating bias from deep learning models.
The proposed method improves upon several fairness metrics with minimal impact on overall accuracy.
arXiv Detail & Related papers (2023-12-19T09:04:26Z)
- Group-blind optimal transport to group parity and its constrained variants [6.70948761466883]
We design a single group-blind projection map that aligns the feature distributions of both groups in the source data.
We assume that the source data are an unbiased representation of the population.
We present numerical results on synthetic data and real data.
arXiv Detail & Related papers (2023-10-17T17:14:07Z)
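For intuition about what aligning the two groups' feature distributions means, the sketch below implements the classic group-aware quantile repair for a single feature, mapping both groups onto their 1-D Wasserstein barycenter. This baseline uses group labels to build the map; the paper's contribution is achieving such alignment with a single group-blind projection, which is not reproduced here.

```python
import numpy as np


def quantile_repair(x_a, x_b, n_quantiles=101):
    """Map two groups' 1-D feature values onto a common target distribution:
    the quantile-wise midpoint, i.e. the 1-D Wasserstein barycenter of the two
    empirical distributions. Group-aware baseline, shown for illustration only."""
    q = np.linspace(0.0, 1.0, n_quantiles)
    target = 0.5 * (np.quantile(x_a, q) + np.quantile(x_b, q))
    # send every point to the target value at its within-group quantile rank
    rank_a = np.searchsorted(np.sort(x_a), x_a, side="right") / len(x_a)
    rank_b = np.searchsorted(np.sort(x_b), x_b, side="right") / len(x_b)
    return np.interp(rank_a, q, target), np.interp(rank_b, q, target)
```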
- Leveraging Structure for Improved Classification of Grouped Biased Data [8.121462458089143]
We consider semi-supervised binary classification for applications in which data points are naturally grouped.
We derive a semi-supervised algorithm that explicitly leverages the structure to learn an optimal, group-aware, probability-output classifier.
arXiv Detail & Related papers (2022-12-07T15:18:21Z)
- Inv-SENnet: Invariant Self Expression Network for clustering under biased data [17.25929452126843]
We propose a novel framework for jointly removing unwanted attributes (biases) while learning to cluster data points in individual subspaces.
Our experimental results on synthetic and real-world datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-11-13T01:19:06Z)
- Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data [81.43750358586072]
We propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes.
We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets.
arXiv Detail & Related papers (2022-10-24T08:57:55Z)
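A simplified sketch of outcome-based stratification from training dynamics: record each example's predicted probability of its true label across checkpoints, then bucket examples by mean confidence and its variability. The thresholds and metrics below are assumptions; Data-IQ's actual criteria (including its aleatoric-uncertainty component) differ.

```python
import numpy as np


def stratify_examples(confidences, low=0.25, high=0.75):
    """Bucket examples into easy / hard / ambiguous subgroups from their
    training dynamics. `confidences` has shape (n_checkpoints, n_examples) and
    holds each example's predicted probability of its true label at every
    checkpoint. Thresholds are illustrative only."""
    mean_conf = confidences.mean(axis=0)
    variability = confidences.std(axis=0)
    labels = np.full(mean_conf.shape, "ambiguous", dtype=object)
    labels[(mean_conf >= high) & (variability < 0.1)] = "easy"
    labels[(mean_conf <= low) & (variability < 0.1)] = "hard"
    return labels
```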
- Towards Group Robustness in the presence of Partial Group Labels [61.33713547766866]
Spurious correlations between input samples and the target labels wrongly direct the neural network predictions.
We propose an algorithm that optimizes for the worst-off group assignments from a constraint set.
We show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
arXiv Detail & Related papers (2022-01-10T22:04:48Z)
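A minimal sketch of worst-group-aware reweighting in the spirit of group DRO: groups with higher average loss receive exponentially larger weight, steering the optimizer toward the worst-off group. The paper's algorithm additionally optimizes over plausible group assignments drawn from a constraint set, since group labels are only partially available; that step is not reproduced here.

```python
import numpy as np


def worst_group_weights(losses, groups, weights, step_size=0.1):
    """One update of group-DRO-style weights: groups with higher average loss
    receive exponentially larger weight. `losses` are per-example losses,
    `groups` their group ids, `weights` the current per-group weights."""
    group_ids = np.unique(groups)
    group_loss = np.array([losses[groups == g].mean() for g in group_ids])
    new_w = weights * np.exp(step_size * group_loss)  # exponentiated-gradient ascent
    new_w /= new_w.sum()
    # per-example weights to plug into a weighted training objective
    example_w = new_w[np.searchsorted(group_ids, groups)]
    return new_w, example_w
```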
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
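The two balancing strategies mentioned above, shown with the imbalanced-learn library on a synthetic 9:1 problem. The paper's contribution, recommending which strategy to use from dataset properties, is not shown; this only illustrates the resampling primitives themselves.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# A synthetic 9:1 imbalanced binary problem, for illustration only.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print("original:", Counter(y))

# Oversampling: synthesise new minority-class examples (SMOTE).
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)
print("after SMOTE:", Counter(y_over))

# Undersampling: randomly drop majority-class examples.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("after undersampling:", Counter(y_under))
```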
- Auditing for Diversity using Representative Examples [17.016881905579044]
We propose a cost-effective approach to approximate the disparity of a given unlabeled dataset.
Our proposed algorithm uses the pairwise similarity between elements in the dataset and elements in the control set to effectively bootstrap an approximation.
We show that using a control set whose size is much smaller than the size of the dataset is sufficient to achieve a small approximation error.
arXiv Detail & Related papers (2021-07-15T15:21:17Z)
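A simplified sketch of the control-set idea: approximate the group proportions of an unlabeled dataset by letting each element vote for the group of its most similar control example. The paper's estimator and its approximation-error guarantees are more refined; the similarity choice and voting rule here are assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def estimate_group_proportions(X_unlabeled, X_control, control_groups):
    """Approximate group proportions of an unlabeled dataset using a small
    labeled control set: each element is assigned the group of its most
    similar control example. `control_groups` is an array of group labels,
    one per control-set row."""
    sim = cosine_similarity(X_unlabeled, X_control)  # (n, m) similarities
    nearest = sim.argmax(axis=1)                     # closest control element
    assigned = np.asarray(control_groups)[nearest]
    groups, counts = np.unique(assigned, return_counts=True)
    return dict(zip(groups, counts / len(assigned)))
```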
- Contrastive Clustering [57.71729650297379]
We propose Contrastive Clustering (CC), which explicitly performs instance- and cluster-level contrastive learning.
In particular, CC achieves an NMI of 0.705 (0.431) on the CIFAR-10 (CIFAR-100) dataset, which is an up to 19% (39%) performance improvement compared with the best baseline.
arXiv Detail & Related papers (2020-09-21T08:54:40Z)
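A compact PyTorch sketch of the two losses: an instance-level NT-Xent loss over projected features of two augmented views, and a cluster-level NT-Xent loss over the columns of their soft cluster assignments, plus an entropy term that discourages collapse to a single cluster. Backbone, augmentations and temperatures are omitted or assumed.

```python
import torch
import torch.nn.functional as F


def nt_xent(a, b, temperature):
    """NT-Xent loss: a[i] and b[i] form the positive pair; every other row of
    the concatenated batch acts as a negative."""
    n = a.shape[0]
    z = torch.cat([a, b], dim=0)                    # (2n, d), rows L2-normalised
    sim = z @ z.t() / temperature                   # pairwise cosine similarities
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))      # exclude self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)


def contrastive_clustering_loss(z1, z2, p1, p2, tau_inst=0.5, tau_clu=1.0):
    """Instance-level loss on projected features of two views (z1, z2) and
    cluster-level loss on the columns of their soft assignments (p1, p2),
    plus an entropy term that discourages a single-cluster solution."""
    inst = nt_xent(F.normalize(z1, dim=1), F.normalize(z2, dim=1), tau_inst)
    clus = nt_xent(F.normalize(p1.t(), dim=1), F.normalize(p2.t(), dim=1), tau_clu)
    marginal = p1.mean(dim=0)                       # batch-level cluster usage
    entropy = -(marginal * (marginal + 1e-8).log()).sum()
    return inst + clus - entropy
```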
- Clustering Binary Data by Application of Combinatorial Optimization Heuristics [52.77024349608834]
We study clustering methods for binary data, first defining aggregation criteria that measure the compactness of clusters.
Five new and original methods are introduced, using neighborhoods and population behavior optimization metaheuristics.
From a set of 16 data tables generated by a quasi-Monte Carlo experiment, a comparison is performed for one of the aggregation criteria using L1 dissimilarity, against hierarchical clustering and a version of k-means, partitioning around medoids (PAM).
arXiv Detail & Related papers (2020-01-06T23:33:31Z)
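A short sketch of the classical side of that comparison: hierarchical (agglomerative) clustering of binary rows under L1 (cityblock, here equal to Hamming) dissimilarity with SciPy. The paper's neighborhood and population-based metaheuristics and the PAM baseline are not reproduced.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 12))            # toy binary data table

# L1 (cityblock) dissimilarity between binary rows equals the Hamming count.
d = pdist(X, metric="cityblock")
Z = linkage(d, method="average")                 # agglomerative clustering
labels = fcluster(Z, t=4, criterion="maxclust")  # cut the dendrogram into 4 clusters
print(np.bincount(labels)[1:])                   # cluster sizes
```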