Affinity Clustering Framework for Data Debiasing Using Pairwise
Distribution Discrepancy
- URL: http://arxiv.org/abs/2306.01699v1
- Date: Fri, 2 Jun 2023 17:18:20 GMT
- Title: Affinity Clustering Framework for Data Debiasing Using Pairwise
Distribution Discrepancy
- Authors: Siamak Ghodsi, and Eirini Ntoutsi
- Abstract summary: Group imbalance, resulting from inadequate or unrepresentative data collection methods, is a primary cause of representation bias in datasets.
This paper presents MASC, a data augmentation approach that leverages affinity clustering to balance the representation of non-protected and protected groups of a target dataset.
- Score: 10.184056098238765
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Group imbalance, resulting from inadequate or unrepresentative data
collection methods, is a primary cause of representation bias in datasets.
Representation bias can exist with respect to different groups of one or more
protected attributes and, when a learning model is trained on such biased data,
may lead to prejudicial and discriminatory outcomes toward certain groups of
individuals. This paper presents MASC, a data augmentation approach that
leverages affinity clustering to balance the representation of non-protected
and protected groups of a target dataset by borrowing instances of the
protected attribute from similar datasets that are assigned to the same
cluster as the target dataset. The proposed method involves constructing an
affinity matrix by
quantifying distribution discrepancies between dataset pairs and transforming
them into a symmetric pairwise similarity matrix. A non-parametric spectral
clustering is then applied to this affinity matrix, automatically categorizing
the datasets into an optimal number of clusters. We perform a step-by-step
experiment that demonstrates the proposed data augmentation procedure and
evaluate and discuss its performance. A comparison with
other data augmentation methods, both pre- and post-augmentation, is conducted,
along with a model evaluation analysis of each method. Our method handles
non-binary protected attributes; accordingly, in our experiments, bias is
measured in a non-binary setup with respect to the distribution of racial
groups, comparing two separate minority groups against the majority group
before and after debiasing. Empirical results imply that augmenting datasets
with real (genuine) data from similar contexts can debias the target datasets
as effectively as existing data augmentation strategies.
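The abstract fixes neither the discrepancy measure, the similarity transform, nor the cluster-count selection rule, so the sketch below only illustrates the described pipeline under assumptions: per-feature 1-D Wasserstein distances as the pairwise discrepancy, a Gaussian kernel as the similarity transform, and an eigengap heuristic with scikit-learn's SpectralClustering for the non-parametric clustering step. All function names and parameter choices here are hypothetical, not taken from the paper.

```python
# Minimal sketch of an affinity-clustering pipeline in the spirit of MASC.
# Assumptions (not specified in the abstract): per-feature 1-D Wasserstein
# distance as the pairwise discrepancy, a Gaussian kernel as the similarity
# transform, and an eigengap heuristic for the number of clusters.
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.cluster import SpectralClustering

def pairwise_discrepancy(datasets):
    """Mean per-feature 1-D Wasserstein distance between every pair of datasets.

    `datasets` is a list of (n_i, n_features) arrays sharing the same feature set.
    """
    n = len(datasets)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = np.mean([wasserstein_distance(datasets[i][:, f], datasets[j][:, f])
                         for f in range(datasets[i].shape[1])])
            D[i, j] = D[j, i] = d  # discrepancy matrix is symmetric by construction
    return D

def discrepancy_to_affinity(D):
    """Turn the discrepancy matrix into a symmetric pairwise similarity matrix."""
    sigma = np.median(D[D > 0]) if np.any(D > 0) else 1.0  # heuristic bandwidth
    return np.exp(-D / sigma)

def estimate_n_clusters(A, max_k=5):
    """Eigengap heuristic on the normalized graph Laplacian (one non-parametric choice)."""
    d = A.sum(axis=1)
    L = np.eye(len(A)) - A / np.sqrt(np.outer(d, d))
    eigvals = np.linalg.eigvalsh(L)[:max_k + 1]  # ascending eigenvalues
    return int(np.argmax(np.diff(eigvals))) + 1

def cluster_datasets(datasets):
    """Assign each dataset to a cluster; same-cluster datasets are augmentation candidates."""
    D = pairwise_discrepancy(datasets)
    A = discrepancy_to_affinity(D)
    k = estimate_n_clusters(A)
    labels = SpectralClustering(n_clusters=k, affinity="precomputed",
                                random_state=0).fit_predict(A)
    return labels
```

In the MASC setting, the augmentation step would then borrow instances of the under-represented protected group from datasets that share the target dataset's cluster label until the group counts are balanced; that step is omitted here because it depends on the specific protected attribute and balancing target.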
Related papers
- A structured regression approach for evaluating model performance across intersectional subgroups [53.91682617836498]
Disaggregated evaluation is a central task in AI fairness assessment, where the goal is to measure an AI system's performance across different subgroups.
We introduce a structured regression approach to disaggregated evaluation that we demonstrate can yield reliable system performance estimates even for very small subgroups.
arXiv Detail & Related papers (2024-01-26T14:21:45Z)
- GroupMixNorm Layer for Learning Fair Models [4.324785083027206]
This research proposes a novel in-processing based GroupMixNorm layer for mitigating bias from deep learning models.
The proposed method improves upon several fairness metrics with minimal impact on overall accuracy.
arXiv Detail & Related papers (2023-12-19T09:04:26Z)
- Group-blind optimal transport to group parity and its constrained variants [7.92637080020358]
We design a single group-blind projection map that aligns the feature distributions of both groups in the source data.
We assume that the source data are an unbiased representation of the population.
We present numerical results on synthetic data and real data.
arXiv Detail & Related papers (2023-10-17T17:14:07Z)
- Leveraging Structure for Improved Classification of Grouped Biased Data [8.121462458089143]
We consider semi-supervised binary classification for applications in which data points are naturally grouped.
We derive a semi-supervised algorithm that explicitly leverages the structure to learn an optimal, group-aware, probability-output classifier.
arXiv Detail & Related papers (2022-12-07T15:18:21Z)
- Inv-SENnet: Invariant Self Expression Network for clustering under biased data [17.25929452126843]
We propose a novel framework for jointly removing unwanted attributes (biases) while learning to cluster data points in individual subspaces.
Our experimental results on synthetic and real-world datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-11-13T01:19:06Z)
- Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data [81.43750358586072]
We propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes.
We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets.
arXiv Detail & Related papers (2022-10-24T08:57:55Z)
- Towards Group Robustness in the presence of Partial Group Labels [61.33713547766866]
Spurious correlations between input samples and the target labels wrongly direct the neural network predictions.
We propose an algorithm that optimizes for the worst-off group assignments from a constraint set.
We show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
arXiv Detail & Related papers (2022-01-10T22:04:48Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class (a minimal oversampling sketch appears after this related-papers list).
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Auditing for Diversity using Representative Examples [17.016881905579044]
We propose a cost-effective approach to approximate the disparity of a given unlabeled dataset.
Our proposed algorithm uses the pairwise similarity between elements in the dataset and elements in the control set to effectively bootstrap an approximation.
We show that using a control set whose size is much smaller than the size of the dataset is sufficient to achieve a small approximation error.
arXiv Detail & Related papers (2021-07-15T15:21:17Z)
- Contrastive Clustering [57.71729650297379]
We propose Contrastive Clustering (CC) which explicitly performs the instance- and cluster-level contrastive learning.
In particular, CC achieves an NMI of 0.705 (0.431) on the CIFAR-10 (CIFAR-100) dataset, which is an up to 19% (39%) performance improvement compared with the best baseline.
arXiv Detail & Related papers (2020-09-21T08:54:40Z)
- Clustering Binary Data by Application of Combinatorial Optimization Heuristics [52.77024349608834]
We study clustering methods for binary data, first defining aggregation criteria that measure the compactness of clusters.
Five new and original methods are introduced, using neighborhoods and population behavior optimization metaheuristics.
From a set of 16 data tables generated by a quasi-Monte Carlo experiment, a comparison is performed for one of the aggregation criteria using L1 dissimilarity, against hierarchical clustering and a version of k-means: partitioning around medoids (PAM).
arXiv Detail & Related papers (2020-01-06T23:33:31Z)
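As referenced in the resampling-strategy entry above, a minimal random-oversampling sketch follows. It is a generic illustration of class rebalancing, not a reproduction of any method compared in that survey, and the function name is hypothetical.

```python
# Minimal random-oversampling sketch (illustrative only; the survey compares many
# resampling strategies and this toy example does not reproduce any specific one).
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until every class matches the majority count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = [np.arange(len(y))]  # keep every original row
    for c, n in zip(classes, counts):
        if n < target:
            members = np.flatnonzero(y == c)
            idx.append(rng.choice(members, size=target - n, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]
```

In practice, libraries such as imbalanced-learn package this and more elaborate strategies (e.g., SMOTE), with resampling applied only to the training split to avoid leakage.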
This list is automatically generated from the titles and abstracts of the papers on this site.