Auditing for Diversity using Representative Examples
- URL: http://arxiv.org/abs/2107.07393v1
- Date: Thu, 15 Jul 2021 15:21:17 GMT
- Title: Auditing for Diversity using Representative Examples
- Authors: Vijay Keswani and L. Elisa Celis
- Abstract summary: We propose a cost-effective approach to approximate the disparity of a given unlabeled dataset.
Our proposed algorithm uses the pairwise similarity between elements in the dataset and elements in the control set to effectively bootstrap an approximation.
We show that using a control set whose size is much smaller than the size of the dataset is sufficient to achieve a small approximation error.
- Score: 17.016881905579044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Assessing the diversity of a dataset of information associated with people is
crucial before using such data for downstream applications. For a given
dataset, this often involves computing the imbalance or disparity in the
empirical marginal distribution of a protected attribute (e.g. gender, dialect,
etc.). However, real-world datasets, such as images from Google Search or
collections of Twitter posts, often do not have protected attributes labeled.
Consequently, to derive disparity measures for such datasets, the elements need
to be hand-labeled or crowd-annotated, both of which are expensive processes.
We propose a cost-effective approach to approximate the disparity of a given
unlabeled dataset, with respect to a protected attribute, using a control set
of labeled representative examples. Our proposed algorithm uses the pairwise
similarity between elements in the dataset and elements in the control set to
effectively bootstrap an approximation to the disparity of the dataset.
Importantly, we show that using a control set whose size is much smaller than
the size of the dataset is sufficient to achieve a small approximation error.
Further, based on our theoretical framework, we also provide an algorithm to
construct adaptive control sets that achieve smaller approximation errors than
randomly chosen control sets. Simulations on two image datasets and one Twitter
dataset demonstrate the efficacy of our approach (using random and adaptive
control sets) in auditing the diversity of a wide variety of datasets.
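The idea in the abstract can be illustrated with a short Python sketch. This is not the authors' algorithm: it is a simplified proxy that assigns each unlabeled element to the group whose control examples it resembles most on average, then reports the resulting imbalance. The function name estimate_disparity, the use of cosine similarity, the binary 0/1 protected attribute, and the definition of disparity as (fraction in group 1) minus (fraction in group 0) are all assumptions made for illustration, as is the synthetic data.

```python
import numpy as np


def estimate_disparity(dataset, control, control_labels):
    """Approximate (fraction in group 1) - (fraction in group 0) for an
    unlabeled dataset, using a small labeled control set.

    Hypothetical helper for illustration only; assumes non-zero feature
    vectors and a binary protected attribute encoded as 0/1.
    """
    dataset = np.asarray(dataset, dtype=float)
    control = np.asarray(control, dtype=float)
    control_labels = np.asarray(control_labels)

    # Cosine similarity between every dataset element and every control element.
    d = dataset / np.linalg.norm(dataset, axis=1, keepdims=True)
    c = control / np.linalg.norm(control, axis=1, keepdims=True)
    sim = d @ c.T  # shape: (len(dataset), len(control))

    # Average similarity of each dataset element to each group's control examples.
    sim_g0 = sim[:, control_labels == 0].mean(axis=1)
    sim_g1 = sim[:, control_labels == 1].mean(axis=1)

    # Assign each element to the group it resembles more, then measure imbalance.
    frac_g1 = (sim_g1 > sim_g0).mean()
    return frac_g1 - (1.0 - frac_g1)


# Synthetic demo (assumed data): 70 elements near one center, 30 near another,
# audited with only 5 labeled control examples per group.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
true_labels = np.array([0] * 70 + [1] * 30)
dataset = centers[true_labels] + rng.normal(scale=0.05, size=(100, 2))
control_labels = np.array([0] * 5 + [1] * 5)
control = centers[control_labels] + rng.normal(scale=0.05, size=(10, 2))

estimate = estimate_disparity(dataset, control, control_labels)
print(estimate)
```

In this toy setting the two groups are well separated, so ten labeled examples are enough to recover the true imbalance of the 100-element dataset. Real data would require a similarity measure that actually tracks the protected attribute, which is the assumption the paper's analysis makes precise.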
Related papers
- Diversity Measurement and Subset Selection for Instruction Tuning
Datasets [40.930387018872786]
We use determinantal point processes to capture the diversity and quality of instruction tuning datasets for subset selection.
We propose to measure dataset diversity with log determinant distance that is the distance between the dataset of interest and a maximally diverse reference dataset.
arXiv Detail & Related papers (2024-02-04T02:09:43Z) - Affinity Clustering Framework for Data Debiasing Using Pairwise
Distribution Discrepancy [10.184056098238765]
Group imbalance, resulting from inadequate or unrepresentative data collection methods, is a primary cause of representation bias in datasets.
This paper presents MASC, a data augmentation approach that leverages affinity clustering to balance the representation of non-protected and protected groups of a target dataset.
arXiv Detail & Related papers (2023-06-02T17:18:20Z) - Combining datasets to increase the number of samples and improve model
fitting [7.4771091238795595]
We propose a novel framework called Combine datasets based on Imputation (ComImp).
In addition, we propose a variant of ComImp, PCA-ComImp, that uses Principal Component Analysis (PCA) to reduce dimension before combining datasets.
Our results indicate that the proposed methods are somewhat similar to transfer learning in that the merge can significantly improve the accuracy of a prediction model on smaller datasets.
arXiv Detail & Related papers (2022-10-11T06:06:37Z) - Detection Hub: Unifying Object Detection Datasets via Query Adaptation
on Language Embedding [137.3719377780593]
A new design (named Detection Hub) is dataset-aware and category-aligned.
It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding.
arXiv Detail & Related papers (2022-06-07T17:59:44Z) - Leveraging Ensembles and Self-Supervised Learning for Fully-Unsupervised
Person Re-Identification and Text Authorship Attribution [77.85461690214551]
Learning from fully-unlabeled data is challenging in Multimedia Forensics problems, such as Person Re-Identification and Text Authorship Attribution.
Recent self-supervised learning methods have been shown to be effective when dealing with fully-unlabeled data in cases where the underlying classes have significant semantic differences.
We propose a strategy to tackle Person Re-Identification and Text Authorship Attribution by enabling learning from unlabeled data even when samples from different classes are not prominently diverse.
arXiv Detail & Related papers (2022-02-07T13:08:11Z) - Towards Group Robustness in the presence of Partial Group Labels [61.33713547766866]
Spurious correlations between input samples and the target labels wrongly direct the neural network predictions.
We propose an algorithm that optimizes for the worst-off group assignments from a constraint set.
We show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
arXiv Detail & Related papers (2022-01-10T22:04:48Z) - Selecting the suitable resampling strategy for imbalanced data
classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z) - Capturing patterns of variation unique to a specific dataset [68.8204255655161]
We propose a tuning-free method that identifies low-dimensional representations of a target dataset relative to one or more comparison datasets.
We show in several experiments that UCA with a single background dataset achieves similar results compared to cPCA with various tuning parameters.
arXiv Detail & Related papers (2021-04-16T15:07:32Z) - Multi-Source Causal Inference Using Control Variates [81.57072928775509]
We propose a general algorithm to estimate causal effects from multiple data sources.
We show theoretically that this reduces the variance of the ATE estimate.
We apply this framework to inference from observational data under an outcome selection bias.
arXiv Detail & Related papers (2021-03-30T21:20:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.