Balanced Face Dataset: Guiding StyleGAN to Generate Labeled Synthetic
Face Image Dataset for Underrepresented Group
- URL: http://arxiv.org/abs/2308.03495v1
- Date: Mon, 7 Aug 2023 11:42:50 GMT
- Title: Balanced Face Dataset: Guiding StyleGAN to Generate Labeled Synthetic
Face Image Dataset for Underrepresented Group
- Authors: Kidist Amde Mekonnen
- Abstract summary: Real-world datasets frequently have overrepresented and underrepresented groups.
One solution to mitigate bias in machine learning is to leverage a diverse and representative dataset.
The focus of this study was to generate a robust face image dataset using the StyleGAN model.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: For a machine learning model to generalize effectively to unseen data within
a particular problem domain, it is well-understood that the data needs to be of
sufficient size and representative of real-world scenarios. Nonetheless,
real-world datasets frequently have overrepresented and underrepresented
groups. One solution to mitigate bias in machine learning is to leverage a
diverse and representative dataset. Training a model on a dataset that covers
all demographics is crucial to reducing bias in machine learning. However,
collecting and labeling large-scale datasets has been challenging, prompting
the use of synthetic data generation and active labeling to decrease the costs
of manual labeling. The focus of this study was to generate a robust face image
dataset using the StyleGAN model. In order to achieve a balanced distribution
of the dataset among different demographic groups, a synthetic dataset was
created by controlling the generation process of StyleGAN and annotated for
different downstream tasks.
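Below is a minimal, hypothetical sketch of the kind of pipeline the abstract describes: sample faces from a pretrained StyleGAN generator, pseudo-label each image with an attribute classifier, and keep samples until every demographic group reaches the same size. The `generator` and `attribute_classifier` interfaces, and the rejection-sampling strategy itself, are assumptions for illustration rather than the paper's actual implementation.

```python
import torch

# Assumed interfaces (not from the paper): `generator(z)` is a pretrained
# StyleGAN mapping latent codes to face images, and `attribute_classifier(x)`
# returns logits over demographic groups used to pseudo-label each image.

def generate_balanced_dataset(generator, attribute_classifier, num_groups,
                              per_group=1000, latent_dim=512, batch=16,
                              device="cpu"):
    """Sample synthetic faces until every group has `per_group` labeled images."""
    counts = [0] * num_groups
    images, labels = [], []
    while min(counts) < per_group:
        z = torch.randn(batch, latent_dim, device=device)       # random latents
        with torch.no_grad():
            faces = generator(z)                                 # synthesize faces
            groups = attribute_classifier(faces).argmax(dim=1)   # pseudo-label group
        for face, g in zip(faces, groups.tolist()):
            if counts[g] < per_group:                            # keep only groups
                counts[g] += 1                                   # that are still short
                images.append(face.cpu())
                labels.append(g)
    return torch.stack(images), torch.tensor(labels)
```

The paper controls the generation process itself rather than relying purely on rejection sampling; the sketch is only meant to make the balancing and annotation step concrete.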
Related papers
- Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
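One plausible reading of such a typicality measure, sketched below, is the reduction in denoising error obtained by conditioning the finetuned model on a label; the `eps_model` interface and the exact formulation are assumptions, not a verified reproduction of the paper's definition.

```python
import torch

# Assumed interface: eps_model(x_t, t, cond) predicts the noise added to x_t;
# cond=None gives the unconditional prediction. alphas_cumprod is a tensor of
# cumulative noise-schedule products. All names are placeholders.

@torch.no_grad()
def typicality(eps_model, x0, label, timesteps, alphas_cumprod):
    """How 'typical' x0 is for `label`: the average drop in denoising error
    when the denoiser is conditioned on that label."""
    gain = 0.0
    for t in timesteps:
        noise = torch.randn_like(x0)
        a_bar = alphas_cumprod[t]
        x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # forward noising
        err_uncond = (eps_model(x_t, t, None) - noise).pow(2).mean()
        err_cond = (eps_model(x_t, t, label) - noise).pow(2).mean()
        gain += (err_uncond - err_cond).item()
    return gain / len(timesteps)
```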
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
- Combining Public Human Activity Recognition Datasets to Mitigate Labeled Data Scarcity [1.274578243851308]
We propose a novel strategy to combine publicly available datasets with the goal of learning a generalized HAR model.
Our experimental evaluation, which includes experimenting with different state-of-the-art neural network architectures, shows that combining public datasets can significantly reduce the number of labeled samples.
arXiv Detail & Related papers (2023-06-23T18:51:22Z)
- Zero-shot racially balanced dataset generation using an existing biased StyleGAN2 [5.463417677777276]
We propose a methodology that leverages the biased generative model StyleGAN2 to create demographically diverse images of synthetic individuals.
By training face recognition models with the resulting balanced dataset containing 50,000 identities per race, we can improve their performance and minimize biases that might have been present in a model trained on a real dataset.
arXiv Detail & Related papers (2023-05-12T18:07:10Z)
- Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding.
We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models.
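As a rough illustration of the projection idea (the paper additionally calibrates the projection, a step omitted here), the sketch below removes the subspace spanned by a set of biased prompt directions from text embeddings; the dimensions and vectors are placeholders.

```python
import numpy as np

def orthogonal_debias(biased_directions):
    """Return P = I - V V^T, which projects out the subspace spanned by the
    rows of `biased_directions` (shape [k, d])."""
    A = np.asarray(biased_directions, dtype=np.float64)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)   # orthonormal basis of the span
    return np.eye(A.shape[1]) - Vt.T @ Vt

# Hypothetical usage: strip a gender direction from class-prompt embeddings of a
# CLIP-like text encoder (random vectors stand in for real embeddings here).
d = 512
gender_dir = np.random.randn(1, d)        # e.g. emb("... a man") - emb("... a woman")
P = orthogonal_debias(gender_dir)
class_embeddings = np.random.randn(10, d)
debiased = class_embeddings @ P           # P is symmetric, so no transpose needed
```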
arXiv Detail & Related papers (2023-01-31T20:09:33Z)
- Adaptive Sampling Strategies to Construct Equitable Training Datasets [0.7036032466145111]
In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities.
One factor contributing to these performance gaps is a lack of representation in the data the models are trained on.
We formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem.
arXiv Detail & Related papers (2022-01-31T19:19:30Z)
- Towards Group Robustness in the presence of Partial Group Labels [61.33713547766866]
Spurious correlations between input samples and the target labels can wrongly direct neural network predictions.
We propose an algorithm that optimizes for the worst-off group assignments from a constraint set.
We show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
arXiv Detail & Related papers (2022-01-10T22:04:48Z)
- Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data [85.43008636875345]
We show that diverse representation in training data is key to increasing subgroup performances and achieving population level objectives.
Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.
arXiv Detail & Related papers (2021-03-05T00:27:08Z)
- Negative Data Augmentation [127.28042046152954]
We show that negative data augmentation samples provide information on the support of the data distribution.
We introduce a new GAN training objective where we use NDA as an additional source of synthetic data for the discriminator.
Empirically, models trained with our method achieve improved conditional/unconditional image generation along with improved anomaly detection capabilities.
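A minimal sketch of the kind of discriminator objective this describes, assuming `negative_augment` produces out-of-support transformations of real images (for example, patch-shuffled versions); the actual NDA objective and its weighting may differ.

```python
import torch
import torch.nn.functional as F

# Placeholders: D is a discriminator returning logits, G a generator, and
# negative_augment(x) an out-of-support transform of real images (e.g. a
# jigsaw shuffle of patches). Equal weighting of the terms is illustrative.

def nda_discriminator_loss(D, G, real, z, negative_augment):
    fake = G(z).detach()                          # generator samples
    nda = negative_augment(real)                  # treated as additional negatives
    d_real, d_fake, d_nda = D(real), D(fake), D(nda)
    return (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
            + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
            + F.binary_cross_entropy_with_logits(d_nda, torch.zeros_like(d_nda)))
```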
arXiv Detail & Related papers (2021-02-09T20:28:35Z)
- A Note on Data Biases in Generative Models [16.86600007830682]
We investigate the impact of dataset quality on the performance of generative models.
We show how societal biases of datasets are replicated by generative models.
We present creative applications through unpaired transfer between diverse datasets such as photographs, oil portraits, and anime images.
arXiv Detail & Related papers (2020-12-04T10:46:37Z)
- Partially Conditioned Generative Adversarial Networks [75.08725392017698]
Generative Adversarial Networks (GANs) let one synthesise artificial datasets by implicitly modelling the underlying probability distribution of a real-world training dataset.
With the introduction of Conditional GANs and their variants, these methods were extended to generating samples conditioned on ancillary information available for each sample within the dataset.
In this work, we argue that standard Conditional GANs are not suitable when this ancillary information is only partially available, and propose a new Adversarial Network architecture and training strategy.
arXiv Detail & Related papers (2020-07-06T15:59:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.