A Note on Data Biases in Generative Models
- URL: http://arxiv.org/abs/2012.02516v1
- Date: Fri, 4 Dec 2020 10:46:37 GMT
- Title: A Note on Data Biases in Generative Models
- Authors: Patrick Esser and Robin Rombach and Björn Ommer
- Abstract summary: We investigate the impact of dataset quality on the performance of generative models.
We show how societal biases of datasets are replicated by generative models.
We present creative applications through unpaired transfer between diverse datasets such as photographs, oil portraits, and anime.
- Score: 16.86600007830682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is tempting to think that machines are less prone to unfairness and
prejudice. However, machine learning approaches compute their outputs based on
data. While biases can enter at any stage of the development pipeline, models
are particularly prone to mirroring the biases of the datasets they are trained on
and therefore do not necessarily reflect truths about the world but, primarily,
truths about the data. To raise awareness about the relationship between modern
algorithms and the data that shape them, we use a conditional invertible neural
network to disentangle the dataset-specific information from the information
which is shared across different datasets. In this way, we can project the same
image onto different datasets, thereby revealing their inherent biases. We use
this methodology to (i) investigate the impact of dataset quality on the
performance of generative models, (ii) show how societal biases of datasets are
replicated by generative models, and (iii) present creative applications
through unpaired transfer between diverse datasets such as photographs, oil
portraits, and anime. Our code and an interactive demonstration are available
at https://github.com/CompVis/net2net .
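To make the idea concrete, the following is a minimal, hypothetical PyTorch sketch of transfer with a conditional invertible network: an image embedding is mapped to a dataset-agnostic residual under the source dataset's condition and decoded back under the target dataset's condition. All class names, dimensions, and the coupling design are illustrative assumptions, not the authors' implementation (for that, see the net2net repository linked above).

```python
# Sketch only (not the authors' code): a conditional invertible network maps
# an embedding z to a residual v, conditioned on a dataset label. Transfer
# runs the forward pass under the source dataset and the inverse pass under
# the target dataset.
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    """Affine coupling layer whose scale/shift depend on the condition."""
    def __init__(self, dim, cond_dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, z, cond, reverse=False):
        z1, z2 = z[:, :self.half], z[:, self.half:]
        s, t = self.net(torch.cat([z1, cond], dim=1)).chunk(2, dim=1)
        s = torch.tanh(s)  # bounded log-scale for numerical stability
        z2 = (z2 - t) * torch.exp(-s) if reverse else z2 * torch.exp(s) + t
        return torch.cat([z1, z2], dim=1)

class ConditionalINN(nn.Module):
    def __init__(self, dim, n_datasets, n_layers=4, cond_dim=32):
        super().__init__()
        self.embed = nn.Embedding(n_datasets, cond_dim)
        self.layers = nn.ModuleList(
            ConditionalCoupling(dim, cond_dim) for _ in range(n_layers))

    def forward(self, z, dataset_id, reverse=False):
        cond = self.embed(dataset_id)
        if not reverse:
            for layer in self.layers:            # z -> v
                z = layer(z, cond).flip(1)       # flip halves between layers
        else:
            for layer in reversed(self.layers):  # v -> z, exact inverse
                z = layer(z.flip(1), cond, reverse=True)
        return z

def project(inn, z_image, source_id, target_id):
    """Project an image embedding from one dataset onto another."""
    v = inn(z_image, source_id)              # dataset-agnostic residual
    return inn(v, target_id, reverse=True)   # re-expressed under the target
```

In this picture, training would additionally push the residual v toward a dataset-independent prior, so that swapping the condition changes only dataset-specific appearance.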
Related papers
- Will the Inclusion of Generated Data Amplify Bias Across Generations in Future Image Classification Models? [29.71939692883025]
We investigate the effects of generated data on image classification tasks, with a specific focus on bias.
Hundreds of experiments are conducted on Colorized MNIST, CIFAR-20/100, and Hard ImageNet datasets.
Our findings contribute to the ongoing debate on the implications of synthetic data for fairness in real-world applications.
arXiv Detail & Related papers (2024-10-14T05:07:06Z)
- Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
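One common way to instantiate such a measure with a conditional diffusion model is to compare conditional and unconditional denoising error: content that the conditioning label helps to denoise is more typical of that label. The sketch below is a hedged illustration of that idea, not the paper's exact code; `eps_model` (a noise-prediction network where `cond=None` means unconditional) and the noise schedule `alphas_cumprod` are assumed interfaces.

```python
import torch

@torch.no_grad()
def typicality(eps_model, x0, cond, alphas_cumprod, n_samples=8):
    """Average over random timesteps how much `cond` improves denoising."""
    scores = []
    for _ in range(n_samples):
        t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))
        a = alphas_cumprod[t].view(-1, 1, 1, 1)
        noise = torch.randn_like(x0)
        x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise  # forward diffusion
        err_uncond = (eps_model(x_t, t, None) - noise).pow(2).mean((1, 2, 3))
        err_cond = (eps_model(x_t, t, cond) - noise).pow(2).mean((1, 2, 3))
        scores.append(err_uncond - err_cond)  # > 0: cond helps denoise x0
    return torch.stack(scores).mean(0)
```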
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
- Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond [93.96982273042296]
Vision-language (VL) understanding tasks evaluate models' comprehension of complex visual scenes through multiple-choice questions.
We have identified two dataset biases that models can exploit as shortcuts to resolve various VL tasks correctly without proper understanding.
We propose Adversarial Data Synthesis (ADS) to generate synthetic training and debiased evaluation data.
We then introduce Intra-sample Counterfactual Training (ICT) to assist models in utilizing the synthesized training data, particularly the counterfactual data, via focusing on intra-sample differentiation.
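As a rough, hypothetical sketch of what intra-sample counterfactual training can look like in code: the model is supervised on a sample and its synthesized counterfactual in the same step, so a shortcut that ignores the edited content cannot fit both views at once. All names are illustrative; this is not the paper's exact ICT objective.

```python
import torch.nn.functional as F

def counterfactual_pair_loss(model, batch):
    # Factual and counterfactual views share the question and answer
    # choices but differ in image content and in the correct answer.
    logits = model(batch["image"], batch["question"], batch["choices"])
    cf_logits = model(batch["cf_image"], batch["question"], batch["choices"])
    return (F.cross_entropy(logits, batch["label"])
            + F.cross_entropy(cf_logits, batch["cf_label"]))
```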
arXiv Detail & Related papers (2023-10-23T08:09:42Z)
- Leaving Reality to Imagination: Robust Classification via Generated Datasets [24.411444438920988]
Recent research on robustness has revealed significant performance gaps for neural image classifiers between test data similar to their training sets and naturally shifted test data.
We study the question: How do generated datasets influence the natural robustness of image classifiers?
We find that ImageNet classifiers trained on real data augmented with generated data achieve higher accuracy and effective robustness than classifiers trained on real data alone.
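A minimal sketch of this kind of augmented training, assuming generated images have been saved in an ImageFolder layout mirroring the real data (paths and transforms are illustrative):

```python
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, transforms

tf = transforms.Compose([transforms.Resize(256),
                         transforms.CenterCrop(224),
                         transforms.ToTensor()])
real = datasets.ImageFolder("data/imagenet/train", transform=tf)
generated = datasets.ImageFolder("data/generated/train", transform=tf)
# Concatenating the two datasets trains on real and synthetic images alike.
loader = DataLoader(ConcatDataset([real, generated]), batch_size=256,
                    shuffle=True, num_workers=8)
```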
arXiv Detail & Related papers (2023-02-05T22:49:33Z)
- Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Instead, one is given access to a set of expert models and their predictions, alongside some limited information about the datasets used to train them.
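One plausible instance-wise scheme, sketched below under stated assumptions: build a simple density model (here a Gaussian) from the limited information about each expert's training data, then weight each expert's prediction by how plausible the test point is under its density. Illustrative only, not the paper's exact Synthetic Model Combination procedure.

```python
import numpy as np

def instance_wise_ensemble(x, experts, means, covs):
    """x: (d,) test point; experts: prediction callables;
    means/covs: Gaussian summaries of each expert's training data."""
    log_w = np.array([
        -0.5 * (x - mu) @ np.linalg.solve(cov, x - mu)
        - 0.5 * np.linalg.slogdet(cov)[1]
        for mu, cov in zip(means, covs)])
    w = np.exp(log_w - log_w.max())   # softmax over log-densities
    w /= w.sum()
    preds = np.array([expert(x) for expert in experts])
    return w @ preds                  # instance-specific weighted vote
```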
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
- Identifying the Context Shift between Test Benchmarks and Production Data [1.2259552039796024]
There exists a performance gap between machine learning models' accuracy on dataset benchmarks and real-world production data.
We outline two methods for identifying changes in context that lead to distribution shifts and model prediction errors.
We present two case studies to highlight the implicit assumptions underlying applied machine learning models that tend to lead to errors.
arXiv Detail & Related papers (2022-07-03T14:54:54Z)
- Transitioning from Real to Synthetic data: Quantifying the bias in model [1.6134566438137665]
This study aims to establish a trade-off between bias and fairness in the models trained using synthetic data.
We demonstrate that there exist varying levels of bias impact on models trained using synthetic data.
arXiv Detail & Related papers (2021-05-10T06:57:14Z)
- Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles [66.15398165275926]
We propose a method that can automatically detect and ignore dataset-specific patterns, which we call dataset biases.
Our method trains a lower capacity model in an ensemble with a higher capacity model.
We show improvement in all settings, including a 10 point gain on the visual question answering dataset.
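The core training trick can be sketched as a product of experts over logits: the weak, low-capacity model absorbs dataset-specific shortcuts during training, and only the main model is kept at test time. A simplified illustration (weighting and scheduling details omitted):

```python
import torch.nn.functional as F

def mixed_capacity_loss(main_logits, weak_logits, labels):
    # Product of experts = adding log-probabilities before the final softmax.
    joint = F.log_softmax(main_logits, dim=1) + F.log_softmax(weak_logits, dim=1)
    return F.cross_entropy(joint, labels)

def predict(main_model, x):
    # The weak model is discarded at inference; the main model alone
    # should now rely on signal the weak model could not capture.
    return main_model(x).argmax(dim=1)
```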
arXiv Detail & Related papers (2020-11-07T22:20:03Z)
- Federated Visual Classification with Real-World Data Distribution [9.564468846277366]
We characterize the effect real-world data distributions have on distributed learning, using as a benchmark the standard Federated Averaging (FedAvg) algorithm.
We introduce two new large-scale datasets for species and landmark classification, with realistic per-user data splits.
We also develop two new algorithms (FedVC, FedIR) that intelligently resample and reweight over the client pool, bringing large improvements in accuracy and stability in training.
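In the spirit of FedIR, importance reweighting can be sketched as scaling each example's loss on a client by the ratio of the global class prior to that client's local class prior, so that skewed per-user data contributes as if drawn from the global distribution. A simplified sketch, not the paper's exact algorithm:

```python
import torch.nn.functional as F

def reweighted_client_loss(logits, labels, p_global, p_local):
    """p_global/p_local: per-class prior probabilities (1-D tensors)."""
    weights = p_global[labels] / p_local[labels].clamp_min(1e-8)
    losses = F.cross_entropy(logits, labels, reduction="none")
    return (weights * losses).mean()
```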
arXiv Detail & Related papers (2020-03-18T07:55:49Z)