Quality-Diversity Generative Sampling for Learning with Synthetic Data
- URL: http://arxiv.org/abs/2312.14369v2
- Date: Tue, 27 Feb 2024 19:21:46 GMT
- Title: Quality-Diversity Generative Sampling for Learning with Synthetic Data
- Authors: Allen Chang, Matthew C. Fontaine, Serena Booth, Maja J. Matarić,
Stefanos Nikolaidis
- Abstract summary: Generative models can serve as surrogates for some real data sources by creating synthetic training datasets.
We propose quality-diversity generative sampling (QDGS), a framework for sampling data uniformly across a user-defined measure space.
- Score: 18.642540152362237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative models can serve as surrogates for some real data sources by
creating synthetic training datasets, but in doing so they may transfer biases
to downstream tasks. We focus on protecting quality and diversity when
generating synthetic training datasets. We propose quality-diversity generative
sampling (QDGS), a framework for sampling data uniformly across a user-defined
measure space, despite the data coming from a biased generator. QDGS is a
model-agnostic framework that uses prompt guidance to optimize a quality
objective across measures of diversity for synthetically generated data,
without fine-tuning the generative model. Using balanced synthetic datasets
generated by QDGS, we first debias classifiers trained on color-biased shape
datasets as a proof-of-concept. By applying QDGS to facial data synthesis, we
prompt for desired semantic concepts, such as skin tone and age, to create an
intersectional dataset with a combined blend of visual features. Leveraging
this balanced data for training classifiers improves fairness while maintaining
accuracy on facial recognition benchmarks. Code available at:
https://github.com/Cylumn/qd-generative-sampling.
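To make the sampling idea concrete, below is a minimal, hypothetical sketch of quality-diversity sampling over a frozen generator's latent space: an archive keeps the highest-quality latent found in each cell of a user-defined measure space, so the resulting dataset is spread across those measures instead of following the generator's biased prior. The generate, quality_score, and measures functions are placeholders (for example, a pretrained image generator and CLIP similarities to quality and diversity prompts), and the simple MAP-Elites-style loop stands in for the quality-diversity optimizer used by QDGS; this is a sketch, not the released implementation.

```python
import numpy as np

# Placeholders (assumptions): a frozen pretrained generator plus prompt-guided
# scores, e.g., CLIP similarity to a quality prompt and to diversity prompts.
def generate(z):
    """Map a latent vector to a synthetic image (pretrained generator, not shown)."""
    raise NotImplementedError

def quality_score(image):
    """Similarity to a quality prompt; higher is better."""
    raise NotImplementedError

def measures(image):
    """Similarities to two diversity prompts, each scaled to [0, 1]."""
    raise NotImplementedError

def qd_generative_sampling(latent_dim=512, iters=10_000, grid=(20, 20), sigma=0.2):
    """MAP-Elites-style loop: keep the best latent found in each measure-space cell."""
    rng = np.random.default_rng(0)
    archive = {}  # cell index -> (quality, latent)
    for _ in range(iters):
        if archive and rng.random() < 0.5:
            # Mutate an elite already stored in the archive.
            _, parent = archive[list(archive)[rng.integers(len(archive))]]
            z = parent + sigma * rng.standard_normal(latent_dim)
        else:
            z = rng.standard_normal(latent_dim)  # fresh random latent
        img = generate(z)
        cell = tuple(np.clip((np.asarray(measures(img)) * np.asarray(grid)).astype(int),
                             0, np.asarray(grid) - 1))
        score = quality_score(img)
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, z)  # keep only the best latent per cell
    # One sample per occupied cell: a dataset balanced across the measure space.
    return [generate(z) for _, z in archive.values()]
```

The design point the sketch illustrates is that coverage of the measure space is enforced by the archive rather than by i.i.d. latent sampling, which is what lets a biased generator still yield a balanced synthetic training set.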
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but when generating tabular data they do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- Testing Deep Learning Recommender Systems Models on Synthetic GAN-Generated Datasets [0.27624021966289597]
The previously published Generative Adversarial Networks for Recommender Systems (GANRS) method generates synthetic datasets for collaborative filtering recommender systems.
We have tested the GANRS method by creating multiple synthetic datasets from three different real datasets used as sources.
We have also selected six state-of-the-art collaborative filtering deep learning models to test both their comparative performance and the GANRS method.
arXiv Detail & Related papers (2024-10-23T08:09:48Z)
- Post-training Model Quantization Using GANs for Synthetic Data Generation [57.40733249681334]
We investigate the use of synthetic data as a substitute for real data when calibrating models for post-training quantization.
We compare the performance of models quantized using data generated by StyleGAN2-ADA and our pre-trained DiStyleGAN, with quantization using real data and an alternative data generation method based on fractal images.
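To illustrate what "calibration with synthetic data" consumes in practice, here is a small, hypothetical PyTorch sketch of post-training static quantization in which the calibration batches come from a stand-in sample_synthetic_batch function instead of real images; in the paper's setting those batches would come from a generator such as StyleGAN2-ADA or DiStyleGAN, and this is not the authors' pipeline.

```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class SmallNet(nn.Module):
    """Toy float model; quant/dequant stubs mark the region to be quantized."""
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.relu = nn.ReLU()
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

def sample_synthetic_batch(batch_size=8):
    # Assumption: stands in for images drawn from a pretrained GAN generator.
    return torch.rand(batch_size, 3, 32, 32)

model = SmallNet().eval()
model.qconfig = tq.get_default_qconfig("fbgemm")
prepared = tq.prepare(model)          # insert observers for activation ranges

with torch.no_grad():                 # calibration uses synthetic data only
    for _ in range(16):
        prepared(sample_synthetic_batch())

quantized = tq.convert(prepared)      # fold observed ranges into an int8 model
```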
arXiv Detail & Related papers (2023-05-10T11:10:09Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
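For instance, one commonly analyzed quantity in this setting is the Fréchet distance between Gaussians fitted to real and synthetic feature sets (the basis of FID); the sketch below computes it from precomputed feature matrices, with feature extraction (e.g., an Inception embedding) assumed to happen elsewhere. How the features are represented and how many instances go into each set directly change this estimate, which is exactly the kind of factor the study examines.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """FID-style distance between two feature sets of shape (n_samples, dim)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):      # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Stand-in features for illustration; real usage passes network embeddings.
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(500, 64)),
                       rng.normal(loc=0.1, size=(500, 64))))
```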
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- Creating Synthetic Datasets for Collaborative Filtering Recommender Systems using Generative Adversarial Networks [1.290382979353427]
Research and education in machine learning need diverse, representative, and open datasets to handle the necessary training, validation, and testing tasks.
To support this variety of research, it is necessary and convenient to reinforce the existing datasets with synthetic ones.
This paper proposes a Generative Adversarial Network (GAN)-based method to generate collaborative filtering datasets.
arXiv Detail & Related papers (2023-03-02T14:23:27Z)
- Generating High Fidelity Synthetic Data via Coreset selection and Entropic Regularization [15.866662428675054]
We propose using a combination of coreset selection methods and entropic regularization to select the highest fidelity samples.
In a semi-supervised learning scenario, we show that augmenting the labeled dataset with our selected subset of samples leads to a greater accuracy improvement.
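As a loose illustration only (not the authors' coreset or entropic-regularization algorithm), one simple way to rank synthetic samples by fidelity is to score them with a pretrained classifier and keep the lowest-entropy predictions; classifier_probs below is a placeholder.

```python
import numpy as np

def classifier_probs(samples):
    # Assumption: placeholder for a pretrained classifier's softmax outputs,
    # returned as an array of shape (n_samples, n_classes).
    raise NotImplementedError

def select_low_entropy_subset(samples, k):
    """Keep the k synthetic samples the classifier is most confident about."""
    probs = classifier_probs(samples)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    keep = np.argsort(entropy)[:k]    # lowest predictive entropy first
    return [samples[i] for i in keep]
```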
arXiv Detail & Related papers (2023-01-31T22:59:41Z)
- SynBench: Task-Agnostic Benchmarking of Pretrained Representations using Synthetic Data [78.21197488065177]
Recent success in fine-tuning large models, pretrained on broad data at scale, on downstream tasks has led to a significant paradigm shift in deep learning.
This paper proposes a new task-agnostic framework, SynBench, to measure the quality of pretrained representations using synthetic data.
arXiv Detail & Related papers (2022-10-06T15:25:00Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
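A hedged sketch of the general idea of feature alignment (not CAFE's actual objective) is to match layer-wise feature statistics of real and synthetic batches under a shared network; extract_features below is a placeholder returning one activation tensor per scale.

```python
import torch

def extract_features(batch):
    # Assumption: placeholder for a shared network that returns a list of
    # per-layer activation tensors for the batch (one tensor per scale).
    raise NotImplementedError

def feature_alignment_loss(real_batch, synthetic_batch):
    """Match mean activations of real and synthetic batches at every scale."""
    loss = torch.zeros(())
    for f_real, f_syn in zip(extract_features(real_batch),
                             extract_features(synthetic_batch)):
        loss = loss + torch.mean((f_real.mean(dim=0) - f_syn.mean(dim=0)) ** 2)
    return loss
```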
arXiv Detail & Related papers (2022-03-03T05:58:49Z)
- Towards Synthetic Multivariate Time Series Generation for Flare Forecasting [5.098461305284216]
One of the limiting factors in training data-driven, rare-event prediction algorithms is the scarcity of the events of interest.
In this study, we explore the usefulness of the conditional generative adversarial network (CGAN) as a means to perform data-informed oversampling.
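A minimal sketch of data-informed oversampling with a conditional generator follows; the generator function and the label encoding are assumptions standing in for a trained CGAN, not the study's model.

```python
import numpy as np

def generator(noise, labels):
    # Assumption: placeholder for a trained conditional GAN generator that maps
    # (noise, class label) pairs to synthetic multivariate time-series windows.
    raise NotImplementedError

def oversample_rare_class(rare_label, n_needed, noise_dim=64):
    """Draw synthetic minority-class examples to rebalance the training set."""
    noise = np.random.randn(n_needed, noise_dim)
    labels = np.full(n_needed, rare_label)
    return generator(noise, labels)

# Usage idea: X_train = np.concatenate([X_train, oversample_rare_class(1, 500)])
```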
arXiv Detail & Related papers (2021-05-16T22:23:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.