Fair4Free: Generating High-fidelity Fair Synthetic Samples using Data Free Distillation
- URL: http://arxiv.org/abs/2410.01423v1
- Date: Wed, 2 Oct 2024 11:16:11 GMT
- Title: Fair4Free: Generating High-fidelity Fair Synthetic Samples using Data Free Distillation
- Authors: Md Fahim Sikder, Daniel de Leng, Fredrik Heintz
- Abstract summary: This work presents a novel generative model to generate synthetic fair data using data-free distillation in the latent space.
In our approach, we first train a teacher model to create fair representations and then distil the knowledge to a student model.
The process of distilling the student model is data-free, i.e. the student model does not have access to the training dataset while distilling.
- Score: 4.915744683251151
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work presents Fair4Free, a novel generative model to generate synthetic fair data using data-free distillation in the latent space. Fair4Free can operate in settings where the data is private or inaccessible. In our approach, we first train a teacher model to create fair representations and then distil the knowledge to a student model (which uses a smaller architecture). The distillation of the student model is data-free, i.e. the student model does not have access to the training dataset while distilling. After distillation, we use the distilled model to generate fair synthetic samples. Our extensive experiments show that our synthetic samples outperform those of state-of-the-art models on all three criteria (fairness, utility and synthetic quality), with performance increases of 5% for fairness, 8% for utility and 12% for synthetic quality on both tabular and image datasets.
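The pipeline above (a teacher that learns fair representations, a smaller student distilled from it without access to the training data, and a decoder that maps the distilled representations back to samples) can be illustrated with a minimal sketch. This is not the authors' implementation: the module sizes, the noise-query distillation objective and the stand-in decoder are assumptions made for illustration only.

```python
# Minimal, illustrative sketch of a Fair4Free-style pipeline:
#  (1) a teacher encoder assumed to produce fair latent representations,
#  (2) data-free distillation of a smaller student (no access to real data),
#  (3) a decoder that turns distilled representations into synthetic samples.
# All names, sizes and objectives are assumptions, not the authors' code.
import torch
import torch.nn as nn

input_dim, latent_dim = 32, 8   # hypothetical tabular feature / latent sizes

# Stage 1: teacher encoder (assumed pre-trained elsewhere to yield fair representations).
teacher = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Stage 2: data-free distillation; the student never sees the training data.
# Here the teacher is queried with surrogate inputs drawn from a simple prior;
# other data-free schemes use a learned generator instead.
student = nn.Sequential(nn.Linear(input_dim, 16), nn.ReLU(), nn.Linear(16, latent_dim))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for step in range(1000):
    surrogate = torch.randn(128, input_dim)          # synthetic queries, not real records
    with torch.no_grad():
        target = teacher(surrogate)                  # teacher's fair representation
    loss = nn.functional.mse_loss(student(surrogate), target)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 3: decode distilled representations into synthetic samples
# (a decoder of this kind would be trained jointly with the teacher, e.g. as a VAE).
decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, input_dim))
with torch.no_grad():
    fair_latents = student(torch.randn(256, input_dim))
    synthetic_samples = decoder(fair_latents)        # 256 synthetic tabular rows
```

In the paper the fairness constraint is enforced when training the teacher; the sketch only shows the data flow of data-free distillation and sampling.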
Related papers
- Generating Synthetic Fair Syntax-agnostic Data by Learning and Distilling Fair Representation [4.1942958779358674]
Existing bias-mitigating generative methods need in-processing fairness objectives and fail to consider computational overhead.
We present a fair data generation technique based on knowledge distillation, where we use a small architecture to distill the fair representation in the latent space.
Our approach shows a 5%, 5% and 10% rise in fairness, synthetic sample quality and data utility, respectively, over the state-of-the-art fair generative model.
arXiv Detail & Related papers (2024-08-20T11:37:52Z) - uDistil-Whisper: Label-Free Data Filtering for Knowledge Distillation in Low-Data Regimes [34.947522647009436]
We show that the best distilled models outperform the teacher model by 5-7 WER points and are on par with or outperform similar supervised data filtering setups.
Our models are also 25-50% more compute- and memory-efficient while maintaining performance equal to or better than that of the teacher model.
arXiv Detail & Related papers (2024-07-01T13:07:01Z) - SFDDM: Single-fold Distillation for Diffusion models [4.688721356965585]
We propose a single-fold distillation algorithm, SFDDM, which can flexibly compress the teacher diffusion model into a student model with any desired number of sampling steps.
Experiments on four datasets demonstrate that SFDDM is able to sample high-quality data with steps reduced to as little as approximately 1%.
arXiv Detail & Related papers (2024-05-23T18:11:14Z) - Balanced Mixed-Type Tabular Data Synthesis with Diffusion Models [14.651592234678722]
Current diffusion models tend to inherit bias from the training dataset and generate biased synthetic data.
We introduce a novel model that incorporates sensitive guidance to generate fair synthetic data with balanced joint distributions of the target label and sensitive attributes.
Our method effectively mitigates bias in training data while maintaining the quality of the generated samples.
arXiv Detail & Related papers (2024-04-12T06:08:43Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative for training machine learning models.
However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality [78.6359306550245]
We argue that using just one synthetic subset for distillation will not yield optimal generalization performance.
PDD (Progressive Dataset Distillation) synthesizes multiple small sets of synthetic images, each conditioned on the previous sets, and trains the model on the cumulative union of these subsets (see the sketch below).
Our experiments show that PDD can effectively improve the performance of existing dataset distillation methods by up to 4.3%.
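As a rough illustration of the progressive scheme described in this entry, the following sketch assumes a generic `distill_subset` routine standing in for any single-stage dataset-distillation method; the function names, stage count and subset size are hypothetical.

```python
# Illustrative sketch of progressive, multi-stage dataset distillation (PDD-style):
# synthesize several small synthetic subsets, each conditioned on the union of the
# earlier ones, and train the final model on the growing cumulative union.
from typing import List, Tuple
import torch

def distill_subset(real_loader, condition: List[Tuple[torch.Tensor, torch.Tensor]],
                   subset_size: int) -> Tuple[torch.Tensor, torch.Tensor]:
    """Placeholder: run any dataset-distillation method, treating the already
    distilled data in `condition` as fixed, and return `subset_size` new
    synthetic (images, labels)."""
    images = torch.randn(subset_size, 3, 32, 32)      # stand-in synthetic images
    labels = torch.randint(0, 10, (subset_size,))
    return images, labels

def progressive_distillation(real_loader, num_stages: int = 5, subset_size: int = 100):
    distilled: List[Tuple[torch.Tensor, torch.Tensor]] = []
    for stage in range(num_stages):
        # Each new subset is conditioned on everything distilled so far.
        distilled.append(distill_subset(real_loader, condition=distilled,
                                        subset_size=subset_size))
    # Train on the cumulative union of all subsets.
    images = torch.cat([x for x, _ in distilled])
    labels = torch.cat([y for _, y in distilled])
    return images, labels
```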
arXiv Detail & Related papers (2023-10-10T20:04:44Z) - On the Stability of Iterative Retraining of Generative Models on their own Data [56.153542044045224]
We study the impact of training generative models on mixed datasets.
We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough.
We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-09-30T16:41:04Z) - Feedback-guided Data Synthesis for Imbalanced Classification [10.836265321046561]
We introduce a framework for augmenting static datasets with useful synthetic samples.
We find that the samples must be close to the support of the real data of the task at hand, and be sufficiently diverse.
On ImageNet-LT, we achieve state-of-the-art results, with over 4 percent improvement on underrepresented classes.
arXiv Detail & Related papers (2023-09-29T21:47:57Z) - Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation [96.92250565207017]
We study the data efficiency and selection for the dataset distillation task.
By re-formulating the dynamics of distillation, we provide insight into the inherent redundancy in the real dataset.
We identify the samples that contribute most, based on their causal effects on the distillation.
arXiv Detail & Related papers (2023-05-28T06:53:41Z) - Differentially Private Diffusion Models Generate Useful Synthetic Images [53.94025967603649]
Recent studies have found that, by default, the outputs of some diffusion models do not preserve training data privacy.
By privately fine-tuning ImageNet pre-trained diffusion models with more than 80M parameters, we obtain SOTA results on CIFAR-10 and Camelyon17.
Our results demonstrate that diffusion models fine-tuned with differential privacy can produce useful and provably private synthetic data.
arXiv Detail & Related papers (2023-02-27T15:02:04Z) - Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation (see the sketch below).
arXiv Detail & Related papers (2021-05-18T15:13:00Z)
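Several of the entries above, including the distillation stage of Fair4Free itself, rely on synthesizing useful queries without access to the training set. The sketch below illustrates model inversion with an explicit diversity term in the spirit of CMI; the placeholder teacher, the loss weights and the simple cosine-similarity penalty are assumptions and do not reproduce CMI's contrastive objective.

```python
# Illustrative sketch of data-free model inversion with a diversity term.
# A batch of synthetic inputs is optimized so that (a) the frozen teacher classifies
# them confidently into chosen classes and (b) the batch stays diverse.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # frozen placeholder teacher
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

batch = torch.randn(64, 3, 32, 32, requires_grad=True)             # synthetic inputs to optimize
targets = torch.randint(0, 10, (64,))                               # desired pseudo-labels
opt = torch.optim.Adam([batch], lr=0.05)

for step in range(200):
    logits = teacher(batch)
    ce = F.cross_entropy(logits, targets)                           # teacher-confidence term
    flat = batch.flatten(1)
    # Diversity term: penalize pairs of synthetic samples that are too similar.
    sim = F.cosine_similarity(flat.unsqueeze(1), flat.unsqueeze(0), dim=-1)
    diversity_penalty = (sim - torch.eye(len(flat))).clamp(min=0).mean()
    loss = ce + 0.1 * diversity_penalty
    opt.zero_grad(); loss.backward(); opt.step()

# `batch` can now serve as surrogate data for distilling a student from the teacher.
```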