FairDD: Fair Dataset Distillation via Synchronized Matching
- URL: http://arxiv.org/abs/2411.19623v1
- Date: Fri, 29 Nov 2024 11:22:20 GMT
- Title: FairDD: Fair Dataset Distillation via Synchronized Matching
- Authors: Qihang Zhou, Shenhao Fang, Shibo He, Wenchao Meng, Jiming Chen
- Abstract summary: We propose a novel fair dataset distillation (FDD) framework, namely FairDD, which can be seamlessly applied to diverse matching-based DD approaches.
The key innovation of FairDD lies in synchronously matching synthetic datasets to PA-wise groups of original datasets.
We show that FairDD significantly improves fairness compared to vanilla DD methods, without sacrificing classification accuracy.
- Score: 13.60524473223155
- Abstract: Condensing large datasets into smaller synthetic counterparts has demonstrated its promise for image classification. However, previous research has overlooked a crucial concern in image recognition: ensuring that models trained on condensed datasets are unbiased towards protected attributes (PA), such as gender and race. Our investigation reveals that dataset distillation (DD) fails to alleviate the unfairness towards minority groups present in the original datasets, and that this bias typically worsens in the condensed datasets due to their smaller size. To bridge this research gap, we propose a novel fair dataset distillation (FDD) framework, namely FairDD, which can be seamlessly applied to diverse matching-based DD approaches without modifying their original architectures. The key innovation of FairDD lies in synchronously matching synthetic datasets to PA-wise groups of the original dataset, rather than indiscriminately aligning to the whole distribution, which is dominated by majority groups, as vanilla DD does. This synchronized matching prevents synthetic datasets from collapsing onto majority groups and bootstraps balanced generation across all PA groups. Consequently, FairDD effectively regularizes vanilla DD to steer generation toward minority groups while maintaining the accuracy of the target attributes. Theoretical analyses and extensive experimental evaluations demonstrate that FairDD significantly improves fairness compared to vanilla DD methods without sacrificing classification accuracy. Its consistent superiority across diverse DD approaches, spanning Distribution Matching and Gradient Matching, establishes it as a versatile FDD approach.
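To make the synchronized-matching idea concrete, below is a minimal sketch (not the authors' code) contrasting a vanilla Distribution Matching loss with a FairDD-style PA-wise loss; the `feature_net` embedding, the uniform averaging over groups, and all names are illustrative assumptions.

```python
# Hedged sketch of synchronized PA-wise matching vs. whole-distribution
# matching. `feature_net`, the group weighting, and names are assumptions.
import torch

def vanilla_dm_loss(feature_net, real_x, syn_x):
    # Vanilla DM: align synthetic features with the *whole* real
    # distribution; majority PA groups dominate this mean.
    return (feature_net(real_x).mean(0) - feature_net(syn_x).mean(0)).pow(2).sum()

def fairdd_loss(feature_net, real_x, pa_labels, syn_x):
    # FairDD-style synchronized matching: align the same synthetic batch
    # with *each* protected-attribute (PA) group separately, so no single
    # group's statistics dominate the condensed set.
    syn_mean = feature_net(syn_x).mean(0)
    groups = pa_labels.unique()
    loss = 0.0
    for g in groups:
        group_mean = feature_net(real_x[pa_labels == g]).mean(0)
        loss = loss + (group_mean - syn_mean).pow(2).sum()
    return loss / len(groups)
```

Because every PA group contributes an equal term, the synthetic set cannot reduce the loss by fitting only the majority group.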
Related papers
- Dataset Distillation via Committee Voting [21.018818924580877]
We introduce Committee Voting for Dataset Distillation (CV-DD).
CV-DD is a novel approach that leverages the collective wisdom of multiple models or experts to create high-quality distilled datasets.
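As a hedged illustration of the committee idea (not the paper's actual aggregation rule), the sketch below averages temperature-scaled soft predictions from several teacher models to supervise distilled images; `teachers` and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def committee_soft_labels(teachers, syn_x, temperature=2.0):
    # Each committee member "votes" with temperature-scaled probabilities;
    # distilled images are then supervised with the averaged vote.
    probs = [F.softmax(t(syn_x) / temperature, dim=-1) for t in teachers]
    return torch.stack(probs).mean(dim=0)
```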
arXiv Detail & Related papers (2025-01-13T18:59:48Z)
- Dataset Distillation from First Principles: Integrating Core Information Extraction and Purposeful Learning [10.116674195405126]
We argue that a precise characterization of the underlying optimization problem must specify the inference task associated with the application of interest.
Our formalization reveals novel applications of DD across different modeling environments.
We present numerical results for two case studies important in contemporary settings.
arXiv Detail & Related papers (2024-09-02T18:11:15Z)
- Distilling Long-tailed Datasets [13.330572317331198]
We propose a novel long-tailed dataset distillation method, Long-tailed dataset Aware distillation (LAD).
LAD reduces the distance between the student and the biased expert trajectories and prevents the tail class bias from being distilled to the synthetic dataset.
This work pioneers the field of long-tailed dataset distillation (LTDD), marking the first effective effort to distill long-tailed datasets.
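The summary above mentions matching student and expert trajectories; below is a minimal sketch of the standard trajectory-matching objective such methods build on. LAD's actual long-tail correction is not shown, and all names are illustrative.

```python
import torch

def trajectory_match_loss(student_params, expert_start, expert_end):
    # Match the student's parameters (after training on the synthetic set)
    # to the expert's end-of-segment parameters, normalized by how far the
    # expert moved over the same segment.
    num = sum((s - e).pow(2).sum() for s, e in zip(student_params, expert_end))
    den = sum((a - e).pow(2).sum() for a, e in zip(expert_start, expert_end))
    return num / den
```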
arXiv Detail & Related papers (2024-08-24T15:36:36Z)
- Not All Samples Should Be Utilized Equally: Towards Understanding and Improving Dataset Distillation [57.6797306341115]
We take an initial step towards understanding various matching-based DD methods from the perspective of sample difficulty.
We then extend the neural scaling laws of data pruning to DD to theoretically explain these matching-based methods.
We introduce the Sample Difficulty Correction (SDC) approach, designed to predominantly generate easier samples to achieve higher dataset quality.
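A hedged sketch of the sample-difficulty intuition: down-weight hard real samples when computing matching targets so the synthetic set is pulled toward easier examples. The softmax weighting and the difficulty proxy are our assumptions, not the paper's formulation.

```python
import torch

def difficulty_weighted_mean(features, per_sample_difficulty, beta=1.0):
    # Easier samples (low difficulty) receive larger weights; `beta`
    # controls how sharply hard samples are suppressed (assumed scheme).
    w = torch.softmax(-beta * per_sample_difficulty, dim=0)
    return (w.unsqueeze(-1) * features).sum(0)
```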
arXiv Detail & Related papers (2024-08-22T15:20:32Z)
- Exploring the Impact of Dataset Bias on Dataset Distillation [10.742404631413029]
We investigate the influence of dataset bias on Dataset Distillation (DD).
DD is a technique to synthesize a smaller dataset that preserves essential information from the original dataset.
Experiments demonstrate that biases present in the original dataset significantly impact the performance of the synthetic dataset.
arXiv Detail & Related papers (2024-03-24T06:10:22Z)
- DreamDA: Generative Data Augmentation with Diffusion Models [68.22440150419003]
This paper proposes a new classification-oriented framework DreamDA.
DreamDA generates diverse samples that adhere to the original data distribution by considering training images in the original data as seeds.
In addition, since the labels of the generated data may not align with the labels of their corresponding seed images, we introduce a self-training paradigm for generating pseudo labels.
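A minimal sketch of the self-training step described above: pseudo-label generated images with the current classifier and keep only confident ones. The 0.9 threshold is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def pseudo_label(classifier, gen_x, threshold=0.9):
    # Label generated images with the current classifier, since they may
    # not share their seed image's label; keep only confident predictions.
    with torch.no_grad():
        probs = F.softmax(classifier(gen_x), dim=-1)
    conf, labels = probs.max(dim=-1)
    keep = conf >= threshold
    return gen_x[keep], labels[keep]
```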
arXiv Detail & Related papers (2024-03-19T15:04:35Z)
- Towards Trustworthy Dataset Distillation [26.361077372859498]
Dataset distillation (DD) endeavors to reduce training costs by distilling a large dataset into a tiny synthetic one.
We propose a novel paradigm called Trustworthy Dataset Distillation (TrustDD).
By distilling both InD samples and outliers, the condensed datasets are capable of training models competent in both InD classification and OOD detection.
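As a hedged sketch of the training signal such a condensed set is meant to support, the snippet below combines cross-entropy on InD samples with an outlier-exposure term that pushes distilled outliers toward a uniform prediction; the uniform-target loss and `lam` are assumptions.

```python
import torch
import torch.nn.functional as F

def ind_plus_outlier_loss(model, ind_x, ind_y, out_x, lam=0.5):
    # Standard classification loss on in-distribution (InD) samples.
    ce = F.cross_entropy(model(ind_x), ind_y)
    # Outlier exposure: cross-entropy against the uniform distribution
    # (up to a constant), so outliers get low-confidence predictions.
    out_logp = F.log_softmax(model(out_x), dim=-1)
    oe = -out_logp.mean()
    return ce + lam * oe
```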
arXiv Detail & Related papers (2023-07-18T11:43:01Z)
- Chasing Fairness Under Distribution Shift: A Model Weight Perturbation Approach [72.19525160912943]
We first theoretically demonstrate the inherent connection between distribution shift, data perturbation, and model weight perturbation.
We then analyze the sufficient conditions to guarantee fairness for the target dataset.
Motivated by these sufficient conditions, we propose robust fairness regularization (RFR).
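A minimal sketch of the weight-perturbation idea (SAM-style), assuming a demographic-parity gap as the fairness penalty: evaluate the penalty at adversarially perturbed weights, so fairness is robust to the weight shifts that distribution shift induces. The binary-classifier setup and all names are illustrative, not the paper's exact method.

```python
import torch

def fairness_gap(model, x, pa):
    # Demographic parity gap between two PA groups (assumed penalty).
    p = torch.sigmoid(model(x)).squeeze(-1)
    return (p[pa == 0].mean() - p[pa == 1].mean()).abs()

def perturbed_fairness_penalty(model, x, pa, rho=0.05):
    gap = fairness_gap(model, x, pa)
    grads = torch.autograd.grad(gap, list(model.parameters()))
    scale = rho / (torch.cat([g.flatten() for g in grads]).norm() + 1e-12)
    with torch.no_grad():
        for p_, g in zip(model.parameters(), grads):
            p_.add_(scale * g)          # ascend to worst-case nearby weights
    penalty = fairness_gap(model, x, pa)  # fairness gap at perturbed weights
    with torch.no_grad():
        for p_, g in zip(model.parameters(), grads):
            p_.sub_(scale * g)          # restore the original weights
    return penalty
```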
arXiv Detail & Related papers (2023-03-06T17:19:23Z)
- Consistent Diffusion Models: Mitigating Sampling Drift by Learning to be Consistent [97.64313409741614]
We propose to enforce a consistency property, which states that the model's predictions on its own generated data are consistent across time.
We show that our novel training objective yields state-of-the-art results for conditional and unconditional generation on CIFAR-10 and baseline improvements on AFHQ and FFHQ.
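A hedged sketch of that consistency property, assuming a denoiser interface `denoise(x_t, t) -> x0_hat` and a variance-exploding-style transition: predictions of the clean image at two points along the model's own trajectory are pulled together.

```python
import torch

def consistency_loss(denoise, x_t, t, s):
    # Prediction of the clean image at time t ...
    x0_from_t = denoise(x_t, t)
    # ... should agree with the prediction after stepping to an earlier
    # time s along the model's own trajectory (VE-style interpolation,
    # an assumed parametrization; stop-gradient on the target).
    x_s = x0_from_t + (s / t) * (x_t - x0_from_t)
    with torch.no_grad():
        x0_from_s = denoise(x_s, s)
    return (x0_from_t - x0_from_s).pow(2).mean()
```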
arXiv Detail & Related papers (2023-02-17T18:45:04Z)
- Dataset Distillation: A Comprehensive Review [76.26276286545284]
Dataset distillation (DD) aims to derive a much smaller dataset of synthetic samples such that models trained on it achieve performance comparable to models trained on the original dataset.
This paper gives a comprehensive review and summary of recent advances in DD and its application.
arXiv Detail & Related papers (2023-01-17T17:03:28Z)
- DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks [71.6879432974126]
We introduce DECAF: a GAN-based fair synthetic data generator for tabular data.
We show that DECAF successfully removes undesired bias and is capable of generating high-quality synthetic data.
We provide theoretical guarantees on the generator's convergence and the fairness of downstream models.
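A toy sketch of the causally-aware idea: generate variables along a causal graph and drop the direct edge from the protected attribute to the outcome. The linear structural equations are purely illustrative, not DECAF's GAN architecture.

```python
import torch

def generate_fair_sample(n, drop_pa_to_y=True):
    # Tiny assumed structural model: PA -> X -> Y, with an optional
    # direct PA -> Y edge that is removed for fair generation.
    pa = torch.bernoulli(torch.full((n,), 0.5))      # protected attribute
    x = 0.8 * pa + torch.randn(n)                    # feature influenced by PA
    y_logit = 1.5 * x + (0.0 if drop_pa_to_y else 2.0) * pa
    y = torch.bernoulli(torch.sigmoid(y_logit))      # outcome without direct PA edge
    return pa, x, y
```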
arXiv Detail & Related papers (2021-10-25T12:39:56Z)