Group Distributionally Robust Dataset Distillation with Risk Minimization
- URL: http://arxiv.org/abs/2402.04676v2
- Date: Mon, 11 Mar 2024 06:56:54 GMT
- Title: Group Distributionally Robust Dataset Distillation with Risk Minimization
- Authors: Saeed Vahidian, Mingyu Wang, Jianyang Gu, Vyacheslav Kungurtsev, Wei Jiang, Yiran Chen
- Abstract summary: We introduce an algorithm that combines clustering with the minimization of a risk measure on the loss to conduct DD.
We demonstrate its effective generalization and robustness across subgroups through numerical experiments.
- Score: 18.07189444450016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dataset distillation (DD) has emerged as a widely adopted technique for
crafting a synthetic dataset that captures the essential information of a
training dataset, facilitating the training of accurate neural models. Its
applications span various domains, including transfer learning, federated
learning, and neural architecture search. The most popular methods for
constructing the synthetic data rely on matching the convergence properties of
training the model with the synthetic dataset and the training dataset.
However, targeting the training dataset should be viewed as auxiliary, in the
same sense that the training set is itself only an approximate substitute for
the population distribution, which is the data of actual interest. Yet despite
its popularity, an aspect that remains unexplored is the relationship of DD to
its generalization, particularly across uncommon subgroups. That is, how can we
ensure that a model trained on the synthetic dataset performs well when faced
with samples from regions with low population density? Here, the
representativeness and coverage of the dataset become salient over the
guaranteed training error at inference. Drawing inspiration from
distributionally robust optimization, we introduce an algorithm that combines
clustering with the minimization of a risk measure on the loss to conduct DD.
We provide a theoretical rationale for our approach and demonstrate its
effective generalization and robustness across subgroups through numerical
experiments. The source code is available at
https://github.com/Mming11/RobustDatasetDistillation.
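The abstract states the recipe only at a high level: cluster the real data into subgroups and optimize the synthetic set against a risk measure over subgroup losses rather than the average loss. Below is a minimal, self-contained sketch of that idea, assuming a CVaR-style risk measure, pre-computed clusters, and a single differentiable training step of a linear model; the authors' actual objective, model, and risk measure may differ and are in the repository above.

```python
import torch
import torch.nn.functional as F

def cvar(group_losses: torch.Tensor, alpha: float = 0.3) -> torch.Tensor:
    # Conditional value-at-risk: mean of the worst alpha-fraction of subgroup losses.
    k = max(1, int(alpha * group_losses.numel()))
    worst, _ = torch.topk(group_losses, k)
    return worst.mean()

# Hypothetical setup: a learnable synthetic set and real data pre-clustered into subgroups.
d, n_classes, n_syn = 32, 2, 10
syn_x = torch.randn(n_syn, d, requires_grad=True)
syn_y = torch.randint(0, n_classes, (n_syn,))
clusters = [(torch.randn(64, d), torch.randint(0, n_classes, (64,))) for _ in range(5)]

opt = torch.optim.Adam([syn_x], lr=0.1)
for step in range(200):
    # Differentiable inner step: one gradient step of a linear model on the synthetic
    # data, so the trained weights W_trained depend on syn_x.
    W = torch.zeros(d, n_classes, requires_grad=True)
    inner_loss = F.cross_entropy(syn_x @ W, syn_y)
    (gW,) = torch.autograd.grad(inner_loss, W, create_graph=True)
    W_trained = W - 0.5 * gW

    # Outer objective: a risk measure over per-cluster losses instead of their mean,
    # which emphasizes poorly covered subgroups of the population.
    group_losses = torch.stack([F.cross_entropy(x @ W_trained, y) for x, y in clusters])
    risk = cvar(group_losses, alpha=0.3)

    opt.zero_grad()
    risk.backward()
    opt.step()
```

Replacing cvar with a plain mean would recover ordinary average-loss matching; the risk measure is what biases the synthetic data toward poorly represented subgroups.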
Related papers
- Federated Learning with Projected Trajectory Regularization [65.6266768678291]
Federated learning enables joint training of machine learning models from distributed clients without sharing their local data.
One key challenge in federated learning is to handle non-identically distributed data across the clients.
We propose a novel federated learning framework with projected trajectory regularization (FedPTR) for tackling this data heterogeneity issue.
arXiv Detail & Related papers (2023-12-22T02:12:08Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods while requiring far fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
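The distribution-matching entry above only states the idea. Here is a minimal sketch, under the assumption that the synthetic set is updated so that its mean embeddings match those of real batches under freshly sampled random feature extractors; class-conditional matching and the paper's specific improvements are omitted.

```python
import torch
import torch.nn as nn

def random_embedder(in_dim: int = 784, feat_dim: int = 128) -> nn.Module:
    # A freshly initialized (untrained) feature extractor, resampled every step.
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))

syn = torch.randn(100, 784, requires_grad=True)    # learnable synthetic set
opt = torch.optim.SGD([syn], lr=1.0)

def dm_step(real_batch: torch.Tensor) -> float:
    f = random_embedder()
    for p in f.parameters():
        p.requires_grad_(False)                     # only the synthetic data is optimized
    # Match mean embeddings of real and synthetic data (an MMD-style moment match).
    loss = (f(real_batch).mean(0) - f(syn).mean(0)).pow(2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

for _ in range(10):                                 # dummy "real" batches for illustration
    dm_step(torch.randn(256, 784))
```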
- MissDiff: Training Diffusion Models on Tabular Data with Missing Values [29.894691645801597]
This work presents a unified and principled diffusion-based framework for learning from data with missing values.
We first observe that the widely adopted "impute-then-generate" pipeline may lead to a biased learning objective.
We prove the proposed method is consistent in learning the score of data distributions, and the proposed training objective serves as an upper bound for the negative likelihood in certain cases.
arXiv Detail & Related papers (2023-07-02T03:49:47Z)
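A minimal sketch of the masked training idea summarized in the MissDiff entry above, assuming a toy DDPM-style noise-prediction objective where the regression loss is restricted to observed entries; the actual MissDiff objective, noise schedule, and network are not reproduced here.

```python
import torch
import torch.nn as nn

denoiser = nn.Sequential(nn.Linear(8 + 1, 64), nn.ReLU(), nn.Linear(64, 8))

def masked_diffusion_loss(x: torch.Tensor, observed_mask: torch.Tensor, T: int = 1000) -> torch.Tensor:
    # x: (B, 8) rows with missing entries filled by zeros; observed_mask: (B, 8) in {0, 1}.
    t = torch.randint(1, T, (x.size(0), 1)).float() / T
    noise = torch.randn_like(x)
    alpha_bar = torch.cos(t * torch.pi / 2) ** 2            # toy noise schedule
    x_t = alpha_bar.sqrt() * x + (1 - alpha_bar).sqrt() * noise
    pred = denoiser(torch.cat([x_t, t], dim=1))
    # Noise-matching loss computed only on observed coordinates, so missing values
    # never enter the regression target.
    return ((pred - noise).pow(2) * observed_mask).sum() / observed_mask.sum()

# Usage with dummy tabular data and a random missingness mask:
x = torch.randn(32, 8)
mask = (torch.rand(32, 8) > 0.3).float()
loss = masked_diffusion_loss(x * mask, mask)
loss.backward()
```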
- Exploring Data Redundancy in Real-world Image Classification through Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
arXiv Detail & Related papers (2023-06-25T03:31:05Z)
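As a rough illustration of the second valuation metric mentioned in the entry above, here is a minimal sketch (an assumption, not the paper's exact metric) that scores each example by the norm of its individual loss gradient and keeps the highest-scoring half; the Synaptic Intelligence variant and the clustering-based online/offline selection are omitted.

```python
import torch
import torch.nn.functional as F

def grad_norm_scores(model: torch.nn.Module, xs: torch.Tensor, ys: torch.Tensor) -> torch.Tensor:
    # Value of each example = norm of the parameter gradient of its individual loss.
    params = [p for p in model.parameters() if p.requires_grad]
    scores = []
    for x, y in zip(xs, ys):
        loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        scores.append(torch.sqrt(sum(g.pow(2).sum() for g in grads)))
    return torch.stack(scores)

# Toy usage: keep the top half of a dataset by gradient-norm value.
model = torch.nn.Linear(32, 4)
xs, ys = torch.randn(100, 32), torch.randint(0, 4, (100,))
keep_idx = grad_norm_scores(model, xs, ys).argsort(descending=True)[:50]
```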
- Building Manufacturing Deep Learning Models with Minimal and Imbalanced Training Data Using Domain Adaptation and Data Augmentation [15.333573151694576]
We propose a novel domain adaptation (DA) approach to address the problem of labeled training data scarcity for a target learning task.
Our approach works for scenarios where the source dataset and the dataset available for the target learning task have the same or different feature spaces.
We evaluate our combined approach using image data for wafer defect prediction.
arXiv Detail & Related papers (2023-05-31T21:45:34Z)
- Dataset Distillation: A Comprehensive Review [76.26276286545284]
Dataset distillation (DD) aims to derive a much smaller dataset of synthetic samples such that models trained on it yield performance comparable to models trained on the original dataset.
This paper gives a comprehensive review and summary of recent advances in DD and its applications.
arXiv Detail & Related papers (2023-01-17T17:03:28Z)
- Inducing Data Amplification Using Auxiliary Datasets in Adversarial Training [7.513100214864646]
We propose a biased multi-domain adversarial training (BiaMAT) method that induces training data amplification on a primary dataset.
The proposed method can achieve increased adversarial robustness on a primary dataset by leveraging auxiliary datasets.
arXiv Detail & Related papers (2022-09-27T09:21:40Z)
- Dataset Distillation by Matching Training Trajectories [75.9031209877651]
We propose a new formulation that optimizes our distilled data to guide networks to a similar state as those trained on real data.
Given a network, we train it for several iterations on our distilled data and optimize the distilled data with respect to the distance between the synthetically trained parameters and the parameters trained on real data.
Our method handily outperforms existing methods and also allows us to distill higher-resolution visual data.
arXiv Detail & Related papers (2022-03-22T17:58:59Z)
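A minimal sketch of the trajectory-matching objective described in the entry above, assuming a linear model, random tensors standing in for two recorded expert checkpoints, and a short unrolled student trajectory; the authors' method operates on real expert trajectories of deep networks.

```python
import torch
import torch.nn.functional as F

d, c = 64, 10
syn_x = torch.randn(50, d, requires_grad=True)     # learnable distilled data
syn_y = torch.arange(50) % c
opt = torch.optim.SGD([syn_x], lr=0.1)

# Stand-ins for two checkpoints recorded along an expert trajectory on real data.
theta_start = torch.randn(d, c)
theta_target = torch.randn(d, c)

for _ in range(100):
    # Unroll a few differentiable student steps on the distilled data from theta_start.
    theta = theta_start.clone().requires_grad_(True)
    for _ in range(5):
        loss = F.cross_entropy(syn_x @ theta, syn_y)
        (g,) = torch.autograd.grad(loss, theta, create_graph=True)
        theta = theta - 0.01 * g

    # Penalize the distance between the synthetically trained parameters and the
    # later expert checkpoint, normalized by how far the expert itself moved.
    match = (theta - theta_target).pow(2).sum() / (theta_start - theta_target).pow(2).sum()
    opt.zero_grad()
    match.backward()
    opt.step()
```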
- Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
- Learning to Count in the Crowd from Limited Labeled Data [109.2954525909007]
We focus on reducing the annotation effort by learning to count in the crowd from a limited number of labeled samples.
Specifically, we propose a Gaussian Process-based iterative learning mechanism that involves estimation of pseudo-ground truth for the unlabeled data.
arXiv Detail & Related papers (2020-07-07T04:17:01Z)
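To make the pseudo-ground-truth mechanism in the entry above concrete, here is a minimal sketch (an illustrative assumption, not the paper's counting pipeline) of Gaussian Process regression producing pseudo-labels, with uncertainty, for unlabeled samples from a handful of labeled ones.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
labeled_feats = rng.normal(size=(20, 16))           # features of the few labeled samples
labeled_targets = rng.poisson(5.0, size=20)         # e.g. ground-truth crowd counts
unlabeled_feats = rng.normal(size=(200, 16))

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
gp.fit(labeled_feats, labeled_targets)

# Pseudo-ground truth (and its uncertainty) for the unlabeled pool; these can then
# supervise further training rounds in an iterative scheme.
pseudo_targets, pseudo_std = gp.predict(unlabeled_feats, return_std=True)
```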