Dataset Distillation: A Comprehensive Review
- URL: http://arxiv.org/abs/2301.07014v3
- Date: Sat, 7 Oct 2023 12:16:25 GMT
- Title: Dataset Distillation: A Comprehensive Review
- Authors: Ruonan Yu, Songhua Liu, Xinchao Wang
- Abstract summary: Dataset distillation (DD) aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset.
This paper gives a comprehensive review and summary of recent advances in DD and its application.
- Score: 76.26276286545284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent success of deep learning is largely attributed to the sheer amount of
data used for training deep neural networks. Despite the unprecedented success,
the massive data, unfortunately, significantly increases the burden on storage
and transmission and further gives rise to a cumbersome model training process.
Besides, relying on the raw data for training per se raises concerns
about privacy and copyright. To alleviate these shortcomings, dataset
distillation~(DD), also known as dataset condensation (DC), was introduced and
has recently attracted much research attention in the community. Given an
original dataset, DD aims to derive a much smaller dataset containing synthetic
samples, based on which the trained models yield performance comparable with
those trained on the original dataset. In this paper, we give a comprehensive
review and summary of recent advances in DD and its application. We first
introduce the task formally and propose an overall algorithmic framework
followed by all existing DD methods. Next, we provide a systematic taxonomy of
current methodologies in this area, and discuss their theoretical
interconnections. We also present current challenges in DD through extensive
experiments and envision possible directions for future works.
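As a point of reference for the formal task definition mentioned above, DD is usually posed as a bilevel optimization problem; the notation below is an illustrative reconstruction, not a formula quoted from the paper.

```latex
% S: synthetic dataset with |S| << |T|; T: original dataset;
% f_theta: model trained on S; ell: per-sample loss (notation is illustrative).
\mathcal{S}^{*} = \arg\min_{\mathcal{S}}\;
  \mathbb{E}_{(x,y)\sim\mathcal{T}}\!\left[\ell\!\left(f_{\theta(\mathcal{S})}(x),\, y\right)\right]
\quad\text{s.t.}\quad
\theta(\mathcal{S}) = \arg\min_{\theta} \sum_{(s,t)\in\mathcal{S}} \ell\!\left(f_{\theta}(s),\, t\right)
```

The surveyed methods chiefly differ in how they approximate the expensive outer objective, for instance by matching gradients, feature distributions, or training trajectories between models trained on the synthetic and on the original data.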
Related papers
- Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks [10.932880269282014]
We propose the first effective DD method for SSL pre-training.
Specifically, we train a small student model to match the representations of a larger teacher model trained with SSL.
As the KD objective has considerably lower variance than SSL, our approach can generate synthetic datasets that can successfully pre-train high-quality encoders.
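A minimal sketch of the flavor of this objective, assuming a frozen SSL-style teacher and a learnable set of synthetic images, is given below; it matches the teacher's mean features on synthetic and real data and is not the paper's exact student-training pipeline (encoder, shapes, and hyper-parameters are placeholders).

```python
# Hypothetical sketch: optimize synthetic images against a frozen teacher's
# features as a low-variance regression target. The teacher stands in for an
# SSL-pretrained encoder; all shapes and hyper-parameters are assumptions.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

teacher = resnet18(weights=None)      # placeholder for an SSL-pretrained encoder
teacher.fc = torch.nn.Identity()      # expose 512-d representations
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

real_images = torch.randn(64, 3, 32, 32)                      # stand-in real batch
syn_images = torch.randn(16, 3, 32, 32, requires_grad=True)   # learnable synthetic set
opt = torch.optim.Adam([syn_images], lr=0.1)

with torch.no_grad():
    target = teacher(real_images).mean(dim=0)   # mean teacher feature on real data

for step in range(100):
    opt.zero_grad()
    loss = F.mse_loss(teacher(syn_images).mean(dim=0), target)
    loss.backward()   # gradients flow through the frozen teacher to the pixels
    opt.step()
```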
arXiv Detail & Related papers (2024-10-03T00:39:25Z)
- Dataset Distillation from First Principles: Integrating Core Information Extraction and Purposeful Learning [10.116674195405126]
We argue that a precise characterization of the underlying optimization problem must specify the inference task associated with the application of interest.
Our formalization reveals novel applications of DD across different modeling environments.
We present numerical results for two case studies important in contemporary settings.
arXiv Detail & Related papers (2024-09-02T18:11:15Z)
- Not All Samples Should Be Utilized Equally: Towards Understanding and Improving Dataset Distillation [57.6797306341115]
We take an initial step towards understanding various matching-based DD methods from the perspective of sample difficulty.
We then extend the neural scaling laws of data pruning to DD to theoretically explain these matching-based methods.
We introduce the Sample Difficulty Correction (SDC) approach, designed to predominantly generate easier samples to achieve higher dataset quality.
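One way to read the "easier samples" idea is as a reweighting of the matching target by a per-sample difficulty score; the sketch below uses loss magnitude under a proxy network as that score, which is an assumption rather than the paper's exact SDC rule.

```python
# Hypothetical difficulty-based weighting: lower-loss (easier) real samples get
# larger weight when forming the statistics that synthetic data is matched to.
# The scoring rule and temperature are illustrative, not the paper's SDC.
import torch
import torch.nn.functional as F

def easy_sample_weights(proxy_net, images, labels, temperature=1.0):
    """Normalized weights that emphasize low-loss (easy) samples."""
    with torch.no_grad():
        per_sample_loss = F.cross_entropy(proxy_net(images), labels, reduction="none")
    # Softmax over the negative loss: easy samples receive the largest weights.
    return torch.softmax(-per_sample_loss / temperature, dim=0)

# Usage idea (names hypothetical): weights = easy_sample_weights(proxy_net, real_x, real_y)
# weighted_target = (weights[:, None] * real_features).sum(dim=0)
```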
arXiv Detail & Related papers (2024-08-22T15:20:32Z)
- Behaviour Distillation [10.437472004180883]
We formalize behaviour distillation, a setting that aims to discover and condense information required for training an expert policy into a synthetic dataset.
We then introduce Hallucinating datasets with Evolution Strategies (HaDES), a method for behaviour distillation that can discover datasets of just four state-action pairs.
We show that these datasets generalize out of distribution to training policies with a wide range of architectures.
We also demonstrate application to a downstream task, namely training multi-task agents in a zero-shot fashion.
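The sketch below illustrates the evolution-strategies angle under simplifying assumptions: a four-pair synthetic dataset is evolved so that a linear policy fit to it imitates an expert, with a held-out set of expert pairs standing in for environment return so no simulator is needed. None of this is HaDES itself.

```python
# Hypothetical evolution-strategies behaviour distillation sketch. Real HaDES
# scores candidates by environment return; here imitation error on held-out
# expert pairs is used instead. Dimensions and hyper-parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, N_SYN = 4, 2, 4                 # "just four state-action pairs"
expert_W = rng.normal(size=(STATE_DIM, ACTION_DIM))    # pretend expert policy
held_out_states = rng.normal(size=(256, STATE_DIM))
held_out_actions = held_out_states @ expert_W

def fit_policy(syn):
    """Behaviour cloning: least-squares fit of a linear policy to the synthetic pairs."""
    states, actions = syn[:, :STATE_DIM], syn[:, STATE_DIM:]
    W, *_ = np.linalg.lstsq(states, actions, rcond=None)
    return W

def fitness(syn):
    pred = held_out_states @ fit_policy(syn)
    return -np.mean((pred - held_out_actions) ** 2)    # higher is better

# Simple OpenAI-ES-style update over the flattened synthetic dataset.
theta = rng.normal(size=N_SYN * (STATE_DIM + ACTION_DIM))
sigma, lr, pop = 0.1, 0.05, 64
for gen in range(200):
    noise = rng.normal(size=(pop, theta.size))
    scores = np.array([fitness((theta + sigma * n).reshape(N_SYN, -1)) for n in noise])
    advantage = (scores - scores.mean()) / (scores.std() + 1e-8)
    theta += lr / (pop * sigma) * noise.T @ advantage
```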
arXiv Detail & Related papers (2024-06-21T10:45:43Z)
- Group Distributionally Robust Dataset Distillation with Risk Minimization [18.07189444450016]
We introduce an algorithm that combines clustering with the minimization of a risk measure on the loss to conduct DD.
We demonstrate its effective generalization and robustness across subgroups through numerical experiments.
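A rough sketch of combining clustering with a risk measure is given below, using CVaR over per-cluster matching losses; the choice of CVaR, the feature-mean matching loss, and the clustering step are assumptions, not the paper's algorithm.

```python
# Hypothetical group-robust distillation loss: real data is grouped into clusters,
# a matching loss is computed per cluster, and the worst clusters dominate the
# objective via CVaR instead of a plain average. All pieces are placeholders.
import torch
import torch.nn.functional as F

def cvar(losses, alpha=0.3):
    """Average of the worst alpha-fraction of per-cluster losses."""
    k = max(1, int(alpha * losses.numel()))
    return torch.topk(losses, k).values.mean()

def group_robust_matching_loss(syn_features, real_features, cluster_ids, n_clusters):
    syn_mean = syn_features.mean(dim=0)
    per_cluster = []
    for c in range(n_clusters):
        real_c = real_features[cluster_ids == c]
        if real_c.numel() == 0:
            continue
        per_cluster.append(F.mse_loss(syn_mean, real_c.mean(dim=0)))
    return cvar(torch.stack(per_cluster))

# Usage idea: cluster_ids from k-means on real_features; backpropagate this loss
# to the synthetic images as in standard distribution-matching DD.
```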
arXiv Detail & Related papers (2024-02-07T09:03:04Z)
- Importance-Aware Adaptive Dataset Distillation [53.79746115426363]
The development of deep learning models is enabled by the availability of large-scale datasets.
Dataset distillation aims to synthesize a compact dataset that retains the essential information from the large original dataset.
We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance.
arXiv Detail & Related papers (2024-01-29T03:29:39Z)
- Can pre-trained models assist in dataset distillation? [21.613468512330442]
Pre-trained Models (PTMs) function as knowledge repositories, containing extensive information from the original dataset.
This naturally raises a question: Can PTMs effectively transfer knowledge to synthetic datasets, guiding DD accurately?
We systematically study different options in PTMs, including initialization parameters, model architecture, training epoch and domain knowledge.
arXiv Detail & Related papers (2023-10-05T03:51:21Z)
- A Comprehensive Survey of Dataset Distillation [73.15482472726555]
Deep learning technology has developed at an unprecedented pace in the last decade.
It has become challenging to handle the unlimited growth of data with limited computing power.
This paper provides a holistic understanding of dataset distillation from multiple aspects.
arXiv Detail & Related papers (2023-01-13T15:11:38Z)
- Learning to Generate Synthetic Training Data using Gradient Matching and Implicit Differentiation [77.34726150561087]
This article explores various data distillation techniques that can reduce the amount of data required to successfully train deep networks.
Inspired by recent ideas, we suggest new data distillation techniques based on generative teaching networks, gradient matching, and the Implicit Function Theorem.
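For the gradient-matching technique named above, a generic single-step sketch looks roughly as follows; the network, data shapes, and distance are illustrative assumptions rather than the article's exact recipe.

```python
# Hypothetical single-step gradient matching: synthetic images are updated so the
# gradient they induce in a small network tracks the gradient induced by real data.
import torch
import torch.nn.functional as F

net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
real_x, real_y = torch.randn(128, 3, 32, 32), torch.randint(0, 10, (128,))
syn_x = torch.randn(10, 3, 32, 32, requires_grad=True)
syn_y = torch.arange(10)                     # one synthetic image per class
opt = torch.optim.SGD([syn_x], lr=0.1)

def grads(x, y):
    loss = F.cross_entropy(net(x), y)
    return torch.autograd.grad(loss, net.parameters(), create_graph=True)

for step in range(50):
    opt.zero_grad()
    g_real = [g.detach() for g in grads(real_x, real_y)]
    g_syn = grads(syn_x, syn_y)
    match = sum(F.mse_loss(gs, gr) for gs, gr in zip(g_syn, g_real))
    match.backward()                         # second-order grads reach syn_x
    opt.step()
```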
arXiv Detail & Related papers (2022-03-16T11:45:32Z)
- Learning to Count in the Crowd from Limited Labeled Data [109.2954525909007]
We focus on reducing the annotation effort by learning to count in the crowd from a limited number of labeled samples.
Specifically, we propose a Gaussian Process-based iterative learning mechanism that involves estimation of pseudo-ground truth for the unlabeled data.
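A highly simplified sketch of Gaussian-process pseudo-ground-truth estimation is given below, regressing image-level counts from features of the few labeled samples; the feature space, kernel, and confidence filter are assumptions, not the paper's iterative mechanism.

```python
# Hypothetical GP pseudo-labeling sketch: fit a Gaussian process on features of
# labeled crowd images, predict counts (with uncertainty) for unlabeled images,
# and keep the confident predictions as pseudo ground truth for further training.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
labeled_feats = rng.normal(size=(20, 64))        # placeholder features of labeled images
labeled_counts = rng.uniform(10, 200, size=20)   # their annotated head counts
unlabeled_feats = rng.normal(size=(500, 64))

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
gp.fit(labeled_feats, labeled_counts)
pseudo_counts, pseudo_std = gp.predict(unlabeled_feats, return_std=True)
confident = pseudo_counts[pseudo_std < pseudo_std.mean()]   # keep low-uncertainty labels
```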
arXiv Detail & Related papers (2020-07-07T04:17:01Z)