Label-Augmented Dataset Distillation
- URL: http://arxiv.org/abs/2409.16239v1
- Date: Tue, 24 Sep 2024 16:54:22 GMT
- Title: Label-Augmented Dataset Distillation
- Authors: Seoungyoon Kang, Youngsun Lim, Hyunjung Shim
- Abstract summary: We introduce Label-Augmented Dataset Distillation (LADD) to enhance dataset distillation with label augmentations.
LADD sub-samples each synthetic image, generating additional dense labels to capture rich semantics.
Applied to three high-performance dataset distillation algorithms, LADD achieves an average accuracy gain of 14.9%.
- Score: 13.449340904911725
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional dataset distillation primarily focuses on image representation while often overlooking the important role of labels. In this study, we introduce Label-Augmented Dataset Distillation (LADD), a new framework that enhances dataset distillation with label augmentations. LADD sub-samples each synthetic image, generating additional dense labels to capture rich semantics. These dense labels require only a 2.5% increase in storage (on ImageNet subsets) while providing strong learning signals and significant performance benefits. Our label generation strategy can complement existing dataset distillation methods, significantly enhancing their training efficiency and performance. Experimental results demonstrate that LADD outperforms existing methods in terms of both computational overhead and accuracy. Applied to three high-performance dataset distillation algorithms, LADD achieves an average accuracy gain of 14.9%. Furthermore, the effectiveness of our method is demonstrated across various datasets, distillation hyperparameters, and algorithms. Finally, our method improves the cross-architecture robustness of the distilled dataset, which is important in practical application scenarios.
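To make the core idea concrete, here is a minimal, hypothetical sketch of label augmentation via image sub-sampling: each synthetic image is split into patches, and a pre-trained teacher network assigns a dense soft label to every patch. This is not the authors' implementation; the function name `make_dense_labels`, the patch size, and the use of a toy teacher classifier are illustrative assumptions.

```python
# Illustrative sketch only: attach dense per-patch soft labels to synthetic images.
# Assumes a pre-trained teacher network; not the official LADD implementation.
import torch
import torch.nn.functional as F


def make_dense_labels(synthetic_images: torch.Tensor,
                      teacher: torch.nn.Module,
                      patch_size: int = 16,
                      stride: int = 16) -> torch.Tensor:
    """Sub-sample each synthetic image into patches and label every patch
    with the teacher's softened predictions (one soft label per patch)."""
    b, c, h, w = synthetic_images.shape
    # Extract non-overlapping patches: (B, num_patches, C, patch_size, patch_size)
    patches = (synthetic_images
               .unfold(2, patch_size, stride)
               .unfold(3, patch_size, stride)
               .permute(0, 2, 3, 1, 4, 5)
               .reshape(b, -1, c, patch_size, patch_size))
    num_patches = patches.shape[1]
    with torch.no_grad():
        # The teacher expects full-resolution inputs, so upsample each patch.
        flat = patches.reshape(-1, c, patch_size, patch_size)
        flat = F.interpolate(flat, size=(h, w), mode="bilinear", align_corners=False)
        logits = teacher(flat)                       # (B * num_patches, num_classes)
        soft = F.softmax(logits, dim=-1)
    return soft.reshape(b, num_patches, -1)          # dense labels per image


# Usage sketch: a tiny random "teacher" and a batch of 32x32 synthetic images.
if __name__ == "__main__":
    teacher = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
    images = torch.rand(4, 3, 32, 32)
    dense_labels = make_dense_labels(images, teacher, patch_size=16, stride=16)
    print(dense_labels.shape)  # torch.Size([4, 4, 10])
```

At training time, each patch-label pair acts as an additional supervised example, which is how the dense labels provide extra learning signal at little storage cost.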
Related papers
- Heavy Labels Out! Dataset Distillation with Label Space Lightening [69.67681224137561]
HeLlO builds effective image-to-label projectors, with which synthetic labels can be generated online directly from synthetic images.
We demonstrate that with only about 0.003% of the original storage required for a complete set of soft labels, we achieve comparable performance to current state-of-the-art dataset distillation methods on large-scale datasets.
arXiv Detail & Related papers (2024-08-15T15:08:58Z)
- A Label is Worth a Thousand Images in Dataset Distillation [16.272675455429006]
Data quality is a crucial factor in the performance of machine learning models.
We show that the main factor explaining the performance of state-of-the-art distillation methods is not the techniques used to generate synthetic data but rather the use of soft labels.
arXiv Detail & Related papers (2024-06-15T03:30:29Z)
- Mitigating Bias in Dataset Distillation [62.79454960378792]
We study the impact of bias inside the original dataset on the performance of dataset distillation.
We introduce a simple yet highly effective approach based on a sample reweighting scheme using kernel density estimation; a minimal illustrative sketch of density-based reweighting appears after this list.
arXiv Detail & Related papers (2024-06-06T18:52:28Z)
- ATOM: Attention Mixer for Efficient Dataset Distillation [17.370852204228253]
We propose a module to efficiently distill large datasets using a mixture of channel and spatial-wise attention.
By integrating both types of attention, our ATOM module demonstrates superior performance across various computer vision datasets.
arXiv Detail & Related papers (2024-05-02T15:15:01Z)
- Importance-Aware Adaptive Dataset Distillation [53.79746115426363]
Development of deep learning models is enabled by the availability of large-scale datasets.
Dataset distillation aims to synthesize a compact dataset that retains the essential information of the large original dataset.
We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance.
arXiv Detail & Related papers (2024-01-29T03:29:39Z)
- Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality [78.6359306550245]
We argue that using just one synthetic subset for distillation will not yield optimal generalization performance.
PDD synthesizes multiple small sets of synthetic images, each conditioned on the previous sets, and trains the model on the cumulative union of these subsets.
Our experiments show that PDD can effectively improve the performance of existing dataset distillation methods by up to 4.3%.
arXiv Detail & Related papers (2023-10-10T20:04:44Z)
- Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation [96.92250565207017]
We study the data efficiency and selection for the dataset distillation task.
By re-formulating the dynamics of distillation, we provide insight into the inherent redundancy in the real dataset.
We find the most contributing samples based on their causal effects on the distillation.
arXiv Detail & Related papers (2023-05-28T06:53:41Z)
- Dataset Distillation via Factorization [58.8114016318593]
We introduce a dataset factorization approach, termed HaBa, which is a plug-and-play strategy portable to any existing dataset distillation (DD) baseline.
HaBa explores decomposing a dataset into two components: data Hallucination networks and Bases.
Our method can yield significant improvement on downstream classification tasks compared with previous state of the arts, while reducing the total number of compressed parameters by up to 65%.
arXiv Detail & Related papers (2022-10-30T08:36:19Z)
- Flexible Dataset Distillation: Learn Labels Instead of Images [44.73351338165214]
Distilling labels with our new algorithm leads to improved results over prior image-based distillation.
We show it to be more effective than the prior image-based approach to dataset distillation.
arXiv Detail & Related papers (2020-06-15T17:37:23Z)
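As a companion to the sample-reweighting idea in the "Mitigating Bias in Dataset Distillation" entry above, the following is a minimal, hypothetical sketch (not that paper's code) of density-based reweighting with a Gaussian kernel density estimate: samples in over-represented regions receive lower weights. The function name, bandwidth, and toy data are illustrative assumptions.

```python
# Hypothetical sketch of density-based sample reweighting, not the paper's method:
# samples in dense (over-represented) regions get lower weight, counteracting bias.
import numpy as np
from sklearn.neighbors import KernelDensity


def kde_sample_weights(features: np.ndarray, bandwidth: float = 0.5) -> np.ndarray:
    """Return per-sample weights inversely proportional to estimated density."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(features)
    density = np.exp(kde.score_samples(features))   # p(x_i) for every sample
    weights = 1.0 / (density + 1e-12)               # rarer samples weigh more
    return weights / weights.sum() * len(weights)   # normalize to mean 1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A biased toy set: 900 points from one cluster, 100 from another.
    feats = np.vstack([rng.normal(0.0, 1.0, (900, 2)),
                       rng.normal(5.0, 1.0, (100, 2))])
    w = kde_sample_weights(feats)
    print(w[:900].mean(), w[900:].mean())  # minority cluster gets larger weights
```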