Related papers: Label-Augmented Dataset Distillation

Label-Augmented Dataset Distillation

URL: http://arxiv.org/abs/2409.16239v1
Date: Tue, 24 Sep 2024 16:54:22 GMT
Title: Label-Augmented Dataset Distillation
Authors: Seoungyoon Kang, Youngsun Lim, Hyunjung Shim,
Abstract summary: We introduce Label-Augmented dataset Distillation (LADD) to enhance dataset distillation with label augmentations. LADD sub-samples each synthetic image, generating additional dense labels to capture rich semantics. With three high-performance dataset distillation algorithms, LADD achieves remarkable gains by an average of 14.9% in accuracy.
Score: 13.449340904911725
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Traditional dataset distillation primarily focuses on image representation while often overlooking the important role of labels. In this study, we introduce Label-Augmented Dataset Distillation (LADD), a new dataset distillation framework enhancing dataset distillation with label augmentations. LADD sub-samples each synthetic image, generating additional dense labels to capture rich semantics. These dense labels require only a 2.5% increase in storage (ImageNet subsets) with significant performance benefits, providing strong learning signals. Our label generation strategy can complement existing dataset distillation methods for significantly enhancing their training efficiency and performance. Experimental results demonstrate that LADD outperforms existing methods in terms of computational overhead and accuracy. With three high-performance dataset distillation algorithms, LADD achieves remarkable gains by an average of 14.9% in accuracy. Furthermore, the effectiveness of our method is proven across various datasets, distillation hyperparameters, and algorithms. Finally, our method improves the cross-architecture robustness of the distilled dataset, which is important in the application scenario.

Related papers

DD-Ranking: Rethinking the Evaluation of Dataset Distillation [223.28392857127733]
We propose DD-Ranking, a unified evaluation framework, along with new general evaluation metrics to uncover the true performance improvements achieved by different methods.<n>By refocusing on the actual information enhancement of distilled datasets, DD-Ranking provides a more comprehensive and fair evaluation standard for future research advancements.
arXiv Detail & Related papers (2025-05-19T16:19:50Z)
Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images [60.42768987736088]
We introduce a benchmark that equitably evaluates methodologies across both distillation and pruning literatures. Our benchmark reveals that in the mainstream dataset distillation setting for large-scale datasets, even randomly selected subsets can achieve surprisingly competitive performance. We propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which focuses on leveraging image data exclusively.
arXiv Detail & Related papers (2025-02-10T13:11:40Z)
Dataset Distillation via Committee Voting [21.018818924580877]
We introduce $bf C$ommittee $bf V$oting for $bf D$ataset $bf D$istillation (CV-DD) CV-DD is a novel approach that leverages the collective wisdom of multiple models or experts to create high-quality distilled datasets.
arXiv Detail & Related papers (2025-01-13T18:59:48Z)
Generative Dataset Distillation Based on Self-knowledge Distillation [49.20086587208214]
We present a novel generative dataset distillation method that can improve the accuracy of aligning prediction logits. Our approach integrates self-knowledge distillation to achieve more precise distribution matching between the synthetic and original data. Our method outperforms existing state-of-the-art methods, resulting in superior distillation performance.
arXiv Detail & Related papers (2025-01-08T00:43:31Z)
Heavy Labels Out! Dataset Distillation with Label Space Lightening [69.67681224137561]
HeLlO aims at effective image-to-label projectors, with which synthetic labels can be directly generated online from synthetic images. We demonstrate that with only about 0.003% of the original storage required for a complete set of soft labels, we achieve comparable performance to current state-of-the-art dataset distillation methods on large-scale datasets.
arXiv Detail & Related papers (2024-08-15T15:08:58Z)
A Label is Worth a Thousand Images in Dataset Distillation [16.272675455429006]
Data $textitquality$ is a crucial factor in the performance of machine learning models. We show that the main factor explaining the performance of state-of-the-art distillation methods is not the techniques used to generate synthetic data but rather the use of soft labels.
arXiv Detail & Related papers (2024-06-15T03:30:29Z)
Mitigating Bias in Dataset Distillation [62.79454960378792]
We study the impact of bias inside the original dataset on the performance of dataset distillation. We introduce a simple yet highly effective approach based on a sample reweighting scheme utilizing kernel density estimation.
arXiv Detail & Related papers (2024-06-06T18:52:28Z)
ATOM: Attention Mixer for Efficient Dataset Distillation [17.370852204228253]
We propose a module to efficiently distill large datasets using a mixture of channel and spatial-wise attention. By integrating both types of attention, our ATOM module demonstrates superior performance across various computer vision datasets.
arXiv Detail & Related papers (2024-05-02T15:15:01Z)
Importance-Aware Adaptive Dataset Distillation [53.79746115426363]
Development of deep learning models is enabled by the availability of large-scale datasets. dataset distillation aims to synthesize a compact dataset that retains the essential information from the large original dataset. We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance.
arXiv Detail & Related papers (2024-01-29T03:29:39Z)
Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality [78.6359306550245]
We argue that using just one synthetic subset for distillation will not yield optimal generalization performance. PDD synthesizes multiple small sets of synthetic images, each conditioned on the previous sets, and trains the model on the cumulative union of these subsets. Our experiments show that PDD can effectively improve the performance of existing dataset distillation methods by up to 4.3%.
arXiv Detail & Related papers (2023-10-10T20:04:44Z)
Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation [96.92250565207017]
We study the data efficiency and selection for the dataset distillation task. By re-formulating the dynamics of distillation, we provide insight into the inherent redundancy in the real dataset. We find the most contributing samples based on their causal effects on the distillation.
arXiv Detail & Related papers (2023-05-28T06:53:41Z)
Dataset Distillation via Factorization [58.8114016318593]
We introduce a emphdataset factorization approach, termed emphHaBa, which is a plug-and-play strategy portable to any existing dataset distillation (DD) baseline. emphHaBa explores decomposing a dataset into two components: data emphHallucination networks and emphBases. Our method can yield significant improvement on downstream classification tasks compared with previous state of the arts, while reducing the total number of compressed parameters by up to 65%.
arXiv Detail & Related papers (2022-10-30T08:36:19Z)
Flexible Dataset Distillation: Learn Labels Instead of Images [44.73351338165214]
Distilling labels with our new algorithm leads to improved results over prior image-based distillation. We show it to be more effective than the prior image-based approach to dataset distillation.
arXiv Detail & Related papers (2020-06-15T17:37:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.