Distill Gold from Massive Ores: Efficient Dataset Distillation via
Critical Samples Selection
- URL: http://arxiv.org/abs/2305.18381v3
- Date: Wed, 29 Nov 2023 10:46:19 GMT
- Title: Distill Gold from Massive Ores: Efficient Dataset Distillation via
Critical Samples Selection
- Authors: Yue Xu, Yong-Lu Li, Kaitong Cui, Ziyu Wang, Cewu Lu, Yu-Wing Tai,
Chi-Keung Tang
- Abstract summary: We model the dataset distillation task within the context of information transport.
We introduce and validate a family of data utility estimators and optimal data selection methods to exploit the most valuable samples.
Our method consistently enhances the distillation algorithms, even on much larger-scale and more heterogeneous datasets.
- Score: 101.78275454476311
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data-efficient learning has garnered significant attention, especially given
the current trend of large multi-modal models. Recently, dataset distillation
has become an effective approach to data efficiency; however, the distillation
process itself can still be inefficient. In this work, we model the dataset
distillation task within the context of information transport. Observing the
substantial data redundancy inherent in distillation, we argue for placing more
emphasis on each sample's utility for the distillation task. We introduce and
validate a family of data utility estimators and optimal data selection methods
to exploit the most valuable samples. This strategy significantly reduces the
training costs and extends various existing distillation algorithms to larger
and more diversified datasets; e.g., in some cases only 0.04% of the training data is
sufficient for comparable distillation performance. Our method consistently
enhances the distillation algorithms, even on much larger-scale and more
heterogeneous datasets, e.g., ImageNet-1K and Kinetics-400. This paradigm opens
up new avenues in the dynamics of distillation and paves the way for efficient
dataset distillation. Our code is available at
https://github.com/silicx/GoldFromOres.
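To make the selection idea concrete, the sketch below scores every training sample with a simple utility proxy and keeps only the highest-scoring fraction before handing the reduced pool to any existing distillation algorithm. This is a minimal illustration, not the paper's exact estimator or API: the loss-based proxy and the names probe_model and run_distillation are assumptions.

```python
# Minimal sketch of utility-based critical sample selection before distillation.
# Assumptions: a lightly trained probe_model supplies a loss-based utility proxy,
# and run_distillation stands in for any off-the-shelf distillation algorithm.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Subset

def estimate_utility(probe_model, dataset, device="cpu", batch_size=256):
    """Score each sample; here, higher loss under the probe model = higher utility."""
    probe_model.eval()
    probe_model.to(device)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    scores = []
    with torch.no_grad():
        for x, y in loader:
            logits = probe_model(x.to(device))
            loss = F.cross_entropy(logits, y.to(device), reduction="none")
            scores.append(loss.cpu())
    return torch.cat(scores)

def select_critical_samples(dataset, scores, keep_ratio=0.1):
    """Keep only the top keep_ratio fraction of samples by utility score."""
    k = max(1, int(keep_ratio * len(dataset)))
    top_idx = torch.topk(scores, k).indices.tolist()
    return Subset(dataset, top_idx)

# Hypothetical usage:
#   scores = estimate_utility(probe_model, full_train_set)
#   critical_pool = select_critical_samples(full_train_set, scores, keep_ratio=0.1)
#   synthetic_set = run_distillation(critical_pool)  # any existing DD method
```

The keep ratio is the only knob here; the abstract reports that in some cases as little as 0.04% of the training data already suffices for comparable distillation performance.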
Related papers
- Mitigating Bias in Dataset Distillation [62.79454960378792]
We study the impact of bias inside the original dataset on the performance of dataset distillation.
We introduce a simple yet highly effective approach based on a sample reweighting scheme utilizing kernel density estimation; a rough sketch of such a reweighting step follows this entry.
arXiv Detail & Related papers (2024-06-06T18:52:28Z)
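A rough sketch of how such a density-based reweighting step could look is given below. The feature space, bandwidth, and inverse-density weighting rule are illustrative assumptions rather than the cited paper's exact scheme.

```python
# Sketch: weight samples inversely to their estimated density so that
# under-represented regions of the data are not ignored during distillation.
# The Gaussian kernel, bandwidth, and inverse-density rule are assumptions.
import numpy as np
from sklearn.neighbors import KernelDensity

def kde_sample_weights(features, bandwidth=1.0, eps=1e-8):
    """features: (n_samples, n_dims) array, e.g. embeddings of the training set."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(features)
    log_density = kde.score_samples(features)      # log p(x_i) for each sample
    weights = 1.0 / (np.exp(log_density) + eps)    # rare samples get larger weight
    return weights / weights.sum()                 # normalize to a distribution
```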
- Exploring the potential of prototype-based soft-labels data distillation for imbalanced data classification [0.0]
The main goal is to push the performance of prototype-based soft-labels distillation further in terms of classification accuracy.
Experimental studies trace not only the method's capability to distill the data, but also its potential to act as an augmentation method.
arXiv Detail & Related papers (2024-03-25T19:15:19Z)
- Importance-Aware Adaptive Dataset Distillation [53.79746115426363]
The development of deep learning models is enabled by the availability of large-scale datasets.
Dataset distillation aims to synthesize a compact dataset that retains the essential information from the large original dataset.
We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance.
arXiv Detail & Related papers (2024-01-29T03:29:39Z)
- Dataset Distillation via Adversarial Prediction Matching [24.487950991247764]
We propose an adversarial framework to solve the dataset distillation problem efficiently.
Our method can produce synthetic datasets just 10% the size of the original, yet achieve, on average, 94% of the test accuracy of models trained on the full original datasets.
arXiv Detail & Related papers (2023-12-14T13:19:33Z)
- Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality [78.6359306550245]
We argue that using just one synthetic subset for distillation will not yield optimal generalization performance.
PDD synthesizes multiple small sets of synthetic images, each conditioned on the previous sets, and trains the model on the cumulative union of these subsets; a schematic of this loop is sketched after this entry.
Our experiments show that PDD can effectively improve the performance of existing dataset distillation methods by up to 4.3%.
arXiv Detail & Related papers (2023-10-10T20:04:44Z)
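The loop described above might be sketched roughly as follows; distill_subset and train_model are hypothetical placeholders for an existing distillation method and its training routine, not the paper's actual interfaces.

```python
# Sketch of progressive dataset distillation: synthesize several small synthetic
# subsets, each conditioned on the union of previously distilled ones, and train
# on the growing cumulative union. distill_subset and train_model are
# hypothetical placeholders, not the paper's actual interfaces.
def progressive_distillation(real_dataset, distill_subset, train_model,
                             num_stages=5, images_per_stage=10):
    synthetic_union = []                      # cumulative union of synthetic subsets
    model = None
    for stage in range(num_stages):
        # Condition the next small subset on everything distilled so far.
        new_subset = distill_subset(real_dataset,
                                    condition_on=list(synthetic_union),
                                    budget=images_per_stage)
        synthetic_union.extend(new_subset)
        # Train (or fine-tune) the model on the cumulative union after each stage.
        model = train_model(synthetic_union)
    return synthetic_union, model
```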
- Explicit and Implicit Knowledge Distillation via Unlabeled Data [5.702176304876537]
We propose an efficient unlabeled sample selection method to replace computationally expensive generators.
We also propose a class-dropping mechanism to suppress the label noise caused by data domain shifts.
Experimental results show that our method can quickly converge and obtain higher accuracy than other state-of-the-art methods.
arXiv Detail & Related papers (2023-02-17T09:10:41Z)
- A Comprehensive Survey of Dataset Distillation [73.15482472726555]
Deep learning technology has developed at an unprecedented pace in the last decade, and it has become challenging to handle the unlimited growth of data with limited computing power.
This paper provides a holistic understanding of dataset distillation from multiple aspects.
arXiv Detail & Related papers (2023-01-13T15:11:38Z)
- Dataset Distillation with Infinitely Wide Convolutional Networks [18.837952916998947]
We apply a distributed kernel-based meta-learning framework to achieve state-of-the-art results for dataset distillation.
We obtain over 64% test accuracy on the CIFAR-10 image classification task, a dramatic improvement over the previous best test accuracy of 40%.
Our state-of-the-art results extend across many other settings for MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, and SVHN.
arXiv Detail & Related papers (2021-07-27T18:31:42Z)
- New Properties of the Data Distillation Method When Working With Tabular Data [77.34726150561087]
Data distillation is the problem of reducing the volume of training data while keeping only the necessary information.
We show that the model trained on distilled samples can outperform the model trained on the original dataset.
arXiv Detail & Related papers (2020-10-19T20:27:58Z)