Not All Samples Should Be Utilized Equally: Towards Understanding and Improving Dataset Distillation
- URL: http://arxiv.org/abs/2408.12483v1
- Date: Thu, 22 Aug 2024 15:20:32 GMT
- Title: Not All Samples Should Be Utilized Equally: Towards Understanding and Improving Dataset Distillation
- Authors: Shaobo Wang, Yantai Yang, Qilong Wang, Kaixin Li, Linfeng Zhang, Junchi Yan,
- Abstract summary: We take an initial step towards understanding various matching-based DD methods from the perspective of sample difficulty.
We then extend the neural scaling laws of data pruning to DD to theoretically explain these matching-based methods.
We introduce the Sample Difficulty Correction (SDC) approach, designed to predominantly generate easier samples to achieve higher dataset quality.
- Score: 57.6797306341115
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dataset Distillation (DD) aims to synthesize a small dataset capable of performing comparably to the original dataset. Despite the success of numerous DD methods, theoretical exploration of this area remains unaddressed. In this paper, we take an initial step towards understanding various matching-based DD methods from the perspective of sample difficulty. We begin by empirically examining sample difficulty, measured by gradient norm, and observe that different matching-based methods roughly correspond to specific difficulty tendencies. We then extend the neural scaling laws of data pruning to DD to theoretically explain these matching-based methods. Our findings suggest that prioritizing the synthesis of easier samples from the original dataset can enhance the quality of distilled datasets, especially in low IPC (image-per-class) settings. Based on our empirical observations and theoretical analysis, we introduce the Sample Difficulty Correction (SDC) approach, designed to predominantly generate easier samples to achieve higher dataset quality. Our SDC can be seamlessly integrated into existing methods as a plugin with minimal code adjustments. Experimental results demonstrate that adding SDC generates higher-quality distilled datasets across 7 distillation methods and 6 datasets.
Related papers
- Dataset Distillation as Pushforward Optimal Quantization [1.039189397779466]
We propose a simple extension of the state-of-the-art data distillation method D4M, achieving better performance on the ImageNet-1K dataset with trivial additional computation.
We demonstrate that when equipped with an encoder-decoder structure, the empirically successful disentangled methods can be reformulated as an optimal quantization problem.
In particular, we link existing disentangled dataset distillation methods to the classical optimal quantization and Wasserstein barycenter problems, demonstrating consistency of distilled datasets for diffusion-based generative priors.
arXiv Detail & Related papers (2025-01-13T20:41:52Z) - Adaptive Dataset Quantization [2.0105434963031463]
We introduce a versatile framework for dataset compression, namely Adaptive dataset Quantization (ADQ)
We propose a novel adaptive sampling strategy through the evaluation of generated bins' representativeness score, diversity score and importance score.
Our method not only exhibits superior generalization capability across different architectures, but also attains state-of-the-art results, surpassing DQ by average 3% on various datasets.
arXiv Detail & Related papers (2024-12-22T07:08:29Z) - Going Beyond Feature Similarity: Effective Dataset distillation based on Class-aware Conditional Mutual Information [43.44508080585033]
We introduce conditional mutual information (CMI) to assess the class-aware complexity of a dataset.
We minimize the distillation loss while constraining the class-aware complexity of the synthetic dataset.
arXiv Detail & Related papers (2024-12-13T08:10:47Z) - Importance-Aware Adaptive Dataset Distillation [53.79746115426363]
Development of deep learning models is enabled by the availability of large-scale datasets.
dataset distillation aims to synthesize a compact dataset that retains the essential information from the large original dataset.
We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance.
arXiv Detail & Related papers (2024-01-29T03:29:39Z) - Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning [50.809769498312434]
We propose a novel dataset pruning method termed as Temporal Dual-Depth Scoring (TDDS)
Our method achieves 54.51% accuracy with only 10% training data, surpassing random selection by 7.83% and other comparison methods by at least 12.69%.
arXiv Detail & Related papers (2023-11-22T03:45:30Z) - Embarassingly Simple Dataset Distillation [0.0]
We tackle dataset distillation at its core by treating it directly as a bilevel optimization problem.
A deeper dive into the nature of distilled data unveils pronounced intercorrelation.
We devise a boosting mechanism that generates distilled datasets that contain subsets with near optimal performance across different data budgets.
arXiv Detail & Related papers (2023-11-13T02:14:54Z) - Dataset Distillation: A Comprehensive Review [76.26276286545284]
dataset distillation (DD) aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset.
This paper gives a comprehensive review and summary of recent advances in DD and its application.
arXiv Detail & Related papers (2023-01-17T17:03:28Z) - Split-PU: Hardness-aware Training Strategy for Positive-Unlabeled
Learning [42.26185670834855]
Positive-Unlabeled (PU) learning aims to learn a model with rare positive samples and abundant unlabeled samples.
This paper focuses on improving the commonly-used nnPU with a novel training pipeline.
arXiv Detail & Related papers (2022-11-30T05:48:31Z) - Minimizing the Accumulated Trajectory Error to Improve Dataset
Distillation [151.70234052015948]
We propose a novel approach that encourages the optimization algorithm to seek a flat trajectory.
We show that the weights trained on synthetic data are robust against the accumulated errors perturbations with the regularization towards the flat trajectory.
Our method, called Flat Trajectory Distillation (FTD), is shown to boost the performance of gradient-matching methods by up to 4.7%.
arXiv Detail & Related papers (2022-11-20T15:49:11Z) - ScoreMix: A Scalable Augmentation Strategy for Training GANs with
Limited Data [93.06336507035486]
Generative Adversarial Networks (GANs) typically suffer from overfitting when limited training data is available.
We present ScoreMix, a novel and scalable data augmentation approach for various image synthesis tasks.
arXiv Detail & Related papers (2022-10-27T02:55:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.