Related papers: Not All Samples Should Be Utilized Equally: Towards Understanding and Improving Dataset Distillation

URL: http://arxiv.org/abs/2408.12483v1
Date: Thu, 22 Aug 2024 15:20:32 GMT
Title: Not All Samples Should Be Utilized Equally: Towards Understanding and Improving Dataset Distillation
Authors: Shaobo Wang, Yantai Yang, Qilong Wang, Kaixin Li, Linfeng Zhang, Junchi Yan
Abstract summary: We take an initial step towards understanding various matching-based DD methods from the perspective of sample difficulty. We then extend the neural scaling laws of data pruning to DD to theoretically explain these matching-based methods. We introduce the Sample Difficulty Correction (SDC) approach, designed to predominantly generate easier samples to achieve higher dataset quality.
Score: 57.6797306341115
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Dataset Distillation (DD) aims to synthesize a small dataset capable of performing comparably to the original dataset. Despite the success of numerous DD methods, theoretical exploration of this area remains unaddressed. In this paper, we take an initial step towards understanding various matching-based DD methods from the perspective of sample difficulty. We begin by empirically examining sample difficulty, measured by gradient norm, and observe that different matching-based methods roughly correspond to specific difficulty tendencies. We then extend the neural scaling laws of data pruning to DD to theoretically explain these matching-based methods. Our findings suggest that prioritizing the synthesis of easier samples from the original dataset can enhance the quality of distilled datasets, especially in low IPC (image-per-class) settings. Based on our empirical observations and theoretical analysis, we introduce the Sample Difficulty Correction (SDC) approach, designed to predominantly generate easier samples to achieve higher dataset quality. Our SDC can be seamlessly integrated into existing methods as a plugin with minimal code adjustments. Experimental results demonstrate that adding SDC generates higher-quality distilled datasets across 7 distillation methods and 6 datasets.
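The abstract measures sample difficulty by gradient norm. As a rough illustration only, the sketch below scores samples by the norm of the cross-entropy gradient for a linear softmax classifier. The model, the random data, and the names `softmax`, `per_sample_grad_norm`, `scores`, and `easiest_first` are assumptions made for this example; this is not the authors' SDC implementation.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def per_sample_grad_norm(W, X, y):
    """Norm of d(cross-entropy)/dW per sample for a linear softmax model.

    Hypothetical helper: W has shape (features, classes), X has shape
    (n, features), and y holds integer class labels of shape (n,).
    """
    probs = softmax(X @ W)                    # (n, classes)
    residual = probs.copy()
    residual[np.arange(len(y)), y] -= 1.0     # p - onehot(y)
    # The gradient for sample i is outer(x_i, residual_i), so its
    # Frobenius norm factorizes into ||x_i|| * ||residual_i||.
    return np.linalg.norm(X, axis=1) * np.linalg.norm(residual, axis=1)

# Toy data (assumed, for illustration only).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
X = rng.normal(size=(8, 4))
y = rng.integers(0, 3, size=8)

scores = per_sample_grad_norm(W, X, y)
easiest_first = np.argsort(scores)            # low gradient norm = "easier"
```

Under this proxy, samples with a smaller gradient norm count as easier, so `easiest_first` lists sample indices from easiest to hardest.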

Related papers

Difficulty-guided Sampling: Bridging the Target Gap between Dataset Distillation and Downstream Tasks [55.27114962330541]
We propose difficulty-guided sampling (DGS) to bridge the target gap between the distillation objective and the downstream task. Deep neural networks achieve remarkable performance but have time- and storage-consuming training processes.
arXiv Detail & Related papers (2026-01-15T05:29:50Z)
Rectified Decoupled Dataset Distillation: A Closer Look for Fair and Comprehensive Evaluation [36.444254126901065]
We propose Rectified Decoupled dataset Distillation (RD$^3$) to generate compact synthetic datasets. RD$^3$ provides a foundation for fair and reproducible comparisons in future dataset distillation research.
arXiv Detail & Related papers (2025-09-24T03:47:04Z)
Task-Specific Generative Dataset Distillation with Difficulty-Guided Sampling [31.51048512214796]
Dataset distillation aims to generate compact, high-quality synthetic datasets that can achieve comparable performance to the original dataset. We propose a task-specific sampling strategy for generative dataset distillation that incorporates the concept of difficulty to better account for the requirements of the target task. The results of extensive experiments demonstrate the effectiveness of our method and suggest its potential for enhancing performance on other downstream tasks.
arXiv Detail & Related papers (2025-07-04T06:38:02Z)
Efficient Dataset Distillation through Low-Rank Space Sampling [34.29086540681496]
This paper proposes a dataset distillation method based on Matching Training Trajectories with Low-rank Space Sampling. The synthetic data is represented by basis vectors and shared dimension mappers from these subspaces. The proposed method is tested on CIFAR-10, CIFAR-100, and SVHN datasets, and outperforms the baseline methods by an average of 9.9%.
arXiv Detail & Related papers (2025-03-11T02:59:17Z)
Dataset Distillation as Pushforward Optimal Quantization [1.039189397779466]
We propose a simple extension of the state-of-the-art data distillation method D4M, achieving better performance on the ImageNet-1K dataset with trivial additional computation. We demonstrate that when equipped with an encoder-decoder structure, the empirically successful disentangled methods can be reformulated as an optimal quantization problem. In particular, we link existing disentangled dataset distillation methods to the classical optimal quantization and Wasserstein barycenter problems, demonstrating consistency of distilled datasets for diffusion-based generative priors.
arXiv Detail & Related papers (2025-01-13T20:41:52Z)
Adaptive Dataset Quantization [2.0105434963031463]
We introduce a versatile framework for dataset compression, namely Adaptive dataset Quantization (ADQ). We propose a novel adaptive sampling strategy through the evaluation of generated bins' representativeness score, diversity score and importance score. Our method not only exhibits superior generalization capability across different architectures, but also attains state-of-the-art results, surpassing DQ by an average of 3% on various datasets.
arXiv Detail & Related papers (2024-12-22T07:08:29Z)
Going Beyond Feature Similarity: Effective Dataset distillation based on Class-aware Conditional Mutual Information [43.44508080585033]
We introduce conditional mutual information (CMI) to assess the class-aware complexity of a dataset. We minimize the distillation loss while constraining the class-aware complexity of the synthetic dataset.
arXiv Detail & Related papers (2024-12-13T08:10:47Z)
Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks [10.932880269282014]
We propose the first effective DD method for SSL pre-training. Specifically, we train a small student model to match the representations of a larger teacher model trained with SSL. As the KD objective has considerably lower variance than SSL, our approach can generate synthetic datasets that can successfully pre-train high-quality encoders.
arXiv Detail & Related papers (2024-10-03T00:39:25Z)
Importance-Aware Adaptive Dataset Distillation [53.79746115426363]
Development of deep learning models is enabled by the availability of large-scale datasets. Dataset distillation aims to synthesize a compact dataset that retains the essential information from the large original dataset. We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance.
arXiv Detail & Related papers (2024-01-29T03:29:39Z)
Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning [50.809769498312434]
We propose a novel dataset pruning method termed Temporal Dual-Depth Scoring (TDDS). Our method achieves 54.51% accuracy with only 10% training data, surpassing random selection by 7.83% and other comparison methods by at least 12.69%.
arXiv Detail & Related papers (2023-11-22T03:45:30Z)
Embarassingly Simple Dataset Distillation [0.0]
We tackle dataset distillation at its core by treating it directly as a bilevel optimization problem. A deeper dive into the nature of distilled data unveils pronounced intercorrelation. We devise a boosting mechanism that generates distilled datasets that contain subsets with near optimal performance across different data budgets.
arXiv Detail & Related papers (2023-11-13T02:14:54Z)
Dataset Distillation: A Comprehensive Review [76.26276286545284]
Dataset distillation (DD) aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset. This paper gives a comprehensive review and summary of recent advances in DD and its application.
arXiv Detail & Related papers (2023-01-17T17:03:28Z)
Split-PU: Hardness-aware Training Strategy for Positive-Unlabeled Learning [42.26185670834855]
Positive-Unlabeled (PU) learning aims to learn a model with rare positive samples and abundant unlabeled samples. This paper focuses on improving the commonly-used nnPU with a novel training pipeline.
arXiv Detail & Related papers (2022-11-30T05:48:31Z)
Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation [151.70234052015948]
We propose a novel approach that encourages the optimization algorithm to seek a flat trajectory. We show that, with regularization towards the flat trajectory, the weights trained on synthetic data are robust against perturbations from accumulated errors. Our method, called Flat Trajectory Distillation (FTD), is shown to boost the performance of gradient-matching methods by up to 4.7%.
arXiv Detail & Related papers (2022-11-20T15:49:11Z)
ScoreMix: A Scalable Augmentation Strategy for Training GANs with Limited Data [93.06336507035486]
Generative Adversarial Networks (GANs) typically suffer from overfitting when limited training data is available. We present ScoreMix, a novel and scalable data augmentation approach for various image synthesis tasks.
arXiv Detail & Related papers (2022-10-27T02:55:15Z)
Efficient Classification with Counterfactual Reasoning and Active Learning [4.708737212700907]
A method called CCRAL combines causal reasoning to learn counterfactual samples for the original training samples and active learning to select useful counterfactual samples based on a region of uncertainty. Experiments show that CCRAL achieves significantly better performance than the baselines in terms of accuracy and AUC.
arXiv Detail & Related papers (2022-07-25T12:03:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.