SelMatch: Effectively Scaling Up Dataset Distillation via Selection-Based Initialization and Partial Updates by Trajectory Matching
- URL: http://arxiv.org/abs/2406.18561v1
- Date: Tue, 28 May 2024 06:54:04 GMT
- Title: SelMatch: Effectively Scaling Up Dataset Distillation via Selection-Based Initialization and Partial Updates by Trajectory Matching
- Authors: Yongmin Lee, Hye Won Chung
- Abstract summary: We introduce SelMatch, a novel distillation method that effectively scales with IPC.
When tested on CIFAR-10/100 and TinyImageNet, SelMatch consistently outperforms leading selection-only and distillation-only methods.
- Score: 10.696635172502141
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dataset distillation aims to synthesize a small number of images per class (IPC) from a large dataset to approximate full dataset training with minimal performance loss. While effective in very small IPC ranges, many distillation methods become less effective, even underperforming random sample selection, as IPC increases. Our examination of state-of-the-art trajectory-matching based distillation methods across various IPC scales reveals that these methods struggle to incorporate the complex, rare features of harder samples into the synthetic dataset even with the increased IPC, resulting in a persistent coverage gap between easy and hard test samples. Motivated by such observations, we introduce SelMatch, a novel distillation method that effectively scales with IPC. SelMatch uses selection-based initialization and partial updates through trajectory matching to manage the synthetic dataset's desired difficulty level tailored to IPC scales. When tested on CIFAR-10/100 and TinyImageNet, SelMatch consistently outperforms leading selection-only and distillation-only methods across subset ratios from 5% to 30%.
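The two ingredients named in the abstract can be illustrated with a short, hedged sketch: selection-based initialization seeds the synthetic set with real samples drawn from a difficulty window, and partial updates optimize only a subset of those images while the rest stay fixed as real samples. Everything below (the random data, the difficulty scores, the toy matching loss, the 50/50 split, and which half is frozen) is a placeholder, not the authors' code; the actual method derives difficulty from training dynamics and matches expert training trajectories.

```python
import torch

def select_by_difficulty(images, difficulty, ipc, window_start):
    """Take `ipc` samples from a difficulty-sorted window of the data."""
    order = torch.argsort(difficulty)          # easy -> hard
    idx = order[window_start : window_start + ipc]
    return images[idx].clone()

def selmatch_step(synthetic, frozen_mask, matching_loss, lr=0.1):
    """One partial update: only non-frozen images receive gradients."""
    synthetic.requires_grad_(True)
    loss = matching_loss(synthetic)
    grad, = torch.autograd.grad(loss, synthetic)
    with torch.no_grad():
        synthetic -= lr * grad * (~frozen_mask).float().view(-1, 1, 1, 1)
    return synthetic.detach(), loss.item()

images = torch.randn(1000, 3, 32, 32)          # stand-in for real data
difficulty = torch.rand(1000)                  # e.g. forgetting scores
ipc = 50
syn = select_by_difficulty(images, difficulty, ipc, window_start=200)
frozen = torch.arange(ipc) >= ipc // 2         # which half stays real is a design knob
toy_loss = lambda s: (s ** 2).mean()           # placeholder for trajectory matching
for _ in range(3):
    syn, loss = selmatch_step(syn, frozen, toy_loss)
print(f"toy matching loss: {loss:.4f}")
```

Per the abstract, the difficulty window and the trainable fraction are the quantities tailored to the IPC scale; the fixed values above are illustrative only.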
Related papers
- Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation [67.34754791044242]
We introduce a novel curriculum coarse-to-fine selection (CCFS) method for efficient high-IPC dataset distillation.
CCFS employs a coarse-to-fine strategy to select appropriate real data based on the current synthetic dataset in each curriculum.
arXiv Detail & Related papers (2025-03-24T16:47:40Z) - DELT: A Simple Diversity-driven EarlyLate Training for Dataset Distillation [23.02066055996762]
We propose a new Diversity-driven EarlyLate Training (DELT) scheme to enhance the diversity of images in batch-to-global matching.
Our approach is conceptually simple yet effective: it partitions predefined IPC samples into smaller subtasks and employs local optimizations.
Our approach outperforms the previous state-of-the-art by 2~5% on average across different datasets and IPCs (images per class).
arXiv Detail & Related papers (2024-11-29T18:59:46Z) - A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which harms training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z) - Not All Samples Should Be Utilized Equally: Towards Understanding and Improving Dataset Distillation [57.6797306341115]
We take an initial step towards understanding various matching-based DD methods from the perspective of sample difficulty.
We then extend the neural scaling laws of data pruning to DD to theoretically explain these matching-based methods.
We introduce the Sample Difficulty Correction (SDC) approach, designed to predominantly generate easier samples to achieve higher dataset quality.
arXiv Detail & Related papers (2024-08-22T15:20:32Z) - Data curation via joint example selection further accelerates multimodal learning [3.329535792151987]
We show that jointly selecting batches of data is more effective for learning than selecting examples independently.
We derive a simple and tractable algorithm for selecting such batches, which significantly accelerates training beyond individually prioritized data points.
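As a hedged illustration of why joint selection can differ from independent top-k scoring, the sketch below greedily builds a batch while penalizing redundancy among its members. This greedy variant is a stand-in of our own, not the paper's algorithm, which derives its sampler from model-based scores; the random features and scores are placeholders.

```python
import numpy as np

def topk_independent(scores, k):
    """Baseline: score each example alone and take the top k."""
    return np.argsort(scores)[-k:]

def greedy_joint(scores, features, k, redundancy=0.5):
    """Joint variant: an example's value is its score minus its
    similarity to examples already chosen for the batch."""
    chosen = []
    for _ in range(k):
        penalty = np.zeros_like(scores)
        if chosen:
            sims = features @ features[chosen].T   # cosine (rows normalized)
            penalty = redundancy * sims.max(axis=1)
        adjusted = scores - penalty
        adjusted[chosen] = -np.inf                 # no repeats
        chosen.append(int(adjusted.argmax()))
    return np.array(chosen)

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
scores = rng.random(100)
print(topk_independent(scores, 8))   # may pick near-duplicates
print(greedy_joint(scores, feats, 8))  # trades score for batch diversity
```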
arXiv Detail & Related papers (2024-06-25T16:52:37Z) - Importance-Aware Adaptive Dataset Distillation [53.79746115426363]
Development of deep learning models is enabled by the availability of large-scale datasets.
Dataset distillation aims to synthesize a compact dataset that retains the essential information from the large original dataset.
We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance.
arXiv Detail & Related papers (2024-01-29T03:29:39Z) - Dataset Distillation via Adversarial Prediction Matching [24.487950991247764]
We propose an adversarial framework to solve the dataset distillation problem efficiently.
Our method can produce synthetic datasets just 10% the size of the original, yet achieve, on average, 94% of the test accuracy of models trained on the full original datasets.
arXiv Detail & Related papers (2023-12-14T13:19:33Z) - Embarrassingly Simple Dataset Distillation [0.0]
We tackle dataset distillation at its core by treating it directly as a bilevel optimization problem.
A deeper dive into the nature of distilled data unveils pronounced intercorrelation.
We devise a boosting mechanism that generates distilled datasets containing subsets with near-optimal performance across different data budgets.
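The bilevel view is compact enough to sketch. Below is a minimal, hedged illustration of dataset distillation treated directly as bilevel optimization: the inner loop fits a learner to the distilled data, and the outer loop backpropagates the real-data loss through the unrolled inner steps. The linear learner, sizes, step counts, and random data are toy choices, not the paper's setup.

```python
import torch
import torch.nn.functional as F

def inner_train(x_syn, y_syn, steps=5, lr=0.5):
    """Unrolled inner loop: fit a linear model to the distilled data,
    keeping the graph so the outer loss can differentiate through
    the training procedure itself."""
    w = torch.zeros(x_syn.shape[1], y_syn.shape[1], requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(x_syn @ w, y_syn.softmax(dim=1))
        grad, = torch.autograd.grad(loss, w, create_graph=True)
        w = w - lr * grad                      # keep graph for outer step
    return w

# Toy stand-ins: 10 classes, 32-dim features, 1 distilled point per class.
x_real = torch.randn(256, 32)
y_real = torch.randint(0, 10, (256,))
x_syn = torch.randn(10, 32, requires_grad=True)   # learnable "images"
y_syn = torch.eye(10, requires_grad=True)         # learnable soft labels
opt = torch.optim.Adam([x_syn, y_syn], lr=0.01)

for step in range(100):
    w = inner_train(x_syn, y_syn)                     # inner problem
    outer_loss = F.cross_entropy(x_real @ w, y_real)  # outer objective
    opt.zero_grad()
    outer_loss.backward()              # gradients flow into x_syn, y_syn
    opt.step()
```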
arXiv Detail & Related papers (2023-11-13T02:14:54Z) - Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation [96.92250565207017]
We study the data efficiency and selection for the dataset distillation task.
By re-formulating the dynamics of distillation, we provide insight into the inherent redundancy in the real dataset.
We identify the most influential samples based on their causal effects on the distillation.
arXiv Detail & Related papers (2023-05-28T06:53:41Z) - DREAM: Efficient Dataset Distillation by Representative Matching [38.92087223000823]
We propose a novel matching strategy named Dataset distillation by REpresentAtive Matching (DREAM).
DREAM can be easily plugged into popular dataset distillation frameworks and reduces the distillation iterations by more than 8 times without a performance drop.
arXiv Detail & Related papers (2023-02-28T08:48:45Z) - Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation [151.70234052015948]
We propose a novel approach that encourages the optimization algorithm to seek a flat trajectory.
We show that the weights trained on synthetic data are robust against accumulated-error perturbations when regularized toward the flat trajectory.
Our method, called Flat Trajectory Distillation (FTD), is shown to boost the performance of gradient-matching methods by up to 4.7%.
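The paper specifies its own flat-trajectory regularization; as a hedged stand-in, the sketch below shows one generic way to bias a training run toward a flat trajectory: penalize the distance between the current weights and their exponential moving average, which damps sharp turns in weight space. The `beta`, `lam`, model, and data are illustrative choices, not FTD's.

```python
import torch
import torch.nn.functional as F

def flat_step(model, batch, ema, opt, beta=0.99, lam=0.1):
    """One training step with an EMA-anchor penalty that discourages
    sharp turns in weight space (illustrative, not the paper's loss)."""
    x, y = batch
    loss = F.cross_entropy(model(x), y)
    flat_penalty = sum(((p - e) ** 2).sum()
                       for p, e in zip(model.parameters(), ema))
    opt.zero_grad()
    (loss + lam * flat_penalty).backward()
    opt.step()
    with torch.no_grad():                      # EMA trails the weights
        for p, e in zip(model.parameters(), ema):
            e.mul_(beta).add_(p, alpha=1 - beta)
    return loss.item()

model = torch.nn.Linear(8, 2)
ema = [p.detach().clone() for p in model.parameters()]
opt = torch.optim.SGD(model.parameters(), lr=0.1)
batch = (torch.randn(16, 8), torch.randint(0, 2, (16,)))
for _ in range(10):
    print(flat_step(model, batch, ema, opt))
```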
arXiv Detail & Related papers (2022-11-20T15:49:11Z) - Dataset Distillation with Infinitely Wide Convolutional Networks [18.837952916998947]
We apply a distributed kernel-based meta-learning framework to achieve state-of-the-art results for dataset distillation.
We obtain over 64% test accuracy on CIFAR-10 image classification task, a dramatic improvement over the previous best test accuracy of 40%.
Our state-of-the-art results extend across many other settings for MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, and SVHN.
arXiv Detail & Related papers (2021-07-27T18:31:42Z) - Capturing patterns of variation unique to a specific dataset [68.8204255655161]
We propose a tuning-free method that identifies low-dimensional representations of a target dataset relative to one or more comparison datasets.
We show in several experiments that UCA with a single background dataset achieves results similar to those of cPCA with various tuning parameters.
arXiv Detail & Related papers (2021-04-16T15:07:32Z)