Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation
- URL: http://arxiv.org/abs/2503.18872v1
- Date: Mon, 24 Mar 2025 16:47:40 GMT
- Title: Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation
- Authors: Yanda Chen, Gongwei Chen, Miao Zhang, Weili Guan, Liqiang Nie
- Abstract summary: We introduce a novel curriculum coarse-to-fine selection (CCFS) method for efficient high-IPC dataset distillation. CCFS employs a coarse-to-fine strategy to select appropriate real data based on the current synthetic dataset in each curriculum.
- Score: 67.34754791044242
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dataset distillation (DD) excels in synthesizing a small number of images per class (IPC) but struggles to maintain its effectiveness in high-IPC settings. Recent works on dataset distillation demonstrate that combining distilled and real data can mitigate the effectiveness decay. However, our analysis of the combination paradigm reveals that the current one-shot and independent selection mechanism induces an incompatibility issue between distilled and real images. To address this issue, we introduce a novel curriculum coarse-to-fine selection (CCFS) method for efficient high-IPC dataset distillation. CCFS employs a curriculum selection framework for real data selection, where we leverage a coarse-to-fine strategy to select appropriate real data based on the current synthetic dataset in each curriculum. Extensive experiments validate CCFS, surpassing the state-of-the-art by +6.6% on CIFAR-10, +5.8% on CIFAR-100, and +3.4% on Tiny-ImageNet under high-IPC settings. Notably, CCFS achieves 60.2% test accuracy on ResNet-18 with a 20% compression ratio of Tiny-ImageNet, closely matching full-dataset training with only 0.3% degradation. Code: https://github.com/CYDaaa30/CCFS.
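The curriculum selection loop described in the abstract can be sketched roughly as follows. The `train_fn`, `coarse_score_fn`, and `fine_score_fn` callables, the coarse keep ratio, and the equal per-curriculum budget split are illustrative placeholders, not the authors' implementation (the linked repository contains the real one).

```python
import numpy as np

def ccfs_select(synthetic_set, real_pool, budget, train_fn,
                coarse_score_fn, fine_score_fn,
                num_curricula=3, coarse_keep_ratio=0.3):
    """Grow a real-data subset that complements the synthetic set, one curriculum at a time."""
    selected_idx = []                                 # indices into real_pool chosen so far
    remaining = list(range(len(real_pool)))
    per_phase = budget // num_curricula               # assumption: equal budget per curriculum
    for _ in range(num_curricula):
        # Train an evaluation model on the current combined (synthetic + selected) data.
        model = train_fn(synthetic_set + [real_pool[i] for i in selected_idx])
        # Coarse stage: cheap scoring of the remaining real pool against that model.
        coarse = np.array([coarse_score_fn(model, real_pool[i]) for i in remaining])
        n_keep = max(per_phase, int(coarse_keep_ratio * len(remaining)))
        candidates = [remaining[i] for i in np.argsort(-coarse)[:n_keep]]
        # Fine stage: finer-grained ranking of the survivors; take this phase's budget.
        fine = np.array([fine_score_fn(model, real_pool[i]) for i in candidates])
        chosen = [candidates[i] for i in np.argsort(-fine)[:per_phase]]
        selected_idx.extend(chosen)
        chosen_set = set(chosen)
        remaining = [i for i in remaining if i not in chosen_set]
    return selected_idx
```

The point of the structure is that each curriculum re-selects real data conditioned on the current synthetic-plus-selected set, rather than picking all real images once and independently of the distilled data.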
Related papers
- Picking the Cream of the Crop: Visual-Centric Data Selection with Collaborative Agents [62.616106562146776]
We propose a Visual-Centric Selection approach via Agents Collaboration (ViSA).
Our approach consists of 1) an image information quantification method via visual agents collaboration to select images with rich visual information, and 2) a visual-centric instruction quality assessment method to select high-quality instruction data related to high-quality images.
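Read purely as a generic two-stage pipeline, the idea could look like the hedged sketch below; `visual_agents` and `quality_agent` are caller-supplied scorers (e.g., VLM/LLM judges), not the ViSA components themselves, and the keep fractions are arbitrary.

```python
def visa_style_select(samples, visual_agents, quality_agent,
                      image_keep=0.5, pair_keep=0.5):
    """samples: dicts with 'image' and 'instruction'; agents are caller-supplied scorers."""
    # Stage 1: each visual agent scores how information-rich an image is;
    # keep the top fraction by the averaged score.
    img_scores = [sum(agent(s["image"]) for agent in visual_agents) / len(visual_agents)
                  for s in samples]
    ranked = sorted(range(len(samples)), key=lambda i: -img_scores[i])
    stage1 = [samples[i] for i in ranked[: int(image_keep * len(samples))]]
    # Stage 2: score each instruction conditioned on its (already high-value) image
    # and keep the highest-quality pairs.
    pair_scores = [quality_agent(s["image"], s["instruction"]) for s in stage1]
    ranked2 = sorted(range(len(stage1)), key=lambda i: -pair_scores[i])
    return [stage1[i] for i in ranked2[: int(pair_keep * len(stage1))]]
```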
arXiv Detail & Related papers (2025-02-27T09:37:30Z) - Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images [60.42768987736088]
We introduce a benchmark that equitably evaluates methodologies across both the distillation and pruning literatures.
Our benchmark reveals that in the mainstream dataset distillation setting for large-scale datasets, even randomly selected subsets can achieve surprisingly competitive performance.
We propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which focuses on leveraging image data exclusively.
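A schematic reading of the Prune/Combine/Augment steps named above; the importance score and augmentation policy are caller-supplied placeholders, not the paper's exact recipe.

```python
def pca_compress(images, score_fn, augment_fn, keep_ratio=0.2, copies=2):
    """Compress a dataset using only the images themselves (no relabeling)."""
    # Prune: keep only the highest-scoring real images.
    ranked = sorted(images, key=score_fn, reverse=True)
    pruned = ranked[: int(keep_ratio * len(images))]
    # Combine: the compressed set is just the pruned real images.
    compressed = list(pruned)
    # Augment: enlarge the small set with augmented copies for training.
    return compressed + [augment_fn(x) for x in compressed for _ in range(copies)]
```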
arXiv Detail & Related papers (2025-02-10T13:11:40Z) - SelMatch: Effectively Scaling Up Dataset Distillation via Selection-Based Initialization and Partial Updates by Trajectory Matching [10.696635172502141]
We introduce SelMatch, a novel distillation method that effectively scales with IPC.
When tested on CIFAR-10/100 and TinyImageNet, SelMatch consistently outperforms leading selection-only and distillation-only methods.
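A minimal sketch of the two ideas in the name, selection-based initialization plus partial updates via trajectory matching; `select_fn` and `tm_loss_fn` are caller-supplied, and the 30% update ratio is an arbitrary illustrative value rather than SelMatch's setting.

```python
import torch

def selmatch_style_distill(real_images, ipc, num_classes, select_fn, tm_loss_fn,
                           update_ratio=0.3, steps=1000, lr=0.1):
    """select_fn picks (ipc * num_classes) real images as a tensor;
    tm_loss_fn is a trajectory-matching loss over the current synthetic set."""
    # Selection-based initialization: seed the synthetic set with chosen real images.
    syn = select_fn(real_images, ipc * num_classes)
    n_update = int(update_ratio * len(syn))
    syn_learn = syn[:n_update].clone().requires_grad_(True)   # distilled portion
    syn_fixed = syn[n_update:].clone()                         # stays as selected real images
    opt = torch.optim.SGD([syn_learn], lr=lr)
    for _ in range(steps):
        # Partial updates: only syn_learn receives gradients from trajectory matching.
        loss = tm_loss_fn(torch.cat([syn_learn, syn_fixed]))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.cat([syn_learn.detach(), syn_fixed])
```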
arXiv Detail & Related papers (2024-05-28T06:54:04Z) - Zero-Shot Distillation for Image Encoders: How to Make Effective Use of Synthetic Data [40.37396692278567]
We focus on training smaller variants of the image encoder, which suffices for efficient zero-shot classification.
The use of synthetic data has shown promise in distilling representations from larger teachers, resulting in strong few-shot and linear probe performance.
We find that this approach surprisingly fails in true zero-shot settings when using contrastive losses.
arXiv Detail & Related papers (2024-04-25T14:24:41Z) - Distilling Datasets Into Less Than One Image [39.08927346274156]
We push the boundaries of dataset distillation, compressing the dataset into less than an image-per-class.
Our method establishes a new state-of-the-art performance on CIFAR-10, CIFAR-100, and CUB200 using as little as 0.3 images-per-class.
arXiv Detail & Related papers (2024-03-18T17:59:49Z) - Effective pruning of web-scale datasets based on complexity of concept clusters [48.125618324485195]
We present a method for pruning large-scale multimodal datasets for training CLIP-style models on ImageNet.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
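One way to read "complexity of concept clusters" as code is the sketch below, which clusters embeddings and uses within-cluster spread as a stand-in complexity measure to allocate the pruning budget; the paper's actual criterion and within-cluster rules may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def complexity_prune(embeddings, keep_fraction=0.3, n_clusters=100, seed=0):
    """Return indices of retained samples, allocating more budget to complex clusters."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(embeddings)
    budget = int(keep_fraction * len(embeddings))
    # Within-cluster spread as a cheap proxy for concept complexity.
    spreads = np.zeros(n_clusters)
    for c in range(n_clusters):
        members = embeddings[km.labels_ == c]
        if len(members):
            spreads[c] = np.linalg.norm(members - km.cluster_centers_[c], axis=1).mean()
    weights = spreads / spreads.sum()
    keep = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        n_keep = min(len(idx), int(round(weights[c] * budget)))
        # One simple within-cluster rule: keep the samples closest to the centroid.
        d = np.linalg.norm(embeddings[idx] - km.cluster_centers_[c], axis=1)
        keep.extend(idx[np.argsort(d)[:n_keep]].tolist())
    return keep
```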
arXiv Detail & Related papers (2024-01-09T14:32:24Z) - DataDAM: Efficient Dataset Distillation with Attention Matching [14.44010988811002]
Researchers have long tried to minimize training costs in deep learning by maintaining strong generalization across diverse datasets.
Emerging research on dataset distillation aims to reduce training costs by creating a small synthetic set that contains the information of a larger real dataset.
However, the synthetic data generated by previous methods are not guaranteed to distribute and discriminate as well as the original training data.
arXiv Detail & Related papers (2023-09-29T19:07:48Z) - Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation [96.92250565207017]
We study the data efficiency and selection for the dataset distillation task.
By re-formulating the dynamics of distillation, we provide insight into the inherent redundancy in the real dataset.
We find the most contributing samples based on their causal effects on the distillation.
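As a rough stand-in for "most contributing samples", the sketch below scores each real sample by the input-gradient norm of a distillation loss; this is only a crude influence proxy to illustrate contribution-based pruning, not the paper's causal-effect estimator.

```python
import torch

def prune_by_contribution(real_loader, model, distill_loss_fn, keep_fraction=0.5):
    """Rank real samples by a crude influence proxy and keep the top fraction."""
    scores, samples = [], []
    model.eval()
    for x, y in real_loader:
        x = x.clone().requires_grad_(True)
        loss = distill_loss_fn(model, x, y)
        # Gradient norm w.r.t. the input as a cheap per-sample contribution proxy.
        (g,) = torch.autograd.grad(loss, x)
        scores.extend(g.flatten(1).norm(dim=1).tolist())
        samples.extend(zip(x.detach(), y))
    order = sorted(range(len(samples)), key=lambda i: -scores[i])
    return [samples[i] for i in order[: int(keep_fraction * len(samples))]]
```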
arXiv Detail & Related papers (2023-05-28T06:53:41Z) - CUDA: Convolution-based Unlearnable Datasets [77.70422525613084]
Large-scale training of modern deep learning models heavily relies on publicly available data on the web.
Recent works aim to make data unlearnable for deep learning models by adding small, specially designed noises.
These methods are vulnerable to adversarial training (AT) and/or are computationally heavy.
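A toy illustration of the convolution-based idea: perturb each image with a small class-specific blur kernel so the label becomes predictable from the filter artifact rather than the image content. The kernel size and normalization here are illustrative choices, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def classwise_conv_perturb(images, labels, num_classes, k=3, seed=0):
    """images: (N, C, H, W) in [0, 1]; labels: (N,). One random depthwise blur per class."""
    g = torch.Generator().manual_seed(seed)
    kernels = torch.rand(num_classes, 1, k, k, generator=g)
    kernels = kernels / kernels.sum(dim=(-1, -2), keepdim=True)   # normalized blur kernels
    out = images.clone()
    channels = images.shape[1]
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            w = kernels[c].repeat(channels, 1, 1, 1)              # depthwise filter for class c
            out[mask] = F.conv2d(images[mask], w, padding=k // 2, groups=channels)
    return out.clamp(0, 1)
```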
arXiv Detail & Related papers (2023-03-07T22:57:23Z) - Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory [66.035487142452]
We show that trajectory-matching-based methods (MTT) can scale to large-scale datasets such as ImageNet-1K.
We propose a procedure to exactly compute the unrolled gradient with constant memory complexity, which allows us to scale MTT to ImageNet-1K seamlessly with a 6x reduction in memory footprint.
The resulting algorithm sets a new SOTA on ImageNet-1K: we can scale up to 50 IPC (images per class) on ImageNet-1K on a single GPU.
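For context, a toy version of the trajectory-matching objective that this work scales up. The naive unrolled form below keeps the whole inner-loop graph in memory, which is exactly the cost the constant-memory gradient computation removes; that computation itself is not reproduced here.

```python
import torch
import torch.nn.functional as F

def mtt_loss(syn_x, syn_y, theta_start, theta_target, lr=0.01, inner_steps=5):
    """Toy linear student (logits = x @ theta). Memory grows with inner_steps here,
    which is what the paper's constant-memory procedure avoids."""
    theta = theta_start.clone().requires_grad_(True)
    for _ in range(inner_steps):
        inner = F.cross_entropy(syn_x @ theta, syn_y)
        (g,) = torch.autograd.grad(inner, theta, create_graph=True)
        theta = theta - lr * g                     # keeps the whole unrolled graph alive
    num = ((theta - theta_target) ** 2).sum()
    den = ((theta_start - theta_target) ** 2).sum()
    return num / den

# The synthetic images are optimized by descending this loss, e.g.:
# syn_x = torch.randn(100, 64, requires_grad=True); syn_y = torch.randint(0, 10, (100,))
# theta_start, theta_target = torch.randn(64, 10), torch.randn(64, 10)
# mtt_loss(syn_x, syn_y, theta_start, theta_target).backward()
```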
arXiv Detail & Related papers (2022-11-19T04:46:03Z) - Dataset Distillation with Infinitely Wide Convolutional Networks [18.837952916998947]
We apply a distributed kernel-based meta-learning framework to achieve state-of-the-art results for dataset distillation.
We obtain over 64% test accuracy on the CIFAR-10 image classification task, a dramatic improvement over the previous best test accuracy of 40%.
Our state-of-the-art results extend across many other settings for MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, and SVHN.
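The kernel ridge-regression objective behind this line of work (KIP-style) can be sketched as below, with a plain RBF kernel standing in for the infinite-width convolutional NTK used in the paper.

```python
import torch

def rbf_kernel(a, b, gamma=0.1):
    return torch.exp(-gamma * torch.cdist(a, b) ** 2)

def kip_loss(x_syn, y_syn, x_real, y_real, reg=1e-4, kernel=rbf_kernel):
    """Kernel ridge regression on the synthetic support set, evaluated on real data.
    x_* are flattened images, y_* are one-hot (or regression) targets."""
    k_ss = kernel(x_syn, x_syn)
    k_rs = kernel(x_real, x_syn)
    alpha = torch.linalg.solve(k_ss + reg * torch.eye(len(x_syn)), y_syn)
    pred = k_rs @ alpha                      # KRR prediction on the real batch
    return ((pred - y_real) ** 2).mean()

# x_syn (and optionally y_syn) are updated by gradient descent on this loss.
```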
arXiv Detail & Related papers (2021-07-27T18:31:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.