Sequential Subset Matching for Dataset Distillation
- URL: http://arxiv.org/abs/2311.01570v1
- Date: Thu, 2 Nov 2023 19:49:11 GMT
- Title: Sequential Subset Matching for Dataset Distillation
- Authors: Jiawei Du, Qin Shi, Joey Tianyi Zhou
- Abstract summary: We propose a new dataset distillation strategy called Sequential Subset Matching (SeqMatch).
Our analysis indicates that SeqMatch effectively addresses the coupling issue by sequentially generating the synthetic instances.
Our code is available at https://github.com/shqii1j/seqmatch.
- Score: 44.322842898670565
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dataset distillation is a newly emerging task that synthesizes a small
dataset for training deep neural networks (DNNs), reducing data storage and model
training costs. The synthetic datasets are expected to capture the essence of the
knowledge contained in real-world datasets, such that the former yield performance
similar to that of the latter. Recent advances in distillation methods have produced
notable improvements in generating synthetic datasets. However, current
state-of-the-art methods treat the entire synthetic dataset as a unified entity and
optimize each synthetic instance equally. This static optimization approach may
degrade dataset distillation performance. Specifically, we argue that static
optimization can give rise to a coupling issue within the synthetic data,
particularly when a larger amount of synthetic data is being optimized. This coupling
issue, in turn, causes the distilled dataset to fail to capture the high-level
features learned by the DNN in later training epochs.
In this study, we propose a new dataset distillation strategy called
Sequential Subset Matching (SeqMatch), which tackles this problem by adaptively
optimizing the synthetic data to encourage sequential acquisition of knowledge
during dataset distillation. Our analysis indicates that SeqMatch effectively
addresses the coupling issue by generating the synthetic instances sequentially,
thereby significantly enhancing performance. Our proposed SeqMatch outperforms
state-of-the-art methods on various datasets, including SVHN, CIFAR-10, CIFAR-100,
and Tiny ImageNet. Our code is available at
https://github.com/shqii1j/seqmatch.
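
To make the sequential idea concrete, below is a minimal, hypothetical sketch of distilling synthetic data as a sequence of subsets, where each new subset is matched against a model already warmed up on the frozen earlier subsets. This is not the authors' SeqMatch implementation (see the linked repository for that): the gradient-matching objective, the toy linear model, the random stand-in "real" data, and all hyperparameters (num_subsets, ipc_per_subset, learning rates, step counts) are illustrative assumptions.

```python
# Hypothetical sketch of sequential subset distillation; NOT the SeqMatch code.
# The matching loss is a simple gradient-matching placeholder, and random
# tensors stand in for real images.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

num_subsets = 3        # synthetic data is split into K subsets (assumed)
ipc_per_subset = 10    # examples per class per subset (assumed)
num_classes = 10
feat_dim = 32          # toy flat features instead of real images

model = nn.Linear(feat_dim, num_classes)        # stand-in for a DNN
real_x = torch.randn(512, feat_dim)             # stand-in for real data
real_y = torch.randint(0, num_classes, (512,))

def matching_loss(syn_x, syn_y, net):
    """Toy gradient-matching objective: align gradients computed on the
    synthetic batch with gradients computed on the real batch."""
    g_real = torch.autograd.grad(
        F.cross_entropy(net(real_x), real_y), list(net.parameters()))
    g_syn = torch.autograd.grad(
        F.cross_entropy(net(syn_x), syn_y), list(net.parameters()),
        create_graph=True)
    return sum(F.mse_loss(gs, gr.detach()) for gs, gr in zip(g_syn, g_real))

subsets = []
for k in range(num_subsets):
    # Each subset is optimized in turn; earlier subsets are frozen, so later
    # ones are pushed to encode knowledge the earlier ones have not captured.
    syn_x = torch.randn(ipc_per_subset * num_classes, feat_dim,
                        requires_grad=True)
    syn_y = torch.arange(num_classes).repeat(ipc_per_subset)
    opt = torch.optim.SGD([syn_x], lr=0.1)

    for step in range(100):
        # Warm up the network on the frozen earlier subsets so that the k-th
        # subset is matched against a partially trained model.
        for fx, fy in subsets:
            model.zero_grad()
            F.cross_entropy(model(fx), fy).backward()
            with torch.no_grad():
                for p in model.parameters():
                    p -= 0.01 * p.grad
        opt.zero_grad()
        matching_loss(syn_x, syn_y, model).backward()
        opt.step()

    subsets.append((syn_x.detach(), syn_y))

print(f"distilled {len(subsets)} subsets of "
      f"{ipc_per_subset * num_classes} examples each")
```

The point of the structure is that later subsets are optimized while earlier ones are held fixed, so they are encouraged to encode features acquired in later training stages rather than duplicating what the earlier subsets already represent.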
Related papers
- Diversity-Driven Synthesis: Enhancing Dataset Distillation through Directed Weight Adjustment [39.137060714048175]
We argue that enhancing diversity can improve the parallelizable yet isolated approach to synthesizing datasets.
We introduce a novel method that employs dynamic and directed weight adjustment techniques to modulate the synthesis process.
Our method ensures that each batch of synthetic data mirrors the characteristics of a large, varying subset of the original dataset.
arXiv Detail & Related papers (2024-09-26T08:03:19Z) - Hierarchical Features Matter: A Deep Exploration of GAN Priors for Improved Dataset Distillation [51.44054828384487]
We propose a novel parameterization method dubbed Hierarchical Generative Latent Distillation (H-GLaD).
This method systematically explores hierarchical layers within generative adversarial networks (GANs).
In addition, we introduce a novel class-relevant feature distance metric to alleviate the computational burden associated with synthetic dataset evaluation.
arXiv Detail & Related papers (2024-06-09T09:15:54Z) - Importance-Aware Adaptive Dataset Distillation [53.79746115426363]
The development of deep learning models is enabled by the availability of large-scale datasets.
Dataset distillation aims to synthesize a compact dataset that retains the essential information from the large original dataset.
We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance.
arXiv Detail & Related papers (2024-01-29T03:29:39Z) - Data Distillation Can Be Like Vodka: Distilling More Times For Better
Quality [78.6359306550245]
We argue that using just one synthetic subset for distillation will not yield optimal generalization performance.
PDD synthesizes multiple small sets of synthetic images, each conditioned on the previous sets, and trains the model on the cumulative union of these subsets (see the sketch after this list).
Our experiments show that PDD can effectively improve the performance of existing dataset distillation methods by up to 4.3%.
arXiv Detail & Related papers (2023-10-10T20:04:44Z) - Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching [19.8751746334929]
We present an algorithm that remains effective as the size of the synthetic dataset grows.
We experimentally find that the training stage of the trajectories we choose to match greatly affects the effectiveness of the distilled dataset.
In doing so, we successfully scale trajectory matching-based methods to larger synthetic datasets.
arXiv Detail & Related papers (2023-10-09T14:57:41Z) - Bridging the Gap: Enhancing the Utility of Synthetic Data via
Post-Processing Techniques [7.967995669387532]
Generative models have emerged as a promising solution for generating synthetic datasets that can replace or augment real-world data.
We propose three novel post-processing techniques to improve the quality and diversity of the synthetic dataset.
Experiments show that Gap Filler (GaFi) effectively reduces the gap to real-data accuracy, with errors of 2.03%, 1.78%, and 3.99% on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets, respectively.
arXiv Detail & Related papers (2023-05-17T10:50:38Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Generalizing Dataset Distillation via Deep Generative Prior [75.9031209877651]
We propose to distill an entire dataset's knowledge into a few synthetic images.
The idea is to synthesize a small number of synthetic data points that, when given to a learning algorithm as training data, result in a model approximating one trained on the original data.
We present a new optimization algorithm that distills a large number of images into a few intermediate feature vectors in the generative model's latent space.
arXiv Detail & Related papers (2023-05-02T17:59:31Z) - Dataset Distillation using Neural Feature Regression [32.53291298089172]
We develop an algorithm for dataset distillation using neural Feature Regression with Pooling (FRePo).
FRePo achieves state-of-the-art performance with an order of magnitude lower memory requirements and two orders of magnitude faster training than previous methods.
We show that high-quality distilled data can greatly improve various downstream applications, such as continual learning and membership inference defense.
arXiv Detail & Related papers (2022-06-01T19:02:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.