Dataset Condensation via Efficient Synthetic-Data Parameterization
- URL: http://arxiv.org/abs/2205.14959v2
- Date: Thu, 2 Jun 2022 05:45:02 GMT
- Title: Dataset Condensation via Efficient Synthetic-Data Parameterization
- Authors: Jang-Hyun Kim, Jinuk Kim, Seong Joon Oh, Sangdoo Yun, Hwanjun Song,
Joonhyun Jeong, Jung-Woo Ha, Hyun Oh Song
- Abstract summary: Machine learning with massive amounts of data comes at the price of huge computation and storage costs for training and tuning.
Recent studies on dataset condensation attempt to reduce the dependence on such massive data by synthesizing a compact training dataset.
We propose a novel condensation framework that generates multiple synthetic data examples within a limited storage budget via efficient parameterization that accounts for data regularity.
- Score: 40.56817483607132
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The great success of machine learning with massive amounts of data comes at
the price of huge computation and storage costs for training and tuning. Recent
studies on dataset condensation attempt to reduce the dependence on such
massive data by synthesizing a compact training dataset. However, the existing
approaches have fundamental limitations in optimization due to the limited
representability of synthetic datasets without considering any data regularity
characteristics. To this end, we propose a novel condensation framework that
generates multiple synthetic data examples within a limited storage budget via
efficient parameterization that accounts for data regularity. We further analyze the
shortcomings of the existing gradient matching-based condensation methods and
develop an effective optimization technique for improving the condensation of
training data information. We propose a unified algorithm that drastically
improves the quality of condensed data against the current state-of-the-art on
CIFAR-10, ImageNet, and Speech Commands.
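The abstract describes two ingredients: a gradient-matching objective between real and synthetic batches, and an efficient parameterization that decodes a compact stored tensor into multiple full-resolution synthetic examples. The following is a minimal sketch of these ideas in PyTorch; the tiny network, tensor shapes, the `multi_formation` upsampling decoder, and the single optimization step are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_formation(z, factor=2):
    # Decode condensed storage: upsample stored low-resolution images so a
    # fixed storage budget yields larger/more synthetic training examples.
    # (The name and decoder choice are illustrative assumptions.)
    return F.interpolate(z, scale_factor=factor, mode="bilinear", align_corners=False)

def gradient_match_loss(net, loss_fn, x_syn, y_syn, x_real, y_real):
    # Cosine distance between network gradients on a real and a synthetic batch.
    g_real = torch.autograd.grad(loss_fn(net(x_real), y_real), net.parameters())
    g_syn = torch.autograd.grad(loss_fn(net(x_syn), y_syn), net.parameters(),
                                create_graph=True)
    total = 0.0
    for gr, gs in zip(g_real, g_syn):
        total = total + (1 - F.cosine_similarity(gr.flatten(), gs.flatten(), dim=0))
    return total

# Hypothetical setup: 10 classes of 32x32 RGB images, stored at 16x16.
net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10))
loss_fn = nn.CrossEntropyLoss()
z = torch.randn(10, 3, 16, 16, requires_grad=True)     # condensed storage
y_syn = torch.arange(10)                               # one image per class
x_real = torch.randn(128, 3, 32, 32)                   # stand-in real batch
y_real = torch.randint(0, 10, (128,))
opt = torch.optim.SGD([z], lr=0.1)

x_syn = multi_formation(z)                             # decode before matching
loss = gradient_match_loss(net, loss_fn, x_syn, y_syn, x_real, y_real)
opt.zero_grad()
loss.backward()
opt.step()
```

In practice the condensed tensor would be optimized over many iterations and network initializations, per class, together with the additional optimization techniques analyzed in the paper; this sketch only shows one gradient-matching update on the decoded synthetic data.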
Related papers
- Hierarchical Features Matter: A Deep Exploration of GAN Priors for Improved Dataset Distillation [51.44054828384487]
We propose a novel parameterization method dubbed Hierarchical Generative Latent Distillation (H-GLaD).
This method systematically explores hierarchical layers within generative adversarial networks (GANs).
In addition, we introduce a novel class-relevant feature distance metric to alleviate the computational burden associated with synthetic dataset evaluation.
arXiv Detail & Related papers (2024-06-09T09:15:54Z)
- CondTSF: One-line Plugin of Dataset Condensation for Time Series Forecasting [22.473436770730657]
The objective of dataset condensation is to ensure that a model trained on the synthetic dataset performs comparably to a model trained on the full dataset.
In classification, the synthetic data is considered well-distilled if the model trained with the full dataset and the model trained with the synthetic dataset yield identical labels for the same input.
In TS-forecasting, the effectiveness of synthetic data distillation is determined by the distance between predictions of the two models.
arXiv Detail & Related papers (2024-06-04T09:18:20Z)
- Calibrated Dataset Condensation for Faster Hyperparameter Search [23.790315967011345]
State-of-the-art approaches rely on matching the model gradients between the real and synthetic data.
This paper considers a different condensation objective specifically geared toward hyperparameter search.
arXiv Detail & Related papers (2024-05-27T17:55:01Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with far fewer computational resources (a minimal sketch of the distribution-matching idea appears after this list).
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- Towards Efficient Deep Hashing Retrieval: Condensing Your Data via Feature-Embedding Matching [7.908244841289913]
The cost of training state-of-the-art deep hashing retrieval models has been increasing.
Existing state-of-the-art dataset distillation methods cannot be extended to all deep hashing retrieval methods.
We propose an efficient condensation framework that addresses these limitations by matching the feature embeddings of the synthetic set and the real set.
arXiv Detail & Related papers (2023-05-29T13:23:55Z)
- Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation [151.70234052015948]
We propose a novel approach that encourages the optimization algorithm to seek a flat trajectory.
We show that weights trained on synthetic data are robust against perturbations from accumulated trajectory errors when regularized toward a flat trajectory.
Our method, called Flat Trajectory Distillation (FTD), is shown to boost the performance of gradient-matching methods by up to 4.7%.
arXiv Detail & Related papers (2022-11-20T15:49:11Z)
- DC-BENCH: Dataset Condensation Benchmark [79.18718490863908]
This work provides the first large-scale standardized benchmark on dataset condensation.
It consists of a suite of evaluations that comprehensively reflect the generality and effectiveness of condensation methods.
The benchmark library is open-sourced to facilitate future research and application.
arXiv Detail & Related papers (2022-07-20T03:54:05Z)
- CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z)
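Several of the entries above (Improved Distribution Matching, CAFE) replace gradient matching with feature or distribution alignment between real and synthetic data. Below is a minimal sketch of that idea, assuming a randomly initialized feature extractor and single-class batches; it is illustrative only and does not reproduce any specific paper's method.

```python
import torch
import torch.nn as nn

# Randomly initialized feature extractor (an assumption for illustration).
embed = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten())

x_syn = torch.randn(10, 3, 32, 32, requires_grad=True)   # synthetic images (one class)
x_real = torch.randn(128, 3, 32, 32)                      # real images of the same class
opt = torch.optim.SGD([x_syn], lr=1.0)

# Match the mean feature embeddings of the real and synthetic batches.
loss = ((embed(x_real).mean(dim=0) - embed(x_syn).mean(dim=0)) ** 2).sum()
opt.zero_grad()
loss.backward()
opt.step()
```

Because this objective needs only forward passes through the feature extractor (no second-order gradients), it is typically much cheaper per step than gradient matching, which is the efficiency advantage the distribution-matching entry highlights.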