Improved Distribution Matching for Dataset Condensation
- URL: http://arxiv.org/abs/2307.09742v1
- Date: Wed, 19 Jul 2023 04:07:33 GMT
- Title: Improved Distribution Matching for Dataset Condensation
- Authors: Ganlong Zhao, Guanbin Li, Yipeng Qin, Yizhou Yu
- Abstract summary: We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
- Score: 91.55972945798531
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dataset Condensation aims to condense a large dataset into a smaller one
while maintaining its ability to train a well-performing model, thus reducing
the storage cost and training effort in deep learning applications. However,
conventional dataset condensation methods are optimization-oriented and
condense the dataset by performing gradient or parameter matching during model
optimization, which is computationally intensive even on small datasets and
models. In this paper, we propose a novel dataset condensation method based on
distribution matching, which is more efficient and promising. Specifically, we
identify two important shortcomings of naive distribution matching (i.e.,
imbalanced feature numbers and unvalidated embeddings for distance computation)
and address them with three novel techniques (i.e., partitioning and expansion
augmentation, efficient and enriched model sampling, and class-aware
distribution regularization). Our simple yet effective method outperforms most
previous optimization-oriented methods with much fewer computational resources,
thereby scaling data condensation to larger datasets and models. Extensive
experiments demonstrate the effectiveness of our method. Codes are available at
https://github.com/uitrbn/IDM
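The released code is PyTorch-based; below is a minimal, hedged sketch of the distribution-matching idea the abstract describes (matching class-wise feature statistics of real and synthetic images under freshly sampled embedder networks). The ConvNet embedder, tensor shapes, and training loop are illustrative assumptions and do not reproduce the paper's actual implementation.

```python
import torch
import torch.nn as nn

def distribution_matching_loss(model, real_by_class, syn_by_class):
    """Per-class mean-feature matching (a simplified sketch, not the official IDM code)."""
    loss = 0.0
    for c in real_by_class:
        f_real = model(real_by_class[c]).mean(dim=0)  # class-c real feature mean
        f_syn = model(syn_by_class[c]).mean(dim=0)    # class-c synthetic feature mean
        loss = loss + ((f_real - f_syn) ** 2).sum()
    return loss

# Illustrative setup: 3 classes of 32x32 RGB images, 10 synthetic images per class.
real_by_class = {c: torch.randn(128, 3, 32, 32) for c in range(3)}
syn_by_class = {c: torch.randn(10, 3, 32, 32, requires_grad=True) for c in range(3)}
opt = torch.optim.SGD(list(syn_by_class.values()), lr=1.0)

for step in range(10):
    # Re-sample a random embedder each step; an untrained ConvNet stands in here
    # for the paper's efficient and enriched model sampling.
    embed = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten())
    opt.zero_grad()
    loss = distribution_matching_loss(embed, real_by_class, syn_by_class)
    loss.backward()
    opt.step()
```

The three techniques named in the abstract (partitioning and expansion augmentation, efficient and enriched model sampling, class-aware distribution regularization) refine this naive matching; see the repository above for the actual implementation.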
Related papers
- Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws [59.03420759554073]
We introduce Adaptive Data Optimization (ADO), an algorithm that optimizes data distributions in an online fashion, concurrently with model training.
ADO does not require external knowledge, proxy models, or modifications to the model update.
ADO uses per-domain scaling laws to estimate the learning potential of each domain during training and adjusts the data mixture accordingly.
arXiv Detail & Related papers (2024-10-15T17:47:44Z)
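As a rough, hypothetical illustration of the ADO summary above (per-domain scaling laws used to estimate learning potential and reweight the data mixture), the sketch below fits a power law to each domain's loss history and samples domains in proportion to their predicted loss decrease. The function names, the fitting procedure, and the weighting rule are assumptions for illustration, not ADO's actual algorithm.

```python
import numpy as np

def fit_power_law(steps, losses):
    """Fit loss(t) ~ a * t**(-b) by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(steps), np.log(losses), 1)
    return np.exp(intercept), -slope  # (a, b)

def mixture_weights(loss_history, t_now, horizon=1000, temp=1.0):
    """Weight each domain by its predicted loss drop over the next `horizon` steps."""
    potentials = []
    for steps, losses in loss_history:  # one (steps, losses) pair per domain
        a, b = fit_power_law(np.asarray(steps), np.asarray(losses))
        predicted_drop = a * t_now ** (-b) - a * (t_now + horizon) ** (-b)
        potentials.append(max(predicted_drop, 1e-8))
    p = np.asarray(potentials) ** (1.0 / temp)
    return p / p.sum()  # sampling probabilities per domain

# Hypothetical loss curves for three domains, recorded every 100 steps.
steps = np.arange(100, 2100, 100)
history = [(steps, 3.0 * steps ** (-0.15)),
           (steps, 2.5 * steps ** (-0.05)),
           (steps, 4.0 * steps ** (-0.30))]
print(mixture_weights(history, t_now=2000))
```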
- High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates [50.406127962933915]
We develop solutions that enable learning a communication-efficient distributed logistic regression model.
In our experiments we demonstrate a large improvement in accuracy over existing distributed algorithms, with only a few distributed update steps needed.
arXiv Detail & Related papers (2024-07-08T19:34:39Z)
- M3D: Dataset Condensation by Minimizing Maximum Mean Discrepancy [26.227927019615446]
Training state-of-the-art (SOTA) deep models often requires extensive data, resulting in substantial training and storage costs.
Dataset condensation has been developed to learn a small synthetic set that preserves essential information from the original large-scale dataset.
We present M3D, a novel distribution matching (DM) based method that condenses a dataset by minimizing the Maximum Mean Discrepancy between the synthetic set and the original data.
arXiv Detail & Related papers (2023-12-26T07:45:32Z)
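For readers unfamiliar with the objective named in the M3D entry above, here is a minimal kernel-MMD sketch between real and synthetic feature batches. The Gaussian kernel, bandwidth, and toy tensors are illustrative assumptions rather than M3D's implementation.

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise Gaussian kernel values between rows of x and y.
    d2 = torch.cdist(x, y) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd2(f_real, f_syn, sigma=1.0):
    """Squared Maximum Mean Discrepancy between two feature batches."""
    k_rr = gaussian_kernel(f_real, f_real, sigma).mean()
    k_ss = gaussian_kernel(f_syn, f_syn, sigma).mean()
    k_rs = gaussian_kernel(f_real, f_syn, sigma).mean()
    return k_rr + k_ss - 2 * k_rs

# Illustrative: 256 real vs. 10 synthetic feature vectors of dimension 64.
f_real = torch.randn(256, 64)
f_syn = torch.randn(10, 64, requires_grad=True)
loss = mmd2(f_real, f_syn)
loss.backward()  # gradients flow into the synthetic set
```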
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST), a recently proposed and highly effective technique for reducing the communication and computation costs of distributed training.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Towards Efficient Deep Hashing Retrieval: Condensing Your Data via Feature-Embedding Matching [7.908244841289913]
The cost of training state-of-the-art deep hashing retrieval models has been increasing.
Existing state-of-the-art dataset distillation methods cannot be extended to all deep hashing retrieval methods.
We propose an efficient condensation framework that addresses these limitations by matching feature embeddings between the synthetic set and the real set.
arXiv Detail & Related papers (2023-05-29T13:23:55Z) - Accelerating Dataset Distillation via Model Augmentation [41.3027484667024]
We propose two model augmentation techniques, i.e., using early-stage models and parameter perturbation, to learn an informative synthetic set with significantly reduced training cost.
Our method achieves up to a 20x speedup with performance on par with state-of-the-art methods.
arXiv Detail & Related papers (2022-12-12T07:36:05Z)
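As a hedged illustration of the two model augmentations named in the entry above (early-stage models and parameter perturbation), the sketch below takes a lightly trained embedder and adds Gaussian noise to its weights before each matching step. The checkpoint path, noise scale, and model architecture are assumptions, not the paper's released code.

```python
import copy
import torch
import torch.nn as nn

def perturb_parameters(model, scale=0.01):
    """Return a copy of `model` with Gaussian noise added to every parameter."""
    noisy = copy.deepcopy(model)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(scale * torch.randn_like(p))
    return noisy

# Hypothetical early-stage embedder (e.g., trained for only a few epochs).
base = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                     nn.AdaptiveAvgPool2d(1), nn.Flatten())
# base.load_state_dict(torch.load("early_stage_ckpt.pt"))  # assumed checkpoint name

for step in range(5):
    embedder = perturb_parameters(base, scale=0.01)  # fresh augmented model each step
    # ... use `embedder` inside the distillation/matching objective ...
```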
- DC-BENCH: Dataset Condensation Benchmark [79.18718490863908]
This work provides the first large-scale standardized benchmark on dataset condensation.
It consists of a suite of evaluations to comprehensively reflect the generalizability and effectiveness of condensation methods.
The benchmark library is open-sourced to facilitate future research and application.
arXiv Detail & Related papers (2022-07-20T03:54:05Z)
- Condensing Graphs via One-Step Gradient Matching [50.07587238142548]
We propose a one-step gradient matching scheme, which performs gradient matching for only a single step, without training the network weights.
Our theoretical analysis shows this strategy can generate synthetic graphs that lead to lower classification loss on real graphs.
In particular, we are able to reduce the dataset size by 90% while approximating up to 98% of the original performance.
arXiv Detail & Related papers (2022-06-15T18:20:01Z)
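A minimal sketch of one-step gradient matching on a randomly initialized network is given below. The cosine-distance formulation, the tiny MLP, and the flat toy features are illustrative assumptions; the paper above applies the idea to graph data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gradient_matching_loss(net, loss_fn, real_batch, syn_batch):
    """Match gradients of the network loss computed on real vs. synthetic data."""
    params = list(net.parameters())
    x_r, y_r = real_batch
    x_s, y_s = syn_batch
    g_real = torch.autograd.grad(loss_fn(net(x_r), y_r), params)
    g_syn = torch.autograd.grad(loss_fn(net(x_s), y_s), params, create_graph=True)
    # Cosine distance between matched gradient tensors, summed over layers.
    return sum(1 - F.cosine_similarity(gr.flatten(), gs.flatten(), dim=0)
               for gr, gs in zip(g_real, g_syn))

# Toy setup: 2-class problem, 64-dim features, 8 synthetic examples.
net = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))  # never trained
x_real, y_real = torch.randn(256, 64), torch.randint(0, 2, (256,))
x_syn = torch.randn(8, 64, requires_grad=True)
y_syn = torch.randint(0, 2, (8,))

loss = gradient_matching_loss(net, F.cross_entropy, (x_real, y_real), (x_syn, y_syn))
loss.backward()  # one matching step; gradients flow into the synthetic examples
```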
- Learning Distributionally Robust Models at Scale via Composite Optimization [45.47760229170775]
We show that different variants of DRO are simply instances of a finite-sum composite optimization problem, for which we provide scalable methods.
We also provide empirical results demonstrating the effectiveness of our proposed algorithm relative to prior art when learning robust models from very large datasets.
arXiv Detail & Related papers (2022-03-17T20:47:42Z)
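As a loose illustration of the distributionally robust objectives discussed in the entry above, the sketch below minimizes the dual form of a KL-penalized DRO loss, which upweights high-loss examples via a log-sum-exp. The logistic model, temperature, and data are textbook choices assumed for illustration, not the paper's specific composite-optimization algorithm.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def kl_dro_loss(per_example_losses, tau=1.0):
    """Dual form of KL-penalized DRO: tau * log(mean_i exp(loss_i / tau)).
    Larger losses get exponentially more weight; tau -> inf recovers the mean."""
    n = per_example_losses.numel()
    return tau * (torch.logsumexp(per_example_losses / tau, dim=0) - math.log(n))

# Toy logistic-regression model on random data.
model = nn.Linear(20, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(512, 20), torch.randint(0, 2, (512,))

for step in range(100):
    opt.zero_grad()
    losses = F.cross_entropy(model(x), y, reduction="none")  # one loss per example
    kl_dro_loss(losses, tau=0.5).backward()
    opt.step()
```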