Improved Distribution Matching for Dataset Condensation
- URL: http://arxiv.org/abs/2307.09742v1
- Date: Wed, 19 Jul 2023 04:07:33 GMT
- Title: Improved Distribution Matching for Dataset Condensation
- Authors: Ganlong Zhao, Guanbin Li, Yipeng Qin, Yizhou Yu
- Abstract summary: We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
- Score: 91.55972945798531
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dataset Condensation aims to condense a large dataset into a smaller one
while maintaining its ability to train a well-performing model, thus reducing
the storage cost and training effort in deep learning applications. However,
conventional dataset condensation methods are optimization-oriented and
condense the dataset by performing gradient or parameter matching during model
optimization, which is computationally intensive even on small datasets and
models. In this paper, we propose a novel dataset condensation method based on
distribution matching, which is more efficient and promising. Specifically, we
identify two important shortcomings of naive distribution matching (i.e.,
imbalanced feature numbers and unvalidated embeddings for distance computation)
and address them with three novel techniques (i.e., partitioning and expansion
augmentation, efficient and enriched model sampling, and class-aware
distribution regularization). Our simple yet effective method outperforms most
previous optimization-oriented methods with much fewer computational resources,
thereby scaling data condensation to larger datasets and models. Extensive
experiments demonstrate the effectiveness of our method. Codes are available at
https://github.com/uitrbn/IDM
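The released code is PyTorch-based; below is a minimal, hedged sketch of the distribution-matching idea the abstract describes (matching class-wise feature statistics of real and synthetic images under freshly sampled embedder networks). The ConvNet embedder, tensor shapes, and training loop are illustrative assumptions and do not reproduce the paper's actual implementation.

```python
import torch
import torch.nn as nn

def distribution_matching_loss(model, real_by_class, syn_by_class):
    """Per-class mean-feature matching (a simplified sketch, not the official IDM code)."""
    loss = 0.0
    for c in real_by_class:
        f_real = model(real_by_class[c]).mean(dim=0)  # class-c real feature mean
        f_syn = model(syn_by_class[c]).mean(dim=0)    # class-c synthetic feature mean
        loss = loss + ((f_real - f_syn) ** 2).sum()
    return loss

# Illustrative setup: 3 classes of 32x32 RGB images, 10 synthetic images per class.
real_by_class = {c: torch.randn(128, 3, 32, 32) for c in range(3)}
syn_by_class = {c: torch.randn(10, 3, 32, 32, requires_grad=True) for c in range(3)}
opt = torch.optim.SGD(list(syn_by_class.values()), lr=1.0)

for step in range(10):
    # Re-sample a random embedder each step; an untrained ConvNet stands in here
    # for the paper's efficient and enriched model sampling.
    embed = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten())
    opt.zero_grad()
    loss = distribution_matching_loss(embed, real_by_class, syn_by_class)
    loss.backward()
    opt.step()
```

The three techniques named in the abstract (partitioning and expansion augmentation, efficient and enriched model sampling, class-aware distribution regularization) refine this naive matching; see the repository above for the actual implementation.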
Related papers
- Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws [59.03420759554073]
We introduce Adaptive Data Optimization (ADO), an algorithm that optimizes data distributions in an online fashion, concurrently with model training.
ADO does not require external knowledge, proxy models, or modifications to the model update.
ADO uses per-domain scaling laws to estimate the learning potential of each domain during training and adjusts the data mixture accordingly.
arXiv Detail & Related papers (2024-10-15T17:47:44Z)
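As a rough, hypothetical illustration of the ADO summary above (per-domain scaling laws used to estimate learning potential and reweight the data mixture), the sketch below fits a power law to each domain's loss history and samples domains in proportion to their predicted loss decrease. The function names, the fitting procedure, and the weighting rule are assumptions for illustration, not ADO's actual algorithm.

```python
import numpy as np

def fit_power_law(steps, losses):
    """Fit loss(t) ~ a * t**(-b) by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(steps), np.log(losses), 1)
    return np.exp(intercept), -slope  # (a, b)

def mixture_weights(loss_history, t_now, horizon=1000, temp=1.0):
    """Weight each domain by its predicted loss drop over the next `horizon` steps."""
    potentials = []
    for steps, losses in loss_history:  # one (steps, losses) pair per domain
        a, b = fit_power_law(np.asarray(steps), np.asarray(losses))
        predicted_drop = a * t_now ** (-b) - a * (t_now + horizon) ** (-b)
        potentials.append(max(predicted_drop, 1e-8))
    p = np.asarray(potentials) ** (1.0 / temp)
    return p / p.sum()  # sampling probabilities per domain

# Hypothetical loss curves for three domains, recorded every 100 steps.
steps = np.arange(100, 2100, 100)
history = [(steps, 3.0 * steps ** (-0.15)),
           (steps, 2.5 * steps ** (-0.05)),
           (steps, 4.0 * steps ** (-0.30))]
print(mixture_weights(history, t_now=2000))
```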
- High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates [50.406127962933915]
We develop solutions that enable learning a communication-efficient distributed logistic regression model.
In our experiments we demonstrate a large improvement in accuracy over existing distributed algorithms, with only a few distributed update steps needed.
arXiv Detail & Related papers (2024-07-08T19:34:39Z)
- M3D: Dataset Condensation by Minimizing Maximum Mean Discrepancy [26.227927019615446]
Training state-of-the-art (SOTA) deep models often requires extensive data, resulting in substantial training and storage costs.
Dataset condensation has been developed to learn a small synthetic set that preserves essential information from the original large-scale dataset.
We present M3D, a novel distribution matching (DM) based method that condenses a dataset by minimizing the Maximum Mean Discrepancy between the synthetic set and the original data.
arXiv Detail & Related papers (2023-12-26T07:45:32Z)
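For readers unfamiliar with the objective named in the M3D entry above, here is a minimal kernel-MMD sketch between real and synthetic feature batches. The Gaussian kernel, bandwidth, and toy tensors are illustrative assumptions rather than M3D's implementation.

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise Gaussian kernel values between rows of x and y.
    d2 = torch.cdist(x, y) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd2(f_real, f_syn, sigma=1.0):
    """Squared Maximum Mean Discrepancy between two feature batches."""
    k_rr = gaussian_kernel(f_real, f_real, sigma).mean()
    k_ss = gaussian_kernel(f_syn, f_syn, sigma).mean()
    k_rs = gaussian_kernel(f_real, f_syn, sigma).mean()
    return k_rr + k_ss - 2 * k_rs

# Illustrative: 256 real vs. 10 synthetic feature vectors of dimension 64.
f_real = torch.randn(256, 64)
f_syn = torch.randn(10, 64, requires_grad=True)
loss = mmd2(f_real, f_syn)
loss.backward()  # gradients flow into the synthetic set
```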
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST), a recently proposed and highly effective technique for reducing the communication and computation costs of distributed training.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Towards Efficient Deep Hashing Retrieval: Condensing Your Data via Feature-Embedding Matching [7.908244841289913]
The cost of training state-of-the-art deep hashing retrieval models has been increasing.
Existing state-of-the-art dataset distillation methods cannot be extended to all deep hashing retrieval methods.
We propose an efficient condensation framework that addresses these limitations by matching feature embeddings between the synthetic set and the real set.
arXiv Detail & Related papers (2023-05-29T13:23:55Z) - Accelerating Dataset Distillation via Model Augmentation [41.3027484667024]
We propose two model augmentation techniques, i.e., using early-stage models and parameter perturbation, to learn an informative synthetic set with significantly reduced training cost.
Our method achieves up to a 20x speedup with performance on par with state-of-the-art methods.
arXiv Detail & Related papers (2022-12-12T07:36:05Z)
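As a hedged illustration of the two model augmentations named in the entry above (early-stage models and parameter perturbation), the sketch below takes a lightly trained embedder and adds Gaussian noise to its weights before each matching step. The checkpoint path, noise scale, and model architecture are assumptions, not the paper's released code.

```python
import copy
import torch
import torch.nn as nn

def perturb_parameters(model, scale=0.01):
    """Return a copy of `model` with Gaussian noise added to every parameter."""
    noisy = copy.deepcopy(model)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(scale * torch.randn_like(p))
    return noisy

# Hypothetical early-stage embedder (e.g., trained for only a few epochs).
base = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                     nn.AdaptiveAvgPool2d(1), nn.Flatten())
# base.load_state_dict(torch.load("early_stage_ckpt.pt"))  # assumed checkpoint name

for step in range(5):
    embedder = perturb_parameters(base, scale=0.01)  # fresh augmented model each step
    # ... use `embedder` inside the distillation/matching objective ...
```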
- DC-BENCH: Dataset Condensation Benchmark [79.18718490863908]
This work provides the first large-scale standardized benchmark on dataset condensation.
It consists of a suite of evaluations to comprehensively reflect the generalizability and effectiveness of condensation methods.
The benchmark library is open-sourced to facilitate future research and application.
arXiv Detail & Related papers (2022-07-20T03:54:05Z)
- Condensing Graphs via One-Step Gradient Matching [50.07587238142548]
We propose a one-step gradient matching scheme, which performs gradient matching for only a single step, without training the network weights.
Our theoretical analysis shows this strategy can generate synthetic graphs that lead to lower classification loss on real graphs.
In particular, we are able to reduce the dataset size by 90% while approximating up to 98% of the original performance.
arXiv Detail & Related papers (2022-06-15T18:20:01Z)
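A minimal sketch of one-step gradient matching on a randomly initialized network is given below. The cosine-distance formulation, the tiny MLP, and the flat toy features are illustrative assumptions; the paper above applies the idea to graph data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gradient_matching_loss(net, loss_fn, real_batch, syn_batch):
    """Match gradients of the network loss computed on real vs. synthetic data."""
    params = list(net.parameters())
    x_r, y_r = real_batch
    x_s, y_s = syn_batch
    g_real = torch.autograd.grad(loss_fn(net(x_r), y_r), params)
    g_syn = torch.autograd.grad(loss_fn(net(x_s), y_s), params, create_graph=True)
    # Cosine distance between matched gradient tensors, summed over layers.
    return sum(1 - F.cosine_similarity(gr.flatten(), gs.flatten(), dim=0)
               for gr, gs in zip(g_real, g_syn))

# Toy setup: 2-class problem, 64-dim features, 8 synthetic examples.
net = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))  # never trained
x_real, y_real = torch.randn(256, 64), torch.randint(0, 2, (256,))
x_syn = torch.randn(8, 64, requires_grad=True)
y_syn = torch.randint(0, 2, (8,))

loss = gradient_matching_loss(net, F.cross_entropy, (x_real, y_real), (x_syn, y_syn))
loss.backward()  # one matching step; gradients flow into the synthetic examples
```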
- Learning Distributionally Robust Models at Scale via Composite Optimization [45.47760229170775]
We show that different variants of DRO are simply instances of a finite-sum composite optimization problem, for which we provide scalable methods.
We also provide empirical results demonstrating the effectiveness of our proposed algorithm relative to prior art when learning robust models from very large datasets.
arXiv Detail & Related papers (2022-03-17T20:47:42Z)
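As a loose illustration of the distributionally robust objectives discussed in the entry above, the sketch below minimizes the dual form of a KL-penalized DRO loss, which upweights high-loss examples via a log-sum-exp. The logistic model, temperature, and data are textbook choices assumed for illustration, not the paper's specific composite-optimization algorithm.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def kl_dro_loss(per_example_losses, tau=1.0):
    """Dual form of KL-penalized DRO: tau * log(mean_i exp(loss_i / tau)).
    Larger losses get exponentially more weight; tau -> inf recovers the mean."""
    n = per_example_losses.numel()
    return tau * (torch.logsumexp(per_example_losses / tau, dim=0) - math.log(n))

# Toy logistic-regression model on random data.
model = nn.Linear(20, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(512, 20), torch.randint(0, 2, (512,))

for step in range(100):
    opt.zero_grad()
    losses = F.cross_entropy(model(x), y, reduction="none")  # one loss per example
    kl_dro_loss(losses, tau=0.5).backward()
    opt.step()
```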