MIM4DD: Mutual Information Maximization for Dataset Distillation
- URL: http://arxiv.org/abs/2312.16627v1
- Date: Wed, 27 Dec 2023 16:22:50 GMT
- Title: MIM4DD: Mutual Information Maximization for Dataset Distillation
- Authors: Yuzhang Shang, Zhihang Yuan, Yan Yan
- Abstract summary: We introduce mutual information (MI) as the metric to quantify the shared information between the synthetic and the real datasets.
We devise MIM4DD to numerically maximize the MI via a newly designed optimizable objective within a contrastive learning framework.
Experiment results show that MIM4DD can be implemented as an add-on module to existing SoTA DD methods.
- Score: 15.847690902246727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dataset distillation (DD) aims to synthesize a small dataset whose test
performance is comparable to a full dataset using the same model.
State-of-the-art (SoTA) methods optimize synthetic datasets primarily by
matching heuristic indicators extracted from two networks: one from real data
and one from synthetic data (see Fig.1, Left), such as gradients and training
trajectories. DD is essentially a compression problem that emphasizes
maximizing the preservation of the information contained in the data. We argue that
well-defined metrics from information theory, which measure the amount of shared
information between variables, are necessary for measuring this preservation but have
never been considered by previous works. Thus, we introduce mutual information (MI)
as the metric to quantify the shared information between the synthetic and the
real datasets, and devise MIM4DD numerically maximizing the MI via a newly
designed optimizable objective within a contrastive learning framework to
update the synthetic dataset. Specifically, we designate samples from
different datasets that share the same label as positive pairs and those with
different labels as negative pairs. We then pull together samples in positive
pairs and push apart samples in negative pairs in the contrastive space by
minimizing an NCE loss. As a result, the targeted MI can be transformed into a
lower bound expressed in terms of the samples' feature maps, which makes it
numerically tractable. Experimental results show that
MIM4DD can be implemented as an add-on module to existing SoTA DD methods.
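The label-conditioned contrastive objective can be illustrated with a short sketch. The snippet below is a minimal PyTorch illustration, not the authors' released implementation; the feature tensors, label tensors, and temperature are assumed names for quantities produced by the real-data and synthetic-data networks. Minimizing this NCE loss maximizes an InfoNCE-style lower bound on the MI, and its gradient can flow back into the synthetic images.

```python
# Minimal sketch of the label-conditioned contrastive objective described in
# the abstract (illustrative, not the authors' released code). Same-label
# real/synthetic pairs are positives; all other cross-dataset pairs are
# negatives. `temperature` is an assumed hyperparameter.
import torch
import torch.nn.functional as F

def contrastive_mi_loss(feat_real, feat_syn, labels_real, labels_syn,
                        temperature=0.1):
    """NCE loss whose minimization maximizes an InfoNCE lower bound on the
    MI between real and synthetic feature maps.

    feat_real: (N, D) features from the network trained on real data
    feat_syn:  (M, D) features from the network trained on synthetic data,
               differentiable w.r.t. the synthetic images
    """
    z_r = F.normalize(feat_real, dim=1)       # unit-norm embeddings
    z_s = F.normalize(feat_syn, dim=1)
    logits = z_s @ z_r.t() / temperature      # (M, N) pairwise similarities

    # Positive mask: synthetic sample i and real sample j share a label.
    pos = labels_syn.unsqueeze(1).eq(labels_real.unsqueeze(0)).float()

    # Pull positives together, push negatives apart via a softmax over real samples.
    log_prob = F.log_softmax(logits, dim=1)
    loss = -(pos * log_prob).sum(dim=1) / pos.sum(dim=1).clamp(min=1.0)
    return loss.mean()
```

In practice this term would be added, with a weighting coefficient, to an existing DD objective such as gradient or trajectory matching, consistent with the paper's claim that MIM4DD works as an add-on module.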
Related papers
- Going Beyond Feature Similarity: Effective Dataset distillation based on Class-aware Conditional Mutual Information [43.44508080585033]
We introduce conditional mutual information (CMI) to assess the class-aware complexity of a dataset.
We minimize the distillation loss while constraining the class-aware complexity of the synthetic dataset.
arXiv Detail & Related papers (2024-12-13T08:10:47Z) - Not All Samples Should Be Utilized Equally: Towards Understanding and Improving Dataset Distillation [57.6797306341115]
We take an initial step towards understanding various matching-based DD methods from the perspective of sample difficulty.
We then extend the neural scaling laws of data pruning to DD to theoretically explain these matching-based methods.
We introduce the Sample Difficulty Correction (SDC) approach, designed to predominantly generate easier samples to achieve higher dataset quality.
arXiv Detail & Related papers (2024-08-22T15:20:32Z) - MDM: Advancing Multi-Domain Distribution Matching for Automatic Modulation Recognition Dataset Synthesis [35.07663680944459]
Deep learning technology has been successfully introduced into Automatic Modulation Recognition (AMR) tasks.
This success is largely attributed to training on large-scale datasets.
To reduce the reliance on such large amounts of data, some researchers have proposed dataset distillation.
arXiv Detail & Related papers (2024-08-05T14:16:54Z) - Dataset Condensation with Latent Quantile Matching [5.466962214217334]
Current distribution matching (DM) based DC methods learn a synthesized dataset by matching the mean of the latent embeddings between the synthetic and the real datasets.
We propose Latent Quantile Matching (LQM), which matches the quantiles of the latent embeddings to minimize a goodness-of-fit test statistic between the two distributions (a minimal sketch appears after this list).
arXiv Detail & Related papers (2024-06-14T09:20:44Z) - Importance-Aware Adaptive Dataset Distillation [53.79746115426363]
The development of deep learning models is enabled by the availability of large-scale datasets.
Dataset distillation aims to synthesize a compact dataset that retains the essential information of the large original dataset.
We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance.
arXiv Detail & Related papers (2024-01-29T03:29:39Z) - Sequential Subset Matching for Dataset Distillation [44.322842898670565]
We propose a new dataset distillation strategy called Sequential Subset Matching (SeqMatch).
Our analysis indicates that SeqMatch effectively addresses the coupling issue by sequentially generating the synthetic instances.
Our code is available at https://github.com/shqii1j/seqmatch.
arXiv Detail & Related papers (2023-11-02T19:49:11Z) - Self-Supervised Dataset Distillation for Transfer Learning [77.4714995131992]
We propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL).
We first prove that the gradient of synthetic samples with respect to an SSL objective in naive bilevel optimization is biased due to the randomness originating from data augmentations or masking.
We empirically validate the effectiveness of our method on various applications involving transfer learning.
arXiv Detail & Related papers (2023-10-10T10:48:52Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - ScoreMix: A Scalable Augmentation Strategy for Training GANs with
Limited Data [93.06336507035486]
Generative Adversarial Networks (GANs) typically suffer from overfitting when limited training data is available.
We present ScoreMix, a novel and scalable data augmentation approach for various image synthesis tasks.
arXiv Detail & Related papers (2022-10-27T02:55:15Z)