Information-Guided Diffusion Sampling for Dataset Distillation
- URL: http://arxiv.org/abs/2507.04619v1
- Date: Mon, 07 Jul 2025 02:27:08 GMT
- Title: Information-Guided Diffusion Sampling for Dataset Distillation
- Authors: Linfeng Ye, Shayan Mohajer Hamidi, Guang Li, Takahiro Ogawa, Miki Haseyama, Konstantinos N. Plataniotis
- Abstract summary: Diffusion models (DMs) have shown promise for this task but struggle in low images-per-class (IPC) settings. We identify two key types of information that a distilled dataset must preserve. Experiments on Tiny ImageNet and ImageNet subsets show that information-guided diffusion sampling (IGDS) significantly outperforms existing methods.
- Score: 44.216998537570866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dataset distillation aims to create a compact dataset that retains essential information while maintaining model performance. Diffusion models (DMs) have shown promise for this task but struggle in low images-per-class (IPC) settings, where generated samples lack diversity. In this paper, we address this issue from an information-theoretic perspective by identifying two key types of information that a distilled dataset must preserve: ($i$) prototype information $\mathrm{I}(X;Y)$, which captures label-relevant features; and ($ii$) contextual information $\mathrm{H}(X | Y)$, which preserves intra-class variability. Here, $(X,Y)$ represents the pair of random variables corresponding to the input data and its ground truth label, respectively. Observing that the required contextual information scales with IPC, we propose maximizing $\mathrm{I}(X;Y) + \beta \mathrm{H}(X | Y)$ during the DM sampling process, where $\beta$ is IPC-dependent. Since directly computing $\mathrm{I}(X;Y)$ and $\mathrm{H}(X | Y)$ is intractable, we develop variational estimations to tightly lower-bound these quantities via a data-driven approach. Our approach, information-guided diffusion sampling (IGDS), seamlessly integrates with diffusion models and improves dataset distillation across all IPC settings. Experiments on Tiny ImageNet and ImageNet subsets show that IGDS significantly outperforms existing methods, particularly in low-IPC regimes. The code will be released upon acceptance.
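As a concrete illustration of the objective above, a guidance term of this shape could be added to a standard diffusion sampling loop. In the sketch below, the prototype term $\mathrm{I}(X;Y)$ is lower-bounded by a pretrained classifier's log-likelihood (the usual variational bound), and the contextual term $\mathrm{H}(X|Y)$ is replaced by a kernel-density entropy proxy over the same-class batch; both the classifier `clf` and the Gaussian-kernel proxy are stand-ins for the paper's data-driven estimators, not its actual implementation.

```python
import torch
import torch.nn.functional as F

def igds_guidance(x0_hat, y, clf, beta, sigma=1.0):
    """Gradient of a surrogate for I(X;Y) + beta * H(X|Y), taken w.r.t. the
    denoiser's current x0 prediction. `clf` acts as a variational posterior
    q(y|x); the conditional entropy is approximated by a KDE-style proxy
    over the (same-class) batch. Both choices are illustrative."""
    x = x0_hat.detach().requires_grad_(True)
    # Prototype term: log q(y|x) lower-bounds I(X;Y) up to a constant.
    proto = F.log_softmax(clf(x), dim=-1).gather(1, y[:, None]).sum()
    # Contextual term: a kernel-density entropy estimate that rewards
    # intra-class spread among the generated samples.
    flat = x.flatten(1)
    d2 = (flat[:, None, :] - flat[None, :, :]).pow(2).sum(-1)
    ctx = -torch.logsumexp(-d2 / (2 * sigma ** 2), dim=1).sum()
    (proto + beta * ctx).backward()
    return x.grad

# Inside a sampler step, the update could look like:
#   x0_hat = x0_hat + guidance_scale * igds_guidance(x0_hat, y, clf, beta)
```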
Related papers
- Solving dynamic portfolio selection problems via score-based diffusion models [3.8857791305699565]
We tackle the dynamic mean-variance portfolio selection problem in a model-free manner, based on (generative) diffusion models. We propose using data sampled from the real model $\mathbb{P}$ with limited size to train a generative model $\mathbb{Q}$. With adaptive training and sampling methods that are tailor-made for time series data, we obtain bounds between $\mathbb{P}$ and $\mathbb{Q}$.
arXiv Detail & Related papers (2025-07-14T04:41:49Z) - Simple Semi-supervised Knowledge Distillation from Vision-Language Models via $\mathbf{\texttt{D}}$ual-$\mathbf{\texttt{H}}$ead $\mathbf{\texttt{O}}$ptimization [49.2338910653152]
Vision-language models (VLMs) have achieved remarkable success across diverse tasks by leveraging rich textual information with minimal labeled data. Knowledge distillation (KD) offers a well-established solution to this problem; however, recent KD approaches from VLMs often involve multi-stage training or additional tuning. We propose $\mathbf{\texttt{DHO}}$ -- a simple yet effective KD framework that transfers knowledge from VLMs to compact, task-specific models in semi-supervised settings.
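From the summary alone, a dual-head student in the spirit of $\mathbf{\texttt{DHO}}$ might look like the following sketch: one linear head fits the few available labels with cross-entropy while a second head distills the VLM teacher's soft predictions, so the two signals do not interfere in a single head. The class names, loss weighting, and temperature are assumptions, not the paper's exact recipe.

```python
import torch.nn as nn
import torch.nn.functional as F

class DualHeadStudent(nn.Module):
    """Sketch of a dual-head KD student: separate heads for the labeled
    data and for the teacher's soft targets (naming is illustrative)."""
    def __init__(self, backbone, feat_dim, n_classes):
        super().__init__()
        self.backbone = backbone
        self.head_sup = nn.Linear(feat_dim, n_classes)  # labeled-data head
        self.head_kd = nn.Linear(feat_dim, n_classes)   # distillation head

    def forward(self, x):
        z = self.backbone(x)
        return self.head_sup(z), self.head_kd(z)

def dho_style_loss(logits_sup, logits_kd, y, teacher_probs, tau=2.0):
    ce = F.cross_entropy(logits_sup, y)                 # supervised signal
    kd = F.kl_div(F.log_softmax(logits_kd / tau, -1),   # teacher signal
                  teacher_probs, reduction="batchmean") * tau ** 2
    return ce + kd
```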
arXiv Detail & Related papers (2025-05-12T15:39:51Z) - Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization [65.8915778873691]
Learning conditional distributions is a central problem in machine learning. We propose a new paradigm that integrates both paired and unpaired data. We show that our approach can theoretically recover true conditional distributions with arbitrarily small error.
arXiv Detail & Related papers (2024-10-03T16:12:59Z) - Local Flow Matching Generative Models [19.859984725284896]
Local Flow Matching is a computational framework for density estimation based on flow-based generative models. $\texttt{LFM}$ employs a simulation-free scheme and incrementally learns a sequence of Flow Matching sub-models. We demonstrate the improved training efficiency and competitive generative performance of $\texttt{LFM}$ compared to FM.
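For reference, each sub-model in such a sequence can be trained with a standard conditional flow-matching step; the sketch below shows one such step on flattened `(batch, dim)` samples, with LFM's contribution being that each sub-model only needs to bridge two nearby distributions. This is generic flow matching, not code from the paper.

```python
import torch

def fm_step(v_theta, x0, x1, opt):
    """One conditional flow-matching step for a single sub-model: regress
    the velocity field v_theta onto the straight-line displacement between
    source samples x0 and target samples x1, both of shape (batch, dim)."""
    t = torch.rand(x0.shape[0], 1)
    xt = (1 - t) * x0 + t * x1                        # linear interpolation
    loss = ((v_theta(xt, t) - (x1 - x0)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```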
arXiv Detail & Related papers (2024-10-03T14:53:10Z) - Distributional Dataset Distillation with Subtask Decomposition [18.288856447840303]
We show that our method achieves state-of-the-art results on TinyImageNet and ImageNet-1K datasets.
Specifically, we outperform the prior art by $6.9\%$ on ImageNet-1K under the storage budget of 2 images per class.
arXiv Detail & Related papers (2024-03-01T21:49:34Z) - Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning [17.40655778450583]
We propose a principled metric named Variance Alignment Score (VAS), which has the form $\langle \Sigma_{\text{test}}, \Sigma_i \rangle$.
We show that applying VAS and CLIP scores together can outperform baselines by a margin of $1.3%$ on 38 evaluation sets for noisy dataset DataComp and $2.5%$ on VTAB for high-quality dataset CC12M.
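Reading the inner product $\langle \Sigma_{\text{test}}, \Sigma_i \rangle$ as a Frobenius inner product, a minimal VAS computation might look as follows, assuming $\Sigma_{\text{test}}$ is the covariance of a held-out embedding set and $\Sigma_i$ the outer product of the $i$-th sample's embedding; the paper should be consulted for the exact definitions.

```python
import torch

def variance_alignment_score(emb_i, test_embs):
    """VAS ~ <Sigma_test, Sigma_i> under the assumptions above.
    emb_i: (d,) sample embedding; test_embs: (n, d) held-out embeddings."""
    centered = test_embs - test_embs.mean(0, keepdim=True)
    sigma_test = centered.T @ centered / test_embs.shape[0]  # (d, d)
    # <Sigma_test, e e^T>_F = e^T Sigma_test e; avoids the d x d outer product.
    return emb_i @ sigma_test @ emb_i
```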
arXiv Detail & Related papers (2024-02-03T06:29:04Z) - DatasetEquity: Are All Samples Created Equal? In The Quest For Equity Within Datasets [4.833815605196965]
This paper presents a novel method for addressing data imbalance in machine learning.
It computes sample likelihoods based on image appearance using deep perceptual embeddings and clustering.
It then uses these likelihoods to weigh samples differently during training with a proposed $\textbf{Generalized Focal Loss}$ function.
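A rough sketch of that pipeline, under two loud assumptions: sample likelihood is proxied by the relative size of the embedding cluster a sample falls into, and the weight follows a focal-style $(1-p)^\gamma$ shape. The paper's Generalized Focal Loss may differ in its exact form.

```python
import torch
from sklearn.cluster import KMeans

def sample_weights_from_likelihood(embs, n_clusters=50, gamma=2.0):
    """Likelihood-based sample weighting in the spirit of DatasetEquity.
    embs: (n, d) CPU tensor of deep perceptual embeddings."""
    labels = torch.as_tensor(
        KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embs.numpy())
    )
    counts = torch.bincount(labels, minlength=n_clusters).float()
    p = counts[labels] / len(labels)   # likelihood proxy per sample
    return (1.0 - p).pow(gamma)        # rarer samples get larger weights
```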
arXiv Detail & Related papers (2023-08-19T02:11:49Z) - Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information [67.25713071340518]
Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans. We frame dataset difficulty as the lack of $\mathcal{V}$-usable information. We also introduce pointwise $\mathcal{V}$-information (PVI) for measuring the difficulty of individual instances.
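PVI admits a compact reading: the extra bits of information about the label that a model family extracts from the real input versus a null input. A sketch under that reading, with illustrative argument names (two classifiers fine-tuned with and without inputs):

```python
import math
import torch.nn.functional as F

def pvi(student, null_student, x, x_null, y):
    """Pointwise V-information per instance, in bits: higher PVI means an
    easier instance for this model family (argument names are assumptions)."""
    logp_x = F.log_softmax(student(x), -1).gather(1, y[:, None])
    logp_0 = F.log_softmax(null_student(x_null), -1).gather(1, y[:, None])
    return (logp_x - logp_0).squeeze(1) / math.log(2)  # nats -> bits
```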
arXiv Detail & Related papers (2021-10-16T00:21:42Z) - Examining and Combating Spurious Features under Distribution Shift [94.31956965507085]
We define and analyze robust and spurious representations using the information-theoretic concept of minimal sufficient statistics.
We prove that even when there is only bias in the input distribution, models can still pick up spurious features from their training data.
Inspired by our analysis, we demonstrate that group DRO can fail when groups do not directly account for various spurious correlations.
arXiv Detail & Related papers (2021-06-14T05:39:09Z) - On the Difference Between the Information Bottleneck and the Deep
Information Bottleneck [81.89141311906552]
We revisit the Deep Variational Information Bottleneck and the assumptions needed for its derivation.
We show how to circumvent this limitation by optimising a lower bound for $I(T;Y)$ for which only the latter of the two Markov chain assumptions has to be satisfied.
arXiv Detail & Related papers (2019-12-31T18:31:42Z)