Distributional Dataset Distillation with Subtask Decomposition
- URL: http://arxiv.org/abs/2403.00999v1
- Date: Fri, 1 Mar 2024 21:49:34 GMT
- Title: Distributional Dataset Distillation with Subtask Decomposition
- Authors: Tian Qin, Zhiwei Deng, David Alvarez-Melis
- Abstract summary: We show that our method achieves state-of-the-art results on TinyImageNet and ImageNet-1K datasets.
Specifically, we outperform the prior art by $6.9\%$ on ImageNet-1K under the storage budget of 2 images per class.
- Score: 18.288856447840303
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: What does a neural network learn when training from a task-specific dataset?
Synthesizing this knowledge is the central idea behind Dataset Distillation,
which recent work has shown can be used to compress large datasets into a small
set of input-label pairs ($\textit{prototypes}$) that capture essential aspects
of the original dataset. In this paper, we make the key observation that
existing methods that distill into explicit prototypes are very often suboptimal,
incurring unexpected storage costs from distilled labels. In response, we
propose $\textit{Distributional Dataset Distillation}$ (D3), which encodes the
data using minimal sufficient per-class statistics paired with a decoder,
distilling the dataset into a compact distributional representation that is more
memory-efficient than prototype-based methods. To scale up the process
of learning these representations, we propose $\textit{Federated
distillation}$, which decomposes the dataset into subsets, distills them in
parallel using sub-task experts and then re-aggregates them. We thoroughly
evaluate our algorithm on a three-dimensional metric and show that our method
achieves state-of-the-art results on TinyImageNet and ImageNet-1K.
Specifically, we outperform the prior art by $6.9\%$ on ImageNet-1K under the
storage budget of 2 images per class.
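To make the abstract's two ideas concrete, here is a minimal sketch, assuming a PyTorch setting: learnable per-class latent statistics (a mean and a diagonal log-variance) paired with a small shared decoder stand in for the distributional representation, and a partition-then-aggregate loop stands in for Federated distillation. All names (PerClassGaussian, Decoder, federated_distill) and architectural choices are illustrative assumptions, not the authors' released implementation.
```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Small shared decoder mapping latent vectors to images (illustrative)."""
    def __init__(self, latent_dim=64, img_channels=3, img_size=32):
        super().__init__()
        self.out_shape = (img_channels, img_size, img_size)
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, img_channels * img_size * img_size), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z).view(z.size(0), *self.out_shape)

class PerClassGaussian(nn.Module):
    """Distributional representation: one mean and log-variance per class."""
    def __init__(self, num_classes, latent_dim=64):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(num_classes, latent_dim))
        self.log_var = nn.Parameter(torch.zeros(num_classes, latent_dim))

    def sample(self, labels):
        mu = self.mu[labels]
        std = (0.5 * self.log_var[labels]).exp()
        return mu + std * torch.randn_like(std)  # reparameterized draw

def synthesize(dist, decoder, num_classes, per_class=4):
    """Decode a labeled synthetic batch from the stored distributions."""
    labels = torch.arange(num_classes).repeat_interleave(per_class)
    return decoder(dist.sample(labels)), labels

def federated_distill(all_classes, num_subtasks, distill_subset):
    """Subtask decomposition: split the classes, distill each subset with its
    own sub-task expert (these calls could run in parallel), then re-aggregate."""
    subsets = torch.chunk(torch.as_tensor(all_classes), num_subtasks)
    experts = [distill_subset(s.tolist()) for s in subsets]
    return experts  # aggregation step: merge per-class statistics / decoders
```
Under this sketch, the stored artifact is only the per-class mean and variance vectors plus the shared decoder, which is the sense in which a distributional representation can be more memory-efficient than storing explicit image prototypes and their labels.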
Related papers
- Information-Guided Diffusion Sampling for Dataset Distillation [44.216998537570866]
Diffusion models (DMs) have shown promise for this task but struggle in low images-per-class (IPC) settings.
We identify two key types of information that a distilled dataset must preserve.
Experiments on Tiny ImageNet and ImageNet subsets show that information-guided diffusion sampling (IGDS) significantly outperforms existing methods.
arXiv Detail & Related papers (2025-07-07T02:27:08Z) - OD3: Optimization-free Dataset Distillation for Object Detection [23.09565778268426]
We introduce OD3, a novel optimization-free dataset distillation framework specifically designed for object detection.
Our approach involves two stages: first, a candidate selection process in which object instances are iteratively placed in synthesized images at suitable locations, and second, a candidate screening process that uses a pre-trained observer model to remove low-confidence objects.
Compared to the only prior dataset distillation method for detection and to conventional coreset selection methods, OD3 delivers superior accuracy and establishes new state-of-the-art results, surpassing the prior best method by more than 14% on COCO mAP50 at a compression ratio of
arXiv Detail & Related papers (2025-06-02T17:56:02Z) - Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images [60.42768987736088]
We introduce a benchmark that equitably evaluates methodologies across both distillation and pruning literatures.
Our benchmark reveals that in the mainstream dataset distillation setting for large-scale datasets, even randomly selected subsets can achieve surprisingly competitive performance.
We propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which focuses on leveraging image data exclusively.
arXiv Detail & Related papers (2025-02-10T13:11:40Z) - Curriculum Dataset Distillation [33.167484258219766]
We present a curriculum-based dataset distillation framework aiming to harmonize performance and scalability.
This framework strategically distills synthetic images, adhering to a curriculum that transitions from simple to complex.
Our framework sets new benchmarks in large-scale dataset distillation, achieving substantial improvements of 11.1% on Tiny-ImageNet, 9.0% on ImageNet-1K, and 7.3% on ImageNet-21K.
arXiv Detail & Related papers (2024-05-15T07:27:14Z) - Dataset Distillation via Adversarial Prediction Matching [24.487950991247764]
We propose an adversarial framework to solve the dataset distillation problem efficiently.
Our method can produce synthetic datasets just 10% the size of the original, yet achieve, on average, 94% of the test accuracy of models trained on the full original datasets.
arXiv Detail & Related papers (2023-12-14T13:19:33Z) - Dataset Distillation via Curriculum Data Synthesis in Large Data Era [26.883100340763317]
We introduce a simple yet effective global-to-local gradient refinement approach enabled by curriculum data augmentation during data synthesis.
The proposed model outperforms current state-of-the-art methods such as SRe$^2$L, TESLA, and MTT by more than 4% Top-1 accuracy on ImageNet-1K/21K and, for the first time, reduces the gap to its full-data training counterparts to less than 15% in absolute terms.
arXiv Detail & Related papers (2023-11-30T18:59:56Z) - Data Distillation Can Be Like Vodka: Distilling More Times For Better
Quality [78.6359306550245]
We argue that using just one synthetic subset for distillation will not yield optimal generalization performance.
PDD synthesizes multiple small sets of synthetic images, each conditioned on the previous sets, and trains the model on the cumulative union of these subsets.
Our experiments show that PDD can effectively improve the performance of existing dataset distillation methods by up to 4.3%.
arXiv Detail & Related papers (2023-10-10T20:04:44Z) - Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z) - DatasetEquity: Are All Samples Created Equal? In The Quest For Equity
Within Datasets [4.833815605196965]
This paper presents a novel method for addressing data imbalance in machine learning.
It computes sample likelihoods based on image appearance using deep perceptual embeddings and clustering.
It then uses these likelihoods to weight samples differently during training with a proposed $\textbf{Generalized Focal Loss}$ function.
arXiv Detail & Related papers (2023-08-19T02:11:49Z) - Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation [96.92250565207017]
We study the data efficiency and selection for the dataset distillation task.
By re-formulating the dynamics of distillation, we provide insight into the inherent redundancy in the real dataset.
We identify the samples that contribute most, based on their causal effects on the distillation.
arXiv Detail & Related papers (2023-05-28T06:53:41Z) - Generalizing Dataset Distillation via Deep Generative Prior [75.9031209877651]
We propose to distill an entire dataset's knowledge into a few synthetic images.
The idea is to synthesize a small number of synthetic data points that, when given to a learning algorithm as training data, result in a model approximating one trained on the original data.
We present a new optimization algorithm that distills a large number of images into a few intermediate feature vectors in the generative model's latent space.
arXiv Detail & Related papers (2023-05-02T17:59:31Z) - DiM: Distilling Dataset into Generative Model [42.32433831074992]
We propose a novel distillation scheme to $\textbf{D}$istill information of large train sets $\textbf{i}$nto generative $\textbf{M}$odels, named DiM.
During the distillation phase, we minimize the differences in logits predicted by a pool of models between real and generated images (a minimal sketch of this logit-matching step appears after this list).
At the deployment stage, the generative model synthesizes diverse training samples from random noise on the fly.
arXiv Detail & Related papers (2023-03-08T16:48:24Z) - Scrape, Cut, Paste and Learn: Automated Dataset Generation Applied to
Parcel Logistics [58.720142291102135]
We present a fully automated pipeline to generate a synthetic dataset for instance segmentation in four steps.
We first scrape images for the objects of interest from popular image search engines.
We compare three different methods for image selection: Object-agnostic pre-processing, manual image selection and CNN-based image selection.
arXiv Detail & Related papers (2022-10-18T12:49:04Z) - Primitive3D: 3D Object Dataset Synthesis from Randomly Assembled
Primitives [44.03149443379618]
We propose a cost-effective method for automatically generating a large number of 3D objects with annotations.
These objects are auto-annotated with part labels originating from primitives.
Considering the large overhead of learning on the generated dataset, we propose a dataset distillation strategy.
arXiv Detail & Related papers (2022-05-25T10:07:07Z) - DatasetGAN: Efficient Labeled Data Factory with Minimal Human Effort [117.41383937100751]
Current deep networks are extremely data-hungry, benefiting from training on large-scale datasets.
We show how the GAN latent code can be decoded to produce a semantic segmentation of the image.
These generated datasets can then be used for training any computer vision architecture just as real datasets are.
arXiv Detail & Related papers (2021-04-13T20:08:29Z) - How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets.
Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z)
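As referenced in the DiM entry above, here is a minimal, hedged sketch of the logit-matching step it describes, again in PyTorch. The element-wise pairing of real and generated batches and the use of a mean-squared difference are illustrative assumptions, not the paper's exact objective.
```python
import torch
import torch.nn.functional as F

def logit_matching_loss(real_images, generator, model_pool, latent_dim=128):
    """Train a generator so that a pool of models produces similar logits on
    generated images as on a real batch of the same size (illustrative)."""
    z = torch.randn(real_images.size(0), latent_dim, device=real_images.device)
    fake_images = generator(z)                 # synthesize candidate samples
    loss = 0.0
    for model in model_pool:
        with torch.no_grad():
            real_logits = model(real_images)   # reference logits on real data
        fake_logits = model(fake_images)       # logits on generated data
        loss = loss + F.mse_loss(fake_logits, real_logits)
    return loss / len(model_pool)
```
At deployment, only the trained generator needs to be stored; training samples are then synthesized from random noise on the fly, as the entry describes.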
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.