Distributional Dataset Distillation with Subtask Decomposition
- URL: http://arxiv.org/abs/2403.00999v1
- Date: Fri, 1 Mar 2024 21:49:34 GMT
- Title: Distributional Dataset Distillation with Subtask Decomposition
- Authors: Tian Qin, Zhiwei Deng, David Alvarez-Melis
- Abstract summary: We show that our method achieves state-of-the-art results on TinyImageNet and ImageNet-1K datasets.
Specifically, we outperform the prior art by $6.9\%$ on ImageNet-1K under the storage budget of 2 images per class.
- Score: 18.288856447840303
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: What does a neural network learn when training from a task-specific dataset?
Synthesizing this knowledge is the central idea behind Dataset Distillation,
which recent work has shown can be used to compress large datasets into a small
set of input-label pairs ($\textit{prototypes}$) that capture essential aspects
of the original dataset. In this paper, we make the key observation that
existing methods that distill into explicit prototypes are often suboptimal,
incurring unexpected storage costs from distilled labels. In response, we
propose $\textit{Distributional Dataset Distillation}$ (D3), which encodes the
data using minimal sufficient per-class statistics paired with a decoder,
distilling the dataset into a compact distributional representation that is
more memory-efficient than prototype-based methods. To scale up the process
of learning these representations, we propose $\textit{Federated
distillation}$, which decomposes the dataset into subsets, distills them in
parallel using sub-task experts and then re-aggregates them. We thoroughly
evaluate our algorithm on a three-dimensional metric and show that our method
achieves state-of-the-art results on TinyImageNet and ImageNet-1K.
Specifically, we outperform the prior art by $6.9\%$ on ImageNet-1K under the
storage budget of 2 images per class.
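Below is a minimal sketch of the distributional idea described in the abstract: each class is summarized by compact latent statistics (here, the mean and diagonal log-variance of a Gaussian, which is an assumption) and a shared decoder maps latent samples to synthetic images on demand. The class count, latent size, and toy decoder are illustrative choices, not the paper's actual architecture.

```python
# Sketch only: per-class latent statistics plus a shared decoder (assumed
# diagonal-Gaussian latents and a toy MLP decoder, not the paper's exact design).
import torch
import torch.nn as nn

class ClassDistribution(nn.Module):
    """Minimal per-class statistics: a latent mean and log-variance per class."""
    def __init__(self, num_classes: int, latent_dim: int):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(num_classes, latent_dim))
        self.log_var = nn.Parameter(torch.zeros(num_classes, latent_dim))

    def sample(self, labels: torch.Tensor) -> torch.Tensor:
        mu, log_var = self.mu[labels], self.log_var[labels]
        return mu + torch.randn_like(mu) * (0.5 * log_var).exp()

class Decoder(nn.Module):
    """Shared decoder that turns latent samples into synthetic images."""
    def __init__(self, latent_dim: int, channels: int = 3, size: int = 32):
        super().__init__()
        self.shape = (channels, size, size)
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, channels * size * size),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z).view(z.size(0), *self.shape)

# What gets stored is the per-class statistics plus the decoder weights;
# training data for a downstream model is sampled on the fly.
num_classes, latent_dim = 200, 64          # e.g. a TinyImageNet-sized label space
stats, decoder = ClassDistribution(num_classes, latent_dim), Decoder(latent_dim)

labels = torch.randint(0, num_classes, (128,))
synthetic_images = decoder(stats.sample(labels))   # (128, 3, 32, 32)
```

The federated distillation step described in the abstract would partition the label space into subtasks, fit one such statistics/decoder pair per subtask in parallel using sub-task experts, and then re-aggregate them; that orchestration is omitted from this sketch.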
Related papers
- Dataset Distillation via Adversarial Prediction Matching [24.487950991247764]
We propose an adversarial framework to solve the dataset distillation problem efficiently.
Our method can produce synthetic datasets just 10% the size of the original, yet achieve, on average, 94% of the test accuracy of models trained on the full original datasets.
arXiv Detail & Related papers (2023-12-14T13:19:33Z)
- Dataset Distillation in Large Data Era [31.758821805424393]
We show how to distill various large-scale datasets such as full ImageNet-1K/21K under a conventional input resolution of 224$\times$224.
We show that the proposed model beats the current state-of-the-art by more than 4% Top-1 accuracy on ImageNet-1K/21K.
arXiv Detail & Related papers (2023-11-30T18:59:56Z)
- Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality [78.6359306550245]
We argue that using just one synthetic subset for distillation will not yield optimal generalization performance.
PDD synthesizes multiple small sets of synthetic images, each conditioned on the previous sets, and trains the model on the cumulative union of these subsets.
Our experiments show that PDD can effectively improve the performance of existing dataset distillation methods by up to 4.3%.
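A rough sketch of the progressive scheme summarized above, for illustration only: `distill_stage` is a hypothetical placeholder for whatever base distillation method is plugged in (here it merely returns freshly initialized learnable images); the point is the outer loop that conditions each stage on, and trains on, the cumulative union.

```python
# Sketch only: the progressive outer loop. `distill_stage` is a placeholder for
# a real base distillation objective, not part of any actual library.
import torch

def distill_stage(num_images, img_shape, condition_set):
    # A real implementation would optimize these images with a base distillation
    # method, conditioned on the subsets distilled in earlier stages.
    return torch.randn(num_images, *img_shape, requires_grad=True)

def progressive_distill(num_stages=3, images_per_stage=10, img_shape=(3, 32, 32)):
    cumulative = []                                   # union of subsets distilled so far
    for stage in range(num_stages):
        subset = distill_stage(images_per_stage, img_shape, condition_set=cumulative)
        cumulative.append(subset)
        union = torch.cat(cumulative)                 # train on the cumulative union
        print(f"stage {stage}: {union.size(0)} synthetic images available for training")
    return torch.cat(cumulative)

synthetic_set = progressive_distill()
```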
arXiv Detail & Related papers (2023-10-10T20:04:44Z)
- Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z)
- DatasetEquity: Are All Samples Created Equal? In The Quest For Equity Within Datasets [4.833815605196965]
This paper presents a novel method for addressing data imbalance in machine learning.
It computes sample likelihoods based on image appearance using deep perceptual embeddings and clustering.
It then uses these likelihoods to weigh samples differently during training with a proposed $\textbf{Generalized Focal Loss}$ function.
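A hedged sketch of that weighting idea: cluster the perceptual embeddings, treat cluster frequency as a proxy sample likelihood, and down-weight over-represented samples with a focal-style factor. The embedding source, cluster count, and exact weighting are assumptions; the paper's Generalized Focal Loss may differ.

```python
# Sketch only: likelihood-based sample weighting via clustering of embeddings.
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def sample_likelihoods(embeddings: np.ndarray, n_clusters: int = 8) -> np.ndarray:
    """Cluster perceptual embeddings; use cluster frequency as a proxy likelihood."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    freq = np.bincount(labels, minlength=n_clusters) / len(labels)
    return freq[labels]              # common-looking samples get high likelihood

def weighted_focal_ce(logits, targets, likelihoods, gamma: float = 2.0):
    """Down-weight over-represented (high-likelihood) samples, focal-loss style."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    weights = (1.0 - torch.as_tensor(likelihoods, dtype=ce.dtype)) ** gamma
    return (weights * ce).mean()

# Toy usage with random stand-ins for deep perceptual embeddings and a batch.
emb = np.random.randn(256, 128)
lik = sample_likelihoods(emb)
loss = weighted_focal_ce(torch.randn(256, 10), torch.randint(0, 10, (256,)), lik)
```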
arXiv Detail & Related papers (2023-08-19T02:11:49Z)
- Generalizing Dataset Distillation via Deep Generative Prior [75.9031209877651]
We propose to distill an entire dataset's knowledge into a few synthetic images.
The idea is to synthesize a small number of synthetic data points that, when given to a learning algorithm as training data, result in a model approximating one trained on the original data.
We present a new optimization algorithm that distills a large number of images into a few intermediate feature vectors in the generative model's latent space.
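A small sketch of the latent-space parameterization described here: a handful of learnable latent vectors are stored instead of pixels, and a frozen generative prior decodes them into images during distillation. The toy linear "generator" and the placeholder loss stand in for a pretrained generative model and a real distillation objective.

```python
# Sketch only: distill into latent vectors decoded by a frozen generative prior.
# The linear "generator" and the placeholder loss are stand-ins.
import torch
import torch.nn as nn

latent_dim, ipc, num_classes = 128, 2, 10                 # images stored as latents
latents = nn.Parameter(torch.randn(num_classes * ipc, latent_dim))  # the distilled set

generator = nn.Sequential(nn.Linear(latent_dim, 3 * 32 * 32))       # frozen prior (toy)
for p in generator.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam([latents], lr=1e-2)
images = generator(latents).view(-1, 3, 32, 32)           # decoded on demand
loss = images.pow(2).mean()                                # placeholder distillation loss
loss.backward()
opt.step()                                                 # gradients flow into the latents
```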
arXiv Detail & Related papers (2023-05-02T17:59:31Z)
- DiM: Distilling Dataset into Generative Model [42.32433831074992]
We propose a novel distillation scheme to Distill information of large train sets into generative Models, named DiM.
During the distillation phase, we minimize the differences between the logits predicted by a pool of models on real and generated images.
At the deployment stage, the generative model synthesizes various training samples from random noises on the fly.
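A minimal sketch of that logit-matching objective; the tiny generator, the linear classifiers in the pool, and the squared-error logit distance are illustrative assumptions.

```python
# Sketch only: match logits of a model pool between real and generated images,
# then sample training data from the generator at deployment time.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=64, out_dim=3 * 32 * 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))
    def forward(self, z):
        return self.net(z).view(z.size(0), 3, 32, 32)

def make_classifier():
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

gen = Generator()
pool = [make_classifier() for _ in range(3)]        # pool of models used for matching
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)

real = torch.randn(32, 3, 32, 32)                   # stand-in for a real batch
fake = gen(torch.randn(32, 64))
# Distillation phase: minimize the logit gap, averaged over the pool.
loss = sum(((m(real) - m(fake)) ** 2).mean() for m in pool) / len(pool)
opt.zero_grad(); loss.backward(); opt.step()

# Deployment phase: synthesize training samples from random noise on the fly.
samples = gen(torch.randn(128, 64))
```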
arXiv Detail & Related papers (2023-03-08T16:48:24Z)
- Scrape, Cut, Paste and Learn: Automated Dataset Generation Applied to Parcel Logistics [58.720142291102135]
We present a fully automated pipeline to generate a synthetic dataset for instance segmentation in four steps.
We first scrape images for the objects of interest from popular image search engines.
We compare three different methods for image selection: Object-agnostic pre-processing, manual image selection and CNN-based image selection.
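A small sketch of the cut-and-paste step (scraping and image selection are omitted): a foreground cut-out is composited onto a background at a random position, and its alpha mask doubles as the instance-segmentation label. The random arrays stand in for real scraped images.

```python
# Sketch only: composite a cut-out onto a background and derive the instance mask.
import numpy as np
from PIL import Image

rng = np.random.default_rng(0)
background = Image.fromarray(rng.integers(0, 255, (256, 256, 3), dtype=np.uint8))
cutout = Image.fromarray(rng.integers(0, 255, (64, 64, 3), dtype=np.uint8))
alpha = Image.fromarray(np.full((64, 64), 255, dtype=np.uint8))   # object alpha mask

# Paste the cut-out object at a random location; the alpha channel defines both
# the composite image and the ground-truth instance mask.
x, y = rng.integers(0, 256 - 64, size=2)
composite = background.copy()
composite.paste(cutout, (int(x), int(y)), mask=alpha)

instance_mask = np.zeros((256, 256), dtype=np.uint8)
instance_mask[y:y + 64, x:x + 64] = np.array(alpha) > 0           # label 1 for the object
```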
arXiv Detail & Related papers (2022-10-18T12:49:04Z)
- Primitive3D: 3D Object Dataset Synthesis from Randomly Assembled Primitives [44.03149443379618]
We propose a cost-effective method for automatically generating a large number of 3D objects with annotations.
These objects are auto-annotated with part labels originating from primitives.
Considering the large overhead of learning on the generated dataset, we propose a dataset distillation strategy.
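A toy sketch of the random-assembly idea: sample a few primitives as point clouds, place them at random scales and offsets, and keep the primitive index as a free part label. The primitive types, counts, and point-cloud representation are assumptions, not the paper's pipeline.

```python
# Sketch only: assemble random primitives into a labeled 3D point cloud.
import numpy as np

rng = np.random.default_rng(0)

def sample_primitive(kind: str, n: int = 512) -> np.ndarray:
    if kind == "sphere":
        p = rng.normal(size=(n, 3))
        return p / np.linalg.norm(p, axis=1, keepdims=True)     # unit sphere surface
    return rng.uniform(-0.5, 0.5, size=(n, 3))                   # unit box

def assemble_object(num_parts: int = 3):
    points, labels = [], []
    for part_id in range(num_parts):
        pts = sample_primitive(rng.choice(["sphere", "box"]))
        pts = pts * rng.uniform(0.3, 1.0) + rng.uniform(-1.0, 1.0, size=3)  # scale + place
        points.append(pts)
        labels.append(np.full(len(pts), part_id))                # part label comes for free
    return np.concatenate(points), np.concatenate(labels)

cloud, part_labels = assemble_object()      # (1536, 3) points with per-point part labels
```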
arXiv Detail & Related papers (2022-05-25T10:07:07Z)
- DatasetGAN: Efficient Labeled Data Factory with Minimal Human Effort [117.41383937100751]
Current deep networks are extremely data-hungry, benefiting from training on large-scale datasets.
We show how the GAN latent code can be decoded to produce a semantic segmentation of the image.
These generated datasets can then be used for training any computer vision architecture just as real datasets are.
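A hedged sketch of the decoding idea: a small per-pixel classifier (1x1 convolutions) is trained on a handful of annotated examples to map generator feature maps to segmentation labels. Random tensors stand in for the GAN features and the human annotations; the feature dimensionality and head size are assumptions.

```python
# Sketch only: decode GAN feature maps into a semantic segmentation.
import torch
import torch.nn as nn

feat_dim, num_classes, H, W = 256, 5, 64, 64
features = torch.randn(4, feat_dim, H, W)            # stand-in for GAN feature maps
labels = torch.randint(0, num_classes, (4, H, W))     # stand-in for a few annotations

# Per-pixel classifier on top of the generator's features (1x1 convs == pixel MLP).
decoder = nn.Sequential(
    nn.Conv2d(feat_dim, 128, kernel_size=1), nn.ReLU(),
    nn.Conv2d(128, num_classes, kernel_size=1),
)
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
loss = nn.functional.cross_entropy(decoder(features), labels)
opt.zero_grad(); loss.backward(); opt.step()
# Once trained, sampling the GAN and running `decoder` yields image-label pairs
# that can train any downstream segmentation architecture.
```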
arXiv Detail & Related papers (2021-04-13T20:08:29Z)
- How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets.
Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z)