Dataset Distillation in Large Data Era
- URL: http://arxiv.org/abs/2311.18838v1
- Date: Thu, 30 Nov 2023 18:59:56 GMT
- Title: Dataset Distillation in Large Data Era
- Authors: Zeyuan Yin and Zhiqiang Shen
- Abstract summary: We show how to distill various large-scale datasets such as full ImageNet-1K/21K under a conventional input resolution of 224$\times$224.
We show that the proposed model beats the current state-of-the-art by more than 4% Top-1 accuracy on ImageNet-1K/21K.
- Score: 31.758821805424393
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dataset distillation aims to generate a smaller but representative subset
from a large dataset, so that a model can be trained efficiently on it while
still achieving decent performance when evaluated on the original test data
distribution. Many prior works have aimed to align with diverse aspects of the
original datasets, such as matching the training weight trajectories, gradient,
feature/BatchNorm distributions, etc. In this work, we show how to distill
various large-scale datasets such as full ImageNet-1K/21K under a conventional
input resolution of 224$\times$224 to achieve the best accuracy over all
previous approaches, including SRe$^2$L, TESLA and MTT. To achieve this, we
introduce a simple yet effective ${\bf C}$urriculum ${\bf D}$ata ${\bf A}$ugmentation
($\texttt{CDA}$) during data synthesis, which obtains 63.2% accuracy on large-scale
ImageNet-1K under IPC (Images Per Class) 50 and 36.1% on ImageNet-21K under IPC 20.
Finally, we show that, by integrating all
our enhancements together, the proposed model beats the current
state-of-the-art by more than 4% Top-1 accuracy on ImageNet-1K/21K and for the
first time, reduces the gap to its full-data training counterpart to less than
15% in absolute terms. Moreover, this work represents the first success in dataset
distillation on larger-scale ImageNet-21K under the standard 224$\times$224
resolution. Our code and distilled ImageNet-21K dataset of 20 IPC, 2K recovery
budget are available at https://github.com/VILA-Lab/SRe2L/tree/main/CDA.
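The abstract does not spell out the synthesis recipe, so the following is only a minimal PyTorch sketch of what a curriculum data augmentation during data recovery could look like: the minimum scale of RandomResizedCrop is assumed to shrink linearly over the synthesis iterations (easy, near-global crops first, harder local crops later), and the SRe$^2$L-style BatchNorm-statistics matching term is omitted. All schedule values and function names here are hypothetical, not the authors' settings.

```python
import torch
import torchvision.transforms as T

# Hypothetical curriculum: the minimum crop scale shrinks linearly over the
# recovery iterations, moving from easy (near-global) to hard (local) crops.
def cda_min_scale(step, total_steps, start=0.9, end=0.08):
    frac = step / max(total_steps - 1, 1)
    return start + frac * (end - start)

def synthesize(images, labels, model, total_steps=2000, lr=0.1):
    """Sketch of data recovery with curriculum data augmentation.

    `images` are learnable synthetic tensors of shape (N, 3, 224, 224) and
    `model` is a frozen, pretrained network whose classification loss guides
    the optimization; the BatchNorm-matching term used in SRe^2L-style
    recovery is omitted for brevity.
    """
    images = images.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([images], lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    model.eval()

    for step in range(total_steps):
        min_scale = cda_min_scale(step, total_steps)
        augment = T.Compose([
            T.RandomResizedCrop(224, scale=(min_scale, 1.0)),
            T.RandomHorizontalFlip(),
        ])
        views = augment(images)  # same random crop applied batch-wise

        loss = criterion(model(views), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return images.detach()
```

The only change relative to a fixed-augmentation recovery loop is the per-step `min_scale` schedule, which is what makes the augmentation a curriculum.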
Related papers
- Teddy: Efficient Large-Scale Dataset Distillation via Taylor-Approximated Matching [74.75248610868685]
Teddy is a Taylor-approximated dataset distillation framework designed to handle large-scale datasets.
Teddy attains state-of-the-art efficiency and performance on the Tiny-ImageNet and original-sized ImageNet-1K datasets.
arXiv Detail & Related papers (2024-10-10T03:28:46Z)
- Distributional Dataset Distillation with Subtask Decomposition [18.288856447840303]
We show that our method achieves state-of-the-art results on TinyImageNet and ImageNet-1K datasets.
Specifically, we outperform the prior art by 6.9% on ImageNet-1K under the storage budget of 2 images per class.
arXiv Detail & Related papers (2024-03-01T21:49:34Z)
- Effective pruning of web-scale datasets based on complexity of concept clusters [48.125618324485195]
We present a method for pruning large-scale multimodal datasets for training CLIP-style models on ImageNet.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
arXiv Detail & Related papers (2024-01-09T14:32:24Z)
- Dataset Distillation via Adversarial Prediction Matching [24.487950991247764]
We propose an adversarial framework to solve the dataset distillation problem efficiently.
Our method can produce synthetic datasets just 10% the size of the original, yet achieve, on average, 94% of the test accuracy of models trained on the full original datasets.
arXiv Detail & Related papers (2023-12-14T13:19:33Z)
- Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching [24.45182507244476]
Generalized Various Backbone and Statistical Matching (G-VBSM) is the first algorithm to obtain strong performance across both small-scale and large-scale datasets.
G-VBSM achieves a performance of 38.7% on CIFAR-100 with 128-width ConvNet, 47.6% on Tiny-ImageNet with ResNet18, and 31.4% on the full 224x224 ImageNet-1k with ResNet18.
arXiv Detail & Related papers (2023-11-29T06:25:59Z)
- Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale From A New Perspective [27.650434284271363]
Under 50 IPC, our approach achieves the highest validation accuracy of 42.5% and 60.8% on the Tiny-ImageNet and ImageNet-1K datasets, respectively.
Our approach also surpasses MTT in speed by approximately 52$\times$ (ConvNet-4) and 16$\times$ (ResNet-18), with 11.6$\times$ and 6.4$\times$ less memory consumption during data synthesis.
arXiv Detail & Related papers (2023-06-22T17:59:58Z)
- Large-scale Dataset Pruning with Dynamic Uncertainty [28.60845105174658]
The state of the art of many learning tasks, e.g., image classification, is advanced by collecting larger datasets and then training larger models on them.
In this paper, we investigate how to prune large-scale datasets to produce an informative subset for training sophisticated deep models with a negligible performance drop.
arXiv Detail & Related papers (2023-06-08T13:14:35Z)
- Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement [68.44100784364987]
We propose a strategy to improve a dataset once such that the accuracy of any model architecture trained on the reinforced dataset is improved at no additional training cost for users.
We create a reinforced version of the ImageNet training dataset, called ImageNet+, as well as reinforced datasets CIFAR-100+, Flowers-102+, and Food-101+.
Models trained with ImageNet+ are more accurate, robust, and calibrated, and transfer well to downstream tasks.
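The summary does not describe how the reinforcement is materialized; a common realization, assumed here only for illustration, is to store precomputed teacher soft labels next to each training image so that any student architecture can be trained against them without running a teacher. A hypothetical sketch:

```python
import torch
import torch.nn.functional as F

def reinforced_training_step(student, images, soft_labels, optimizer, temperature=1.0):
    """Hypothetical training step on a 'reinforced' dataset.

    Assumes each sample carries precomputed teacher probabilities
    (`soft_labels`, shape (B, num_classes)) stored alongside the image, so no
    teacher forward pass is needed while training the student.
    """
    log_probs = F.log_softmax(student(images) / temperature, dim=-1)
    # KL divergence between stored teacher probabilities and student predictions.
    loss = F.kl_div(log_probs, soft_labels, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```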
arXiv Detail & Related papers (2023-03-15T23:10:17Z)
- Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory [66.035487142452]
We show that trajectory-matching-based methods (MTT) can scale to large-scale datasets such as ImageNet-1K.
We propose a procedure to exactly compute the unrolled gradient with constant memory complexity, which allows us to scale MTT to ImageNet-1K seamlessly with a 6x reduction in memory footprint.
The resulting algorithm sets a new SOTA on ImageNet-1K: we can scale up to 50 IPC (Images Per Class) on ImageNet-1K on a single GPU.
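For context, the trajectory-matching (MTT) objective that this paper scales can be sketched as below. This is a simplified, hypothetical illustration with a short naive unroll (which stores every intermediate state), not the constant-memory gradient computation the paper derives; alignment between `expert_start`/`expert_end` and the student's `named_parameters()` order is assumed.

```python
import torch
import torch.nn.functional as F

def trajectory_matching_loss(student_net, syn_images, syn_labels,
                             expert_start, expert_end,
                             inner_steps=10, inner_lr=0.01):
    """Simplified MTT objective: start the student at an early expert
    checkpoint, train it for a few SGD steps on the synthetic data, and
    penalize the distance to a later expert checkpoint (normalized by how far
    the expert itself moved). `expert_start` / `expert_end` are lists of
    parameter tensors from an expert trajectory trained on real data.
    """
    names = [n for n, _ in student_net.named_parameters()]
    flat = lambda ps: torch.cat([p.reshape(-1) for p in ps])

    params = [p.clone().requires_grad_(True) for p in expert_start]
    for _ in range(inner_steps):
        logits = torch.func.functional_call(student_net, dict(zip(names, params)),
                                            (syn_images,))
        loss = F.cross_entropy(logits, syn_labels)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        params = [p - inner_lr * g for p, g in zip(params, grads)]

    num = (flat(params) - flat(expert_end)).pow(2).sum()
    den = (flat(expert_start) - flat(expert_end)).pow(2).sum().clamp_min(1e-12)
    return num / den  # differentiable w.r.t. syn_images via the unrolled steps
```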
arXiv Detail & Related papers (2022-11-19T04:46:03Z)
- BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers [117.79456335844439]
We propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction.
We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches.
Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods.
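As a rough illustration of the masked-prediction setup described above (not BEiT v2's actual implementation), the pretraining loss can be written as a cross-entropy between the Transformer's predictions at masked patch positions and the discrete token ids produced by the visual tokenizer; `tokenizer` and `vit` below are hypothetical placeholder modules.

```python
import torch
import torch.nn.functional as F

def masked_image_modeling_loss(vit, tokenizer, images, mask_ratio=0.4):
    """Sketch of masked image modeling against a visual tokenizer.

    Assumes `tokenizer(images)` returns discrete token ids of shape
    (B, num_patches) and `vit(images, mask)` returns logits over the tokenizer
    vocabulary of shape (B, num_patches, vocab_size), with masked patches
    replaced internally by a learnable [MASK] embedding.
    """
    with torch.no_grad():
        target_ids = tokenizer(images)                           # (B, N)

    B, N = target_ids.shape
    mask = torch.rand(B, N, device=images.device) < mask_ratio   # random patch mask

    logits = vit(images, mask)                                   # (B, N, vocab_size)

    # Predict the original visual tokens only at the masked positions.
    return F.cross_entropy(logits[mask], target_ids[mask])
```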
arXiv Detail & Related papers (2022-08-12T16:48:10Z)
- MEAL V2: Boosting Vanilla ResNet-50 to 80%+ Top-1 Accuracy on ImageNet without Tricks [57.69809561405253]
We introduce a framework that is able to boost the vanilla ResNet-50 to 80%+ Top-1 accuracy on ImageNet without tricks.
Our method obtains 80.67% top-1 accuracy on ImageNet using a single crop-size of 224x224 with vanilla ResNet-50.
Our framework also consistently improves the smaller ResNet-18 from 69.76% to 73.19%.
arXiv Detail & Related papers (2020-09-17T17:59:33Z)