Dataset Quantization
- URL: http://arxiv.org/abs/2308.10524v1
- Date: Mon, 21 Aug 2023 07:24:29 GMT
- Title: Dataset Quantization
- Authors: Daquan Zhou, Kai Wang, Jianyang Gu, Xiangyu Peng, Dongze Lian, Yifan
Zhang, Yang You, Jiashi Feng
- Abstract summary: We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
- Score: 72.61936019738076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art deep neural networks are trained on large amounts
of data (millions or even billions of samples). The expensive computation and memory costs
make it difficult to train them on limited hardware resources, especially for
recent large language models (LLMs) and computer vision (CV) models.
Dataset distillation methods have therefore been developed to
reduce the number of training samples by synthesizing small-scale datasets via
gradient matching. However, as the gradient calculation is coupled with the
specific network architecture, the synthesized dataset is biased and performs
poorly when used for training unseen architectures. To address these
limitations, we present dataset quantization (DQ), a new framework to compress
large-scale datasets into small subsets that can be used for training any
neural network architecture. Extensive experiments demonstrate that DQ is able
to generate condensed small datasets for training unseen network architectures
with state-of-the-art compression ratios for lossless model training. To the
best of our knowledge, DQ is the first method that can successfully distill
large-scale datasets such as ImageNet-1k with a state-of-the-art compression
ratio. Notably, with 60% of the data from ImageNet and 20% of Alpaca's
instruction tuning data, the models can be trained with negligible or no
performance drop for both vision tasks (including classification, semantic
segmentation, and object detection) as well as language tasks (including
instruction tuning tasks such as BBH and DROP).
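The abstract's point that gradient matching is tied to a specific network can be made concrete with a small sketch. The example below is only an illustration under stated assumptions: it uses a linear model with mean squared error and a cosine distance between gradients, and the function names `linear_mse_grad` and `gradient_matching_distance` are hypothetical. It is not the DQ method, nor the exact objective of any distillation paper.

```python
import numpy as np


def linear_mse_grad(w, x, y):
    """Gradient of 0.5 * mean((x @ w - y) ** 2) with respect to w.

    The linear model is an assumption made for illustration only.
    """
    residual = x @ w - y
    return x.T @ residual / len(x)


def gradient_matching_distance(w, real, synthetic):
    """Cosine distance between gradients on a real and a synthetic batch.

    Both gradients are computed with the same model parameters w, which is
    why the resulting synthetic data depends on the chosen architecture.
    """
    g_real = linear_mse_grad(w, *real)
    g_syn = linear_mse_grad(w, *synthetic)
    denom = np.linalg.norm(g_real) * np.linalg.norm(g_syn) + 1e-12
    return 1.0 - float(g_real @ g_syn) / denom


# Toy data: a large "real" batch and a small candidate "synthetic" batch.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
x_real = rng.normal(size=(1000, 3))
y_real = x_real @ true_w
x_syn = rng.normal(size=(10, 3))
y_syn = x_syn @ true_w

w = np.zeros(3)  # current model parameters
print(gradient_matching_distance(w, (x_real, y_real), (x_syn, y_syn)))
```

A distillation method of this kind would adjust the synthetic samples to shrink this distance. Because the gradient is defined by the model being trained, a different architecture yields a different objective, which is the coupling the abstract describes.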
Related papers
- LiteNeXt: A Novel Lightweight ConvMixer-based Model with Self-embedding Representation Parallel for Medical Image Segmentation [2.0901574458380403]
We propose a new lightweight but efficient model, namely LiteNeXt, for medical image segmentation.
LiteNeXt is trained from scratch with a small number of parameters (0.71M) and Giga Floating Point Operations Per Second (0.42).
arXiv Detail & Related papers (2024-04-04T01:59:19Z) - Effective pruning of web-scale datasets based on complexity of concept
clusters [48.125618324485195]
We present a method for pruning large-scale multimodal datasets for training CLIP-style models on ImageNet.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
arXiv Detail & Related papers (2024-01-09T14:32:24Z) - Data Filtering Networks [67.827994353269]
We study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset.
Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks.
Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets.
arXiv Detail & Related papers (2023-09-29T17:37:29Z) - Few-Shot Non-Parametric Learning with Deep Latent Variable Model [50.746273235463754]
We propose Non-Parametric learning by Compression with Latent Variables (NPC-LV).
NPC-LV is a learning framework for any dataset with abundant unlabeled data but very few labeled ones.
We show that NPC-LV outperforms supervised methods on all three image classification datasets in the low-data regime.
arXiv Detail & Related papers (2022-06-23T09:35:03Z) - Efficient deep learning models for land cover image classification [0.29748898344267777]
This work experiments with the BigEarthNet dataset for land use land cover (LULC) image classification.
We benchmark different state-of-the-art models, including Convolutional Neural Networks, Multi-Layer Perceptrons, Visual Transformers, EfficientNets and Wide Residual Networks (WRN).
Our proposed lightweight model has an order of magnitude fewer trainable parameters, achieves 4.5% higher averaged f-score classification accuracy for all 19 LULC classes and is trained two times faster than the state-of-the-art ResNet50 model that we use as a baseline.
arXiv Detail & Related papers (2021-11-18T00:03:14Z) - Dataset Meta-Learning from Kernel Ridge-Regression [18.253682891579402]
Kernel Inducing Points (KIP) can compress datasets by one or two orders of magnitude.
KIP-learned datasets are transferable to the training of finite-width neural networks even beyond the lazy-training regime.
arXiv Detail & Related papers (2020-10-30T18:54:04Z) - Dataset Condensation with Gradient Matching [36.14340188365505]
We propose a training set synthesis technique for data-efficient learning, called Dataset Condensation, that learns to condense a large dataset into a small set of informative synthetic samples for training deep neural networks from scratch.
We rigorously evaluate its performance in several computer vision benchmarks and demonstrate that it significantly outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2020-06-10T16:30:52Z) - Large-Scale Gradient-Free Deep Learning with Recursive Local
Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neuro-biologically-plausible alternative to backprop that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z) - Neural Data Server: A Large-Scale Search Engine for Transfer Learning
Data [78.74367441804183]
We introduce Neural Data Server (NDS), a large-scale search engine for finding the most useful transfer learning data to the target domain.
NDS consists of a dataserver which indexes several large popular image datasets, and aims to recommend data to a client.
We show the effectiveness of NDS in various transfer learning scenarios, demonstrating state-of-the-art performance on several target datasets.
arXiv Detail & Related papers (2020-01-09T01:21:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.