Effective pruning of web-scale datasets based on complexity of concept
clusters
- URL: http://arxiv.org/abs/2401.04578v2
- Date: Tue, 12 Mar 2024 10:35:56 GMT
- Title: Effective pruning of web-scale datasets based on complexity of concept
clusters
- Authors: Amro Abbas, Evgenia Rusak, Kushal Tirumala, Wieland Brendel, Kamalika
Chaudhuri, Ari S. Morcos
- Abstract summary: We present a method for pruning large-scale multimodal datasets used to train CLIP-style models.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
- Score: 48.125618324485195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Utilizing massive web-scale datasets has led to unprecedented performance
gains in machine learning models, but also imposes outlandish compute
requirements for their training. In order to improve training and data
efficiency, we here push the limits of pruning large-scale multimodal datasets
for training CLIP-style models. Today's most effective pruning method on
ImageNet clusters data samples into separate concepts according to their
embedding and prunes away the most prototypical samples. We scale this approach
to LAION and improve it by noting that the pruning rate should be
concept-specific and adapted to the complexity of the concept. Using a simple
and intuitive complexity measure, we are able to reduce the training cost to a
quarter of regular training. By filtering from the LAION dataset, we find that
training on a smaller set of high-quality data can lead to higher performance
with significantly lower training costs. More specifically, we are able to
outperform the LAION-trained OpenCLIP-ViT-B32 model on ImageNet zero-shot
accuracy by 1.1 p.p. while only using 27.7% of the data and training compute.
Despite a strong reduction in training cost, we also see improvements on
ImageNet distribution shifts, retrieval tasks, and VTAB. On the DataComp Medium
benchmark, we achieve a new state-of-the-art ImageNet zero-shot accuracy and a
competitive average zero-shot accuracy on 38 evaluation tasks.
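As a rough illustration of the pipeline described above (cluster the embedding space into concepts, score each concept's complexity, and prune the most prototypical samples at a concept-specific rate), a minimal Python sketch might look as follows. This is not the authors' released implementation: the k-means clustering backend, the mean-distance-to-centroid complexity proxy, the proportional keep-budget allocation, and the function name complexity_adaptive_prune are all assumptions standing in for the paper's actual components.

```python
import numpy as np
from sklearn.cluster import KMeans

def complexity_adaptive_prune(embeddings, n_clusters=500, keep_fraction=0.28, seed=0):
    """Illustrative sketch: cluster embeddings into concepts, score each
    concept's complexity, and prune the most prototypical samples at a
    concept-specific rate. Not the paper's actual implementation."""
    # 1) Cluster the (L2-normalized) embeddings into "concepts".
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(embeddings)
    labels, centroids = km.labels_, km.cluster_centers_

    # 2) Hypothetical complexity proxy: mean member-to-centroid distance.
    #    (The paper uses its own simple complexity measure; this stands in for it.)
    complexity = np.array([
        np.linalg.norm(embeddings[labels == c] - centroids[c], axis=1).mean()
        if np.any(labels == c) else 0.0
        for c in range(n_clusters)
    ])

    # 3) Allocate concept-specific keep budgets proportional to complexity,
    #    so more complex concepts are pruned less aggressively.
    budget = keep_fraction * len(embeddings) * complexity / complexity.sum()

    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        # 4) Within a concept, rank samples by prototypicality (distance to the
        #    centroid) and keep the least prototypical ones, pruning the rest.
        dist = np.linalg.norm(embeddings[idx] - centroids[c], axis=1)
        n_keep = int(np.clip(round(budget[c]), 1, len(idx)))
        keep.extend(idx[np.argsort(dist)[::-1][:n_keep]])
    return np.sort(np.array(keep))
```

In such a setup, the embeddings would come from a pretrained CLIP image encoder applied to the LAION samples, and the returned indices would define the pruned training subset.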
Related papers
- Exploring Learning Complexity for Efficient Downstream Dataset Pruning [8.990878450631596]
Existing dataset pruning methods require training on the entire dataset.
We propose a straightforward, novel, and training-free hardness score named Distorting-based Learning Complexity (DLC).
Our method is motivated by the observation that easy samples, which are learned faster, can also be learned with fewer parameters.
arXiv Detail & Related papers (2024-02-08T02:29:33Z)
- A Simple and Efficient Baseline for Data Attribution on Images [107.12337511216228]
Current state-of-the-art approaches require a large ensemble of as many as 300,000 models to accurately attribute model predictions.
In this work, we focus on a minimalist baseline, utilizing the feature space of a backbone pretrained via self-supervised learning to perform data attribution.
Our method is model-agnostic and scales easily to large datasets; a rough sketch of this feature-similarity idea appears after this list.
arXiv Detail & Related papers (2023-11-03T17:29:46Z)
- Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z)
- Large-scale Dataset Pruning with Dynamic Uncertainty [28.60845105174658]
The state of the art of many learning tasks, e.g., image classification, is advanced by collecting larger datasets and then training larger models on them.
In this paper, we investigate how to prune the large-scale datasets, and thus produce an informative subset for training sophisticated deep models with negligible performance drop.
arXiv Detail & Related papers (2023-06-08T13:14:35Z)
- Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models.
Our method allows for effortless integration with existing models' training pipelines.
On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
- Efficient Augmentation for Imbalanced Deep Learning [8.38844520504124]
We study a convolutional neural network's internal representation of imbalanced image data.
We measure the generalization gap between a model's feature embeddings in the training and test sets, showing that the gap is wider for minority classes.
This insight enables us to design an efficient three-phase CNN training framework for imbalanced data.
arXiv Detail & Related papers (2022-07-13T09:43:17Z)
- Are Large-scale Datasets Necessary for Self-Supervised Pre-training? [29.49873710927313]
We consider a self-supervised pre-training scenario that only leverages the target task data.
Our study shows that denoising autoencoders, such as BEiT, are more robust to the type and size of the pre-training data.
On COCO, when pre-training solely using COCO images, the detection and instance segmentation performance surpasses the supervised ImageNet pre-training in a comparable setting.
arXiv Detail & Related papers (2021-12-20T18:41:32Z)
- EfficientNetV2: Smaller Models and Faster Training [91.77432224225221]
This paper introduces EfficientNetV2, a new family of convolutional networks that have faster training speed and better parameter efficiency than previous models.
We use a combination of training-aware neural architecture search and scaling, to jointly optimize training speed and parameter efficiency.
Our experiments show that EfficientNetV2 models train much faster than state-of-the-art models while being up to 6.8x smaller.
arXiv Detail & Related papers (2021-04-01T07:08:36Z)
- Efficient Conditional Pre-training for Transfer Learning [71.01129334495553]
We propose efficient filtering methods to select relevant subsets from the pre-training dataset.
We validate our techniques by pre-training on ImageNet in both the unsupervised and supervised settings.
We improve standard ImageNet pre-training by 1-3% by tuning available models on our subsets and by pre-training on a dataset filtered from a larger-scale dataset.
arXiv Detail & Related papers (2020-11-20T06:16:15Z)
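The data attribution baseline listed above ("A Simple and Efficient Baseline for Data Attribution on Images") reuses the feature space of a pretrained backbone; a hedged sketch of that general idea is given below. The cosine-similarity ranking, the top_k parameter, and the function name attribute_by_feature_similarity are illustrative assumptions, not the paper's exact attribution rule.

```python
import numpy as np

def attribute_by_feature_similarity(train_feats, query_feat, top_k=10):
    """Rank training examples by cosine similarity to a query image in the
    feature space of a pretrained (e.g. self-supervised) backbone.
    A simplified stand-in for the attribution baseline, not its exact method."""
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    query = query_feat / np.linalg.norm(query_feat)
    scores = train @ query                      # cosine similarities to the query
    order = np.argsort(scores)[::-1][:top_k]    # most similar training examples
    return order, scores[order]
```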