The Lottery Tickets Hypothesis for Supervised and Self-supervised
Pre-training in Computer Vision Models
- URL: http://arxiv.org/abs/2012.06908v2
- Date: Mon, 29 Mar 2021 18:13:06 GMT
- Title: The Lottery Tickets Hypothesis for Supervised and Self-supervised
Pre-training in Computer Vision Models
- Authors: Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang,
Michael Carbin, Zhangyang Wang
- Abstract summary: Pre-trained weights often boost a wide range of downstream tasks including classification, detection, and segmentation.
Recent studies suggest that pre-training benefits from gigantic model capacity.
In this paper, we examine supervised and self-supervised pre-trained models through the lens of the lottery ticket hypothesis (LTH).
- Score: 115.49214555402567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The computer vision world has been re-gaining enthusiasm in various
pre-trained models, including both classical ImageNet supervised pre-training
and recently emerged self-supervised pre-training such as SimCLR and MoCo.
Pre-trained weights often boost a wide range of downstream tasks including
classification, detection, and segmentation. Recent studies suggest that
pre-training benefits from gigantic model capacity. We therefore ask: after
pre-training, does a pre-trained model indeed have to stay large for
its downstream transferability?
In this paper, we examine supervised and self-supervised pre-trained models
through the lens of the lottery ticket hypothesis (LTH). LTH identifies highly
sparse matching subnetworks that can be trained in isolation from (nearly)
scratch yet still reach the full models' performance. We extend the scope of
LTH and ask whether matching subnetworks that enjoy the same downstream
transfer performance still exist in pre-trained computer vision models.
Our extensive experiments convey an overall positive message: from all
pre-trained weights obtained by ImageNet classification, SimCLR, and MoCo, we
are consistently able to locate such matching subnetworks at 59.04% to 96.48%
sparsity that transfer universally to multiple downstream tasks, with no
performance degradation compared to using the full pre-trained weights.
Further analyses reveal that subnetworks found from different pre-training
schemes tend to yield diverse mask structures and perturbation sensitivities. We conclude
that the core LTH observations remain generally relevant in the pre-training
paradigm of computer vision, but more delicate discussions are needed in some
cases. Code and pre-trained models will be made available at:
https://github.com/VITA-Group/CV_LTH_Pre-training.
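For readers unfamiliar with how such matching subnetworks are located, the sketch below illustrates iterative magnitude pruning (IMP) with rewinding to the pre-trained weights. It is a minimal illustration, not the authors' released pipeline (see the repository above); the helper names (`prunable`, `imp_find_subnetwork`, `finetune_fn`), the ResNet-50 usage comment, and the 20%-per-round pruning rate are assumptions, though that rate is at least consistent with the reported range, since 1 - 0.8^4 = 59.04% and 1 - 0.8^15 = 96.48%.

```python
# Minimal sketch of IMP with weight rewinding on a pre-trained vision model.
# Illustrative only; helper names and the 20%-per-round rate are assumptions.
import copy
import torch
import torch.nn as nn


def prunable(model: nn.Module):
    """Yield the weight tensors usually pruned in LTH-style experiments."""
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            yield module.weight


def imp_find_subnetwork(pretrained: nn.Module, finetune_fn,
                        rounds: int = 4, rate: float = 0.2):
    """IMP with rewinding: after r rounds at 20% per round, overall sparsity
    is 1 - 0.8**r (59.04% at r = 4, 96.48% at r = 15)."""
    theta_pre = copy.deepcopy(pretrained.state_dict())   # rewind target
    model = copy.deepcopy(pretrained)
    masks = [torch.ones_like(w) for w in prunable(model)]
    for _ in range(rounds):
        # Downstream training; assumed to keep already-pruned weights at zero
        # (e.g. by masking gradients), as in standard LTH pipelines.
        finetune_fn(model)
        # Prune `rate` of the *remaining* weights, ranked by trained magnitude.
        remaining = torch.cat([w.detach().abs()[m.bool()]
                               for w, m in zip(prunable(model), masks)])
        k = max(int(rate * remaining.numel()), 1)
        threshold = torch.kthvalue(remaining, k).values
        masks = [m * (w.detach().abs() > threshold).float()
                 for w, m in zip(prunable(model), masks)]
        # Rewind surviving weights to their pre-trained values and re-mask.
        model.load_state_dict(theta_pre)
        with torch.no_grad():
            for w, m in zip(prunable(model), masks):
                w.mul_(m)
    return model, masks   # sparse "matching subnetwork" candidate + its mask


# Hypothetical usage with an ImageNet-pretrained backbone (SimCLR / MoCo
# checkpoints can be loaded into the same architecture):
#   import torchvision
#   backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
#   ticket, masks = imp_find_subnetwork(backbone, finetune_fn=my_downstream_trainer)
```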
Related papers
- An Empirical Analysis of Forgetting in Pre-trained Models with Incremental Low-Rank Updates [11.90029443742706]
We study the impact of Low-Rank Adaptation (LoRA) rank on the forgetting of the pretraining foundation task and on the plasticity and forgetting of subsequent ones.
We also observe that vision transformers finetuned in that way exhibit a sort of "contextual" forgetting, a behaviour that we do not observe for residual networks. (A brief LoRA sketch is included after the related-papers list below.)
arXiv Detail & Related papers (2024-05-28T11:29:25Z) - Battle of the Backbones: A Large-Scale Comparison of Pretrained Models
across Computer Vision Tasks [139.3768582233067]
Battle of the Backbones (BoB) is a benchmarking tool for neural-network-based computer vision systems.
We find that vision transformers (ViTs) and self-supervised learning (SSL) are increasingly popular.
In apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets, we find that SSL backbones are highly competitive.
arXiv Detail & Related papers (2023-10-30T18:23:58Z) - RanPAC: Random Projections and Pre-trained Models for Continual Learning [59.07316955610658]
Continual learning (CL) aims to learn different tasks (such as classification) in a non-stationary data stream without forgetting old ones.
We propose a concise and effective approach for CL with pre-trained models.
arXiv Detail & Related papers (2023-07-05T12:49:02Z) - Revisiting Weakly Supervised Pre-Training of Visual Perception Models [27.95816470075203]
Large-scale weakly supervised pre-training can outperform fully supervised approaches.
This paper revisits weakly-supervised pre-training of models using hashtag supervision.
Our results provide a compelling argument for the use of weakly supervised learning in the development of visual recognition systems.
arXiv Detail & Related papers (2022-01-20T18:55:06Z) - Self-Supervised Pretraining Improves Self-Supervised Pretraining [83.1423204498361]
Self-supervised pretraining requires expensive and lengthy computation, large amounts of data, and is sensitive to data augmentation.
This paper explores Hierarchical PreTraining (HPT), which decreases convergence time and improves accuracy by initializing the pretraining process with an existing pretrained model.
We show HPT converges up to 80x faster, improves accuracy across tasks, and improves the robustness of the self-supervised pretraining process to changes in the image augmentation policy or amount of pretraining data.
arXiv Detail & Related papers (2021-03-23T17:37:51Z) - Unsupervised Vision-and-Language Pre-training Without Parallel Images
and Captions [92.47566804182338]
We investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora.
In particular, we propose to conduct "mask-and-predict" pre-training on text-only and image-only corpora.
We find that such a simple approach performs close to a model pre-trained with aligned data on four English V&L benchmarks.
arXiv Detail & Related papers (2020-10-24T08:17:54Z) - Deep Ensembles for Low-Data Transfer Learning [21.578470914935938]
We study different ways of creating ensembles from pre-trained models.
We show that the nature of pre-training itself is an effective source of diversity.
We propose a practical algorithm that efficiently identifies a subset of pre-trained models for any downstream dataset.
arXiv Detail & Related papers (2020-10-14T07:59:00Z) - The Lottery Ticket Hypothesis for Pre-trained BERT Networks [137.99328302234338]
In natural language processing (NLP), enormous pre-trained models like BERT have become the standard starting point for training.
In parallel, work on the lottery ticket hypothesis has shown that models for NLP and computer vision contain smaller matching subnetworks capable of training in isolation to full accuracy.
We combine these observations to assess whether such trainable, transferable subnetworks exist in pre-trained BERT models.
arXiv Detail & Related papers (2020-07-23T19:35:39Z)
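On the first related paper above: the "low-rank updates" it studies are LoRA-style adapters, in which a frozen pre-trained weight W is adjusted by a trainable product (alpha / r) * B A of rank r. The sketch below is a minimal illustration under that reading, not code from that paper; the class name, initialization scale, and the choice to freeze the bias are assumptions.

```python
# Minimal LoRA-style linear layer (illustrative sketch): the frozen weight is
# adapted by a low-rank update (alpha / r) * B @ A, so only r * (in + out)
# extra parameters are trained.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int,
                 r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # frozen pre-trained weight
        self.base.bias.requires_grad_(False)     # assumption: bias frozen too
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)  # rank-r factor
        self.B = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())


# Hypothetical usage: replace an nn.Linear in a pre-trained model and train
# only A and B on the new task; varying r trades plasticity against forgetting.
```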