The Lottery Tickets Hypothesis for Supervised and Self-supervised
Pre-training in Computer Vision Models
- URL: http://arxiv.org/abs/2012.06908v2
- Date: Mon, 29 Mar 2021 18:13:06 GMT
- Title: The Lottery Tickets Hypothesis for Supervised and Self-supervised
Pre-training in Computer Vision Models
- Authors: Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang,
Michael Carbin, Zhangyang Wang
- Abstract summary: Pre-trained weights often boost a wide range of downstream tasks including classification, detection, and segmentation.
Recent studies suggest that pre-training benefits from gigantic model capacity.
In this paper, we examine supervised and self-supervised pre-trained models through the lens of the lottery ticket hypothesis (LTH).
- Score: 115.49214555402567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The computer vision world has been re-gaining enthusiasm in various
pre-trained models, including both classical ImageNet supervised pre-training
and recently emerged self-supervised pre-training such as SimCLR and MoCo.
Pre-trained weights often boost a wide range of downstream tasks including
classification, detection, and segmentation. Recent studies suggest that
pre-training benefits from gigantic model capacity. We therefore ask: after
pre-training, does a pre-trained model indeed have to stay large for
its downstream transferability?
In this paper, we examine supervised and self-supervised pre-trained models
through the lens of the lottery ticket hypothesis (LTH). LTH identifies highly
sparse matching subnetworks that can be trained in isolation from (nearly)
scratch yet still reach the full models' performance. We extend the scope of
LTH and ask whether matching subnetworks that enjoy the same downstream
transfer performance still exist in pre-trained computer vision models.
Our extensive experiments convey an overall positive message: from all
pre-trained weights obtained by ImageNet classification, SimCLR, and MoCo, we
are consistently able to locate such matching subnetworks at 59.04% to 96.48%
sparsity that transfer universally to multiple downstream tasks, with no
performance degradation compared to using the full pre-trained weights.
Further analyses reveal that subnetworks found from different pre-training
schemes tend to yield diverse mask structures and perturbation sensitivities. We conclude
that the core LTH observations remain generally relevant in the pre-training
paradigm of computer vision, but more delicate discussions are needed in some
cases. Code and pre-trained models will be made available at:
https://github.com/VITA-Group/CV_LTH_Pre-training.
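For readers unfamiliar with how such matching subnetworks are located, the sketch below illustrates iterative magnitude pruning (IMP) with rewinding to the pre-trained weights. It is a minimal illustration, not the authors' released pipeline (see the repository above); the helper names (`prunable`, `imp_find_subnetwork`, `finetune_fn`), the ResNet-50 usage comment, and the 20%-per-round pruning rate are assumptions, though that rate is at least consistent with the reported range, since 1 - 0.8^4 = 59.04% and 1 - 0.8^15 = 96.48%.

```python
# Minimal sketch of IMP with weight rewinding on a pre-trained vision model.
# Illustrative only; helper names and the 20%-per-round rate are assumptions.
import copy
import torch
import torch.nn as nn


def prunable(model: nn.Module):
    """Yield the weight tensors usually pruned in LTH-style experiments."""
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            yield module.weight


def imp_find_subnetwork(pretrained: nn.Module, finetune_fn,
                        rounds: int = 4, rate: float = 0.2):
    """IMP with rewinding: after r rounds at 20% per round, overall sparsity
    is 1 - 0.8**r (59.04% at r = 4, 96.48% at r = 15)."""
    theta_pre = copy.deepcopy(pretrained.state_dict())   # rewind target
    model = copy.deepcopy(pretrained)
    masks = [torch.ones_like(w) for w in prunable(model)]
    for _ in range(rounds):
        # Downstream training; assumed to keep already-pruned weights at zero
        # (e.g. by masking gradients), as in standard LTH pipelines.
        finetune_fn(model)
        # Prune `rate` of the *remaining* weights, ranked by trained magnitude.
        remaining = torch.cat([w.detach().abs()[m.bool()]
                               for w, m in zip(prunable(model), masks)])
        k = max(int(rate * remaining.numel()), 1)
        threshold = torch.kthvalue(remaining, k).values
        masks = [m * (w.detach().abs() > threshold).float()
                 for w, m in zip(prunable(model), masks)]
        # Rewind surviving weights to their pre-trained values and re-mask.
        model.load_state_dict(theta_pre)
        with torch.no_grad():
            for w, m in zip(prunable(model), masks):
                w.mul_(m)
    return model, masks   # sparse "matching subnetwork" candidate + its mask


# Hypothetical usage with an ImageNet-pretrained backbone (SimCLR / MoCo
# checkpoints can be loaded into the same architecture):
#   import torchvision
#   backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
#   ticket, masks = imp_find_subnetwork(backbone, finetune_fn=my_downstream_trainer)
```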
Related papers
- An Empirical Analysis of Forgetting in Pre-trained Models with Incremental Low-Rank Updates [11.90029443742706]
We study the impact of Low-Rank Adaptation (LoRA) rank on the forgetting of the pretraining foundation task and on the plasticity and forgetting of subsequent ones.
We also observe that vision transformers finetuned in that way exhibit a sort of "contextual" forgetting, a behaviour that we do not observe for residual networks. (A brief LoRA sketch is included after the related-papers list below.)
arXiv Detail & Related papers (2024-05-28T11:29:25Z) - Battle of the Backbones: A Large-Scale Comparison of Pretrained Models
across Computer Vision Tasks [139.3768582233067]
Battle of the Backbones (BoB) is a benchmarking tool for neural-network-based computer vision systems.
We find that vision transformers (ViTs) and self-supervised learning (SSL) are increasingly popular.
In apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets, we find that SSL backbones are highly competitive.
arXiv Detail & Related papers (2023-10-30T18:23:58Z) - RanPAC: Random Projections and Pre-trained Models for Continual Learning [59.07316955610658]
Continual learning (CL) aims to learn different tasks (such as classification) in a non-stationary data stream without forgetting old ones.
We propose a concise and effective approach for CL with pre-trained models.
arXiv Detail & Related papers (2023-07-05T12:49:02Z) - Revisiting Weakly Supervised Pre-Training of Visual Perception Models [27.95816470075203]
Large-scale weakly supervised pre-training can outperform fully supervised approaches.
This paper revisits weakly-supervised pre-training of models using hashtag supervision.
Our results provide a compelling argument for the use of weakly supervised learning in the development of visual recognition systems.
arXiv Detail & Related papers (2022-01-20T18:55:06Z) - Self-Supervised Pretraining Improves Self-Supervised Pretraining [83.1423204498361]
Self-supervised pretraining requires expensive and lengthy computation, large amounts of data, and is sensitive to data augmentation.
This paper explores Hierarchical PreTraining (HPT), which decreases convergence time and improves accuracy by initializing the pretraining process with an existing pretrained model.
We show HPT converges up to 80x faster, improves accuracy across tasks, and improves the robustness of the self-supervised pretraining process to changes in the image augmentation policy or amount of pretraining data.
arXiv Detail & Related papers (2021-03-23T17:37:51Z) - Unsupervised Vision-and-Language Pre-training Without Parallel Images
and Captions [92.47566804182338]
We investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora.
In particular, we propose to conduct "mask-and-predict" pre-training on text-only and image-only corpora.
We find that such a simple approach performs close to a model pre-trained with aligned data on four English V&L benchmarks.
arXiv Detail & Related papers (2020-10-24T08:17:54Z) - Deep Ensembles for Low-Data Transfer Learning [21.578470914935938]
We study different ways of creating ensembles from pre-trained models.
We show that the nature of pre-training itself is an effective source of diversity.
We propose a practical algorithm that efficiently identifies a subset of pre-trained models for any downstream dataset.
arXiv Detail & Related papers (2020-10-14T07:59:00Z) - The Lottery Ticket Hypothesis for Pre-trained BERT Networks [137.99328302234338]
In natural language processing (NLP), enormous pre-trained models like BERT have become the standard starting point for training.
In parallel, work on the lottery ticket hypothesis has shown that models for NLP and computer vision contain smaller matching subnetworks capable of training in isolation to full accuracy.
We combine these observations to assess whether such trainable, transferable subnetworks exist in pre-trained BERT models.
arXiv Detail & Related papers (2020-07-23T19:35:39Z)
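On the first related paper above: the "low-rank updates" it studies are LoRA-style adapters, in which a frozen pre-trained weight W is adjusted by a trainable product (alpha / r) * B A of rank r. The sketch below is a minimal illustration under that reading, not code from that paper; the class name, initialization scale, and the choice to freeze the bias are assumptions.

```python
# Minimal LoRA-style linear layer (illustrative sketch): the frozen weight is
# adapted by a low-rank update (alpha / r) * B @ A, so only r * (in + out)
# extra parameters are trained.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int,
                 r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # frozen pre-trained weight
        self.base.bias.requires_grad_(False)     # assumption: bias frozen too
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)  # rank-r factor
        self.B = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())


# Hypothetical usage: replace an nn.Linear in a pre-trained model and train
# only A and B on the new task; varying r trades plasticity against forgetting.
```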