Battle of the Backbones: A Large-Scale Comparison of Pretrained Models
across Computer Vision Tasks
- URL: http://arxiv.org/abs/2310.19909v2
- Date: Mon, 20 Nov 2023 03:05:50 GMT
- Authors: Micah Goldblum, Hossein Souri, Renkun Ni, Manli Shu, Viraj Prabhu,
Gowthami Somepalli, Prithvijit Chattopadhyay, Mark Ibrahim, Adrien Bardes,
Judy Hoffman, Rama Chellappa, Andrew Gordon Wilson, Tom Goldstein
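- Abstract summary: Battle of the Backbones (BoB) benchmarks a diverse suite of pretrained backbones across a diverse set of computer vision tasks.
While vision transformers (ViTs) and self-supervised learning (SSL) are increasingly popular, we find that convolutional neural networks pretrained in a supervised fashion on large training sets still perform best on most tasks among the models we consider.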
In apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets, we find that SSL backbones are highly competitive.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural network based computer vision systems are typically built on a
backbone, a pretrained or randomly initialized feature extractor. Several years
ago, the default option was an ImageNet-trained convolutional neural network.
However, the recent past has seen the emergence of countless backbones
pretrained using various algorithms and datasets. While this abundance of
choice has led to performance increases for a range of systems, it is difficult
for practitioners to make informed decisions about which backbone to choose.
Battle of the Backbones (BoB) makes this choice easier by benchmarking a
diverse suite of pretrained models, including vision-language models, those
trained via self-supervised learning, and the Stable Diffusion backbone, across
a diverse set of computer vision tasks ranging from classification to object
detection to OOD generalization and more. Furthermore, BoB sheds light on
promising directions for the research community to advance computer vision by
illuminating strengths and weaknesses of existing approaches through a
comprehensive analysis conducted on more than 1500 training runs. While vision
transformers (ViTs) and self-supervised learning (SSL) are increasingly
popular, we find that convolutional neural networks pretrained in a supervised
fashion on large training sets still perform best on most tasks among the
models we consider. Moreover, in apples-to-apples comparisons on the same
architectures and similarly sized pretraining datasets, we find that SSL
backbones are highly competitive, indicating that future works should perform
SSL pretraining with advanced architectures and larger pretraining datasets. We
release the raw results of our experiments along with code that allows
researchers to put their own backbones through the gauntlet here:
https://github.com/hsouri/Battle-of-the-Backbones
Related papers
- Which Backbone to Use: A Resource-efficient Domain Specific Comparison for Computer Vision [4.600687314645625]
Architectural backbones pre-trained on large datasets like ImageNet are commonly employed as feature extractors.
Our study systematically evaluates multiple lightweight, pre-trained CNN backbones under consistent training settings.
Our findings provide actionable insights into the performance trade-offs and effectiveness of different backbones.
arXiv Detail & Related papers (2024-06-09T02:01:25Z)
- Comparison of self-supervised in-domain and supervised out-domain transfer learning for bird species recognition [0.19183348587701113]
Transferring the weights of a pre-trained model to assist another task has become a crucial part of modern deep learning.
Our experiments will demonstrate the usefulness of in-domain models and datasets for bird species recognition.
arXiv Detail & Related papers (2024-04-26T08:47:28Z) - Transfer Learning between Motor Imagery Datasets using Deep Learning --
Validation of Framework and Comparison of Datasets [0.0]
We present a simple deep learning-based framework commonly used in computer vision.
We demonstrate its effectiveness for cross-dataset transfer learning in mental imagery decoding tasks.
arXiv Detail & Related papers (2023-09-04T20:58:57Z) - Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z) - Ensembling Off-the-shelf Models for GAN Training [55.34705213104182]
We find that pretrained computer vision models can significantly improve performance when used in an ensemble of discriminators.
We propose an effective selection mechanism, by probing the linear separability between real and fake samples in pretrained model embeddings.
Our method can improve GAN training in both limited data and large-scale settings.
arXiv Detail & Related papers (2021-12-16T18:59:50Z) - A Simple Long-Tailed Recognition Baseline via Vision-Language Model [92.2866546058082]
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems.
Recent advances in contrastive visual-language pretraining shed light on a new pathway for visual recognition.
We propose BALLAD to leverage contrastive vision-language models for long-tailed recognition.
arXiv Detail & Related papers (2021-11-29T17:49:24Z) - Self-supervised Audiovisual Representation Learning for Remote Sensing Data [96.23611272637943]
We propose a self-supervised approach for pre-training deep neural networks in remote sensing.
By exploiting the correspondence between geo-tagged audio recordings and remote sensing, this is done in a completely label-free manner.
We show that our approach outperforms existing pre-training strategies for remote sensing imagery.
arXiv Detail & Related papers (2021-08-02T07:50:50Z) - The Lottery Tickets Hypothesis for Supervised and Self-supervised
Pre-training in Computer Vision Models [115.49214555402567]
Pre-trained weights often boost a wide range of downstream tasks including classification, detection, and segmentation.
Recent studies suggest that pre-training benefits from gigantic model capacity.
In this paper, we examine supervised and self-supervised pre-trained models through the lens of the lottery ticket hypothesis (LTH).
arXiv Detail & Related papers (2020-12-12T21:53:55Z)