Which Backbone to Use: A Resource-efficient Domain Specific Comparison for Computer Vision
- URL: http://arxiv.org/abs/2406.05612v2
- Date: Sat, 29 Jun 2024 12:26:42 GMT
- Title: Which Backbone to Use: A Resource-efficient Domain Specific Comparison for Computer Vision
- Authors: Pranav Jeevan, Amit Sethi
- Abstract summary: Architectural backbones pre-trained on large datasets like ImageNet are commonly employed as feature extractors.
Our study systematically evaluates multiple lightweight, pre-trained CNN backbones under consistent training settings.
Our findings provide actionable insights into the performance trade-offs and effectiveness of different backbones.
- Score: 4.600687314645625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In contemporary computer vision applications, particularly image classification, architectural backbones pre-trained on large datasets like ImageNet are commonly employed as feature extractors. Despite the widespread use of these pre-trained convolutional neural networks (CNNs), there remains a gap in understanding the performance of various resource-efficient backbones across diverse domains and dataset sizes. Our study systematically evaluates multiple lightweight, pre-trained CNN backbones under consistent training settings across a variety of datasets, including natural images, medical images, galaxy images, and remote sensing images. This comprehensive analysis aims to aid machine learning practitioners in selecting the most suitable backbone for their specific problem, especially in scenarios involving small datasets where fine-tuning a pre-trained network is crucial. Even though attention-based architectures are gaining popularity, we observed that they tend to perform poorly in low-data fine-tuning tasks compared to CNNs. We also observed that some CNN architectures, such as ConvNeXt, RegNet, and EfficientNet, consistently perform well relative to others across a diverse set of domains. Our findings provide actionable insights into the performance trade-offs and effectiveness of different backbones, facilitating informed decision-making in model selection for a broad spectrum of computer vision domains. Our code is available here: https://github.com/pranavphoenix/Backbones
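As a concrete illustration of the workflow the paper studies (fine-tuning a lightweight, ImageNet-pre-trained backbone on a small dataset), here is a minimal PyTorch sketch. It is not taken from the authors' repository; the backbone choice, dataset path, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch: fine-tune an ImageNet-pre-trained lightweight backbone
# on a small dataset. Dataset path and hyperparameters are illustrative.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a resource-efficient backbone with ImageNet weights.
model = models.convnext_tiny(weights=models.ConvNeXt_Tiny_Weights.IMAGENET1K_V1)

# Replace the classification head to match the target dataset.
num_classes = 10  # assumption: small 10-class target dataset
model.classifier[2] = nn.Linear(model.classifier[2].in_features, num_classes)
model = model.to(device)

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
# "data/train" is a placeholder ImageFolder-style directory.
train_set = datasets.ImageFolder("data/train", transform=transform)
loader = DataLoader(train_set, batch_size=32, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):  # a few epochs often suffice on small datasets
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

Swapping `convnext_tiny` for `efficientnet_b0` or a `regnet_y` variant reproduces the kind of backbone comparison the paper performs.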
Related papers
- Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling [58.50618448027103]
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning.
This paper explores the differences across various CLIP-trained vision backbones.
The method achieves a remarkable increase in accuracy of up to 39.1% over the best single backbone.
arXiv Detail & Related papers (2024-05-27T12:59:35Z)
- Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances [49.631908848868505]
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning.
We investigate the differences in CLIP performance among various neural architectures.
We propose a simple, yet effective approach to combine predictions from multiple backbones, leading to a notable performance boost of up to 6.34%.
arXiv Detail & Related papers (2023-12-22T03:01:41Z)
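Both CLIP entries above report gains from combining the predictions of multiple vision backbones. Below is a minimal sketch of the underlying idea, plain weighted averaging of softmax outputs from two pre-trained classifiers; the adaptive, learned weighting proposed in these papers is not reproduced, and the model choices and weights are assumptions.

```python
# Minimal sketch of backbone ensembling: average the class probabilities
# of two pre-trained classifiers. The adaptive per-backbone weighting
# described in the papers is replaced by fixed illustrative weights.
import torch
from torchvision import models

resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()
convnext = models.convnext_tiny(
    weights=models.ConvNeXt_Tiny_Weights.IMAGENET1K_V1).eval()

def ensemble_predict(images: torch.Tensor, w1: float = 0.5, w2: float = 0.5):
    """Weighted average of softmax outputs from two backbones."""
    with torch.no_grad():
        p1 = torch.softmax(resnet(images), dim=1)
        p2 = torch.softmax(convnext(images), dim=1)
    return w1 * p1 + w2 * p2  # combined class probabilities

# Usage: probs = ensemble_predict(batch); preds = probs.argmax(dim=1)
```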
- Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks [139.3768582233067]
Battle of the Backbones (BoB) is a benchmarking tool for neural-network-based computer vision systems.
We find that vision transformers (ViTs) and self-supervised learning (SSL) are increasingly popular.
In apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets, we find that SSL backbones are highly competitive.
arXiv Detail & Related papers (2023-10-30T18:23:58Z)
- Benchmarking CNN on 3D Anatomical Brain MRI: Architectures, Data Augmentation and Deep Ensemble Learning [2.1446056201053185]
We propose an extensive benchmark of recent state-of-the-art (SOTA) 3D CNN, evaluating also the benefits of data augmentation and deep ensemble learning.
Experiments were conducted on a large multi-site 3D brain anatomical MRI dataset comprising N=10k scans, on 3 challenging tasks: age prediction, sex classification, and schizophrenia diagnosis.
We found that all models provide significantly better predictions with VBM images than quasi-raw data.
DenseNet and tiny-DenseNet, a lighter version that we proposed, provide a good compromise in terms of performance in all data regimes.
arXiv Detail & Related papers (2021-06-02T13:00:35Z)
- Deep Features for training Support Vector Machine [16.795405355504077]
This paper develops a generic computer vision system based on features extracted from trained CNNs.
Multiple learned features are combined into a single structure to work on different image classification tasks.
arXiv Detail & Related papers (2021-04-08T03:13:09Z)
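The "Deep Features for training Support Vector Machine" entry above follows a classic recipe: freeze a pre-trained CNN, extract penultimate-layer features, and train an SVM on them. A minimal sketch, with the backbone, feature layer, and SVM settings as assumptions (the placeholder tensors stand in for a real dataset):

```python
# Minimal sketch of the "deep features + SVM" recipe: use a frozen
# pre-trained CNN as a feature extractor and train a linear SVM on top.
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import LinearSVC
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()  # expose the 512-d penultimate features
backbone.eval()

def extract_features(images: torch.Tensor) -> np.ndarray:
    with torch.no_grad():
        return backbone(images).cpu().numpy()

# Placeholder data; replace with a real labeled image dataset.
train_images = torch.randn(64, 3, 224, 224)
train_labels = np.random.randint(0, 5, size=64)

svm = LinearSVC(C=1.0)
svm.fit(extract_features(train_images), train_labels)
```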
- Joint Learning of Neural Transfer and Architecture Adaptation for Image Recognition [77.95361323613147]
Current state-of-the-art visual recognition systems rely on pretraining a neural network on a large-scale dataset and finetuning the network weights on a smaller dataset.
In this work, we prove that dynamically adapting network architectures tailored to each domain task, together with weight fine-tuning, benefits both efficiency and effectiveness.
Our method can be easily generalized to an unsupervised paradigm by replacing supernet training with self-supervised learning in the source domain tasks and performing linear evaluation in the downstream tasks.
arXiv Detail & Related papers (2021-03-31T08:15:17Z)
- An Efficient Framework for Zero-Shot Sketch-Based Image Retrieval [36.254157442709264]
Zero-shot Sketch-based Image Retrieval (ZS-SBIR) has attracted the attention of the computer vision community due to its real-world applications.
ZS-SBIR inherits the main challenges of multiple computer vision problems, including content-based image retrieval (CBIR), zero-shot learning, and domain adaptation.
arXiv Detail & Related papers (2021-02-08T06:10:37Z)
- The Mind's Eye: Visualizing Class-Agnostic Features of CNNs [92.39082696657874]
We propose an approach to visually interpret CNN features given a set of images by creating corresponding images that depict the most informative features of a specific layer.
Our method uses a dual-objective activation and distance loss, without requiring a generator network or modifications to the original model.
arXiv Detail & Related papers (2021-01-29T07:46:39Z)
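The "Mind's Eye" entry above synthesizes images that excite a chosen layer, using a dual-objective activation and distance loss. The sketch below keeps only the basic activation-maximization step, with a single activation term; the network, layer index, and optimizer settings are assumptions, not the paper's configuration.

```python
# Bare-bones activation maximization: optimize a random image so that a
# chosen layer of a frozen CNN responds strongly. The paper's dual
# activation + distance objective is simplified to a single term.
import torch
from torchvision import models

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
for p in model.parameters():
    p.requires_grad_(False)

activation = {}
def hook(_module, _inp, out):
    activation["feat"] = out
model.features[15].register_forward_hook(hook)  # assumed mid-level layer

image = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for _ in range(200):
    optimizer.zero_grad()
    model(image)
    loss = -activation["feat"].mean()  # maximize mean activation
    loss.backward()
    optimizer.step()
# `image` now depicts patterns the chosen layer responds to.
```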
- Fusion of CNNs and statistical indicators to improve image classification [65.51757376525798]
Convolutional Networks have dominated the field of computer vision for the last ten years.
The main strategy for prolonging this trend relies on further scaling up networks in size.
We hypothesise that adding heterogeneous sources of information may be more cost-effective for a CNN than building a bigger network.
arXiv Detail & Related papers (2020-12-20T23:24:31Z)
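For the fusion entry above, a minimal sketch of the general idea: concatenate frozen CNN features with cheap per-image statistics before a single classifier. The statistics shown (mean, std, min, max) and the classifier are illustrative assumptions, not the paper's exact indicators.

```python
# Minimal sketch of fusing CNN features with cheap statistical
# indicators: concatenate both and train one classifier on the result.
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()
backbone.eval()

def fused_features(images: torch.Tensor) -> np.ndarray:
    with torch.no_grad():
        deep = backbone(images).cpu().numpy()  # (N, 512) CNN features
    flat = images.flatten(1).numpy()
    stats = np.stack([flat.mean(1), flat.std(1),          # simple per-image
                      flat.min(1), flat.max(1)], axis=1)  # statistics
    return np.concatenate([deep, stats], axis=1)

# Placeholder data; replace with a real dataset.
X = fused_features(torch.randn(32, 3, 224, 224))
y = np.random.randint(0, 3, size=32)
LogisticRegression(max_iter=1000).fit(X, y)
```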
- Convolution Neural Network Architecture Learning for Remote Sensing Scene Classification [22.29957803992306]
This paper proposes an automatic architecture learning procedure for remote sensing scene classification.
We introduce a learning strategy that allows efficient search in the architecture space by means of gradient descent.
An architecture generator finally maps the set of parameters into the CNN used in our experiments.
arXiv Detail & Related papers (2020-01-27T07:42:46Z)
- Inferring Convolutional Neural Networks' accuracies from their architectural characterizations [0.0]
We study the relationships between a CNN's architecture and its performance.
We show that the attributes can be predictive of the networks' performance in two specific computer vision-based physics problems.
We use machine learning models to predict whether a network can perform better than a certain threshold accuracy before training.
arXiv Detail & Related papers (2020-01-07T16:41:58Z)
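The last entry predicts, before training, whether a network will clear an accuracy threshold from its architectural attributes alone. A minimal sketch of that meta-learning setup on synthetic placeholder data (the attribute columns and labeling rule are invented for illustration):

```python
# Minimal sketch: predict from architectural attributes alone whether a
# CNN will exceed a threshold accuracy, before training it. Attributes
# and labels below are synthetic placeholders for a real study.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Assumed columns: depth, width, #params (M), #conv layers, kernel size.
X = rng.uniform([4, 16, 0.5, 2, 3], [50, 512, 100, 40, 7], size=(500, 5))
y = (X[:, 2] * 0.3 + X[:, 0] > 20).astype(int)  # toy "beats threshold" label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```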