Combined CNN and ViT features off-the-shelf: Another astounding baseline for recognition
- URL: http://arxiv.org/abs/2407.19472v2
- Date: Sun, 6 Oct 2024 03:23:29 GMT
- Title: Combined CNN and ViT features off-the-shelf: Another astounding baseline for recognition
- Authors: Fernando Alonso-Fernandez, Kevin Hernandez-Diaz, Prayag Tiwari, Josef Bigun
- Abstract summary: We apply pre-trained architectures, originally developed for the ImageNet Large Scale Visual Recognition Challenge, for periocular recognition.
Middle-layer features from CNNs and ViTs are a suitable way to recognize individuals based on periocular images.
- Score: 49.14350399025926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We apply pre-trained architectures, originally developed for the ImageNet Large Scale Visual Recognition Challenge, for periocular recognition. These architectures have demonstrated significant success in various computer vision tasks beyond the ones for which they were designed. This work builds on our previous study using off-the-shelf Convolutional Neural Networks (CNNs) and extends it to include the more recently proposed Vision Transformers (ViTs). Despite being trained for generic object classification, middle-layer features from CNNs and ViTs are a suitable way to recognize individuals based on periocular images. We also demonstrate that CNNs and ViTs are highly complementary since their combination results in boosted accuracy. In addition, we show that a small portion of these pre-trained models can achieve good accuracy, resulting in thinner models with fewer parameters, suitable for resource-limited environments such as mobiles. This efficiency improves if traditional handcrafted features are added as well.
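The abstract states that combining CNN and ViT features boosts accuracy, but it does not say how the features are compared or combined. The following is a minimal sketch of one common option, assuming cosine similarity as the per-network match score and a weighted average as the fusion rule. The function names, the `w_cnn` weight, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def fused_score(
    cnn_a: Sequence[float],
    cnn_b: Sequence[float],
    vit_a: Sequence[float],
    vit_b: Sequence[float],
    w_cnn: float = 0.5,
) -> float:
    """Weighted average of the CNN and ViT match scores (assumed fusion rule).

    cnn_a / vit_a are features of image A from each network,
    cnn_b / vit_b are features of image B. w_cnn is a hypothetical weight.
    """
    s_cnn = cosine_similarity(cnn_a, cnn_b)
    s_vit = cosine_similarity(vit_a, vit_b)
    return w_cnn * s_cnn + (1.0 - w_cnn) * s_vit


# Toy vectors standing in for middle-layer features of two periocular images.
score = fused_score([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
print(score)
```

In practice the feature vectors would come from an intermediate layer of each pre-trained network, and the fusion weight would be chosen on a validation set.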
Related papers
- A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis [9.687982148528187]
Convolutional Neural Networks (CNNs) are currently among the best texture analysis approaches.
Vision Transformers (ViTs) have been surpassing the performance of CNNs on tasks such as object recognition.
This work explores various pre-trained ViT architectures when transferred to tasks that rely on textures.
arXiv Detail & Related papers (2024-06-10T09:48:13Z)
- Convolutional Neural Networks and Vision Transformers for Fashion MNIST Classification: A Literature Review [1.0937094979510213]
This review explores the comparative analysis between Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in the domain of image classification.
Our goal is to determine the most appropriate architecture between ViT and CNN for classifying images in the Fashion MNIST dataset within the e-commerce industry.
arXiv Detail & Related papers (2024-06-05T17:32:22Z)
- Structured Initialization for Attention in Vision Transformers [34.374054040300805]
Convolutional neural networks (CNNs) have an architectural inductive bias enabling them to perform well on small-scale problems.
We argue that the architectural bias inherent to CNNs can be reinterpreted as an initialization bias within ViT.
This insight is significant as it empowers ViTs to perform equally well on small-scale problems while maintaining their flexibility for large-scale applications.
arXiv Detail & Related papers (2024-04-01T14:34:47Z)
- LowDINO -- A Low Parameter Self Supervised Learning Model [0.0]
This research aims to explore the possibility of designing a neural network architecture that allows for small networks to adopt the properties of huge networks.
Previous studies have shown that using convolutional neural networks (ConvNets) can provide inherent inductive bias.
To reduce the number of parameters, attention mechanisms are utilized through the usage of MobileViT blocks.
arXiv Detail & Related papers (2023-05-28T18:34:59Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- Improving Vision Transformers by Revisiting High-frequency Components [106.7140968644414]
We show that Vision Transformer (ViT) models are less effective in capturing the high-frequency components of images than CNN models.
To compensate, we propose HAT, which directly augments high-frequency components of images via adversarial training.
We show that HAT can consistently boost the performance of various ViT models.
arXiv Detail & Related papers (2022-04-03T05:16:51Z)
- VOLO: Vision Outlooker for Visual Recognition [148.12522298731807]
Vision transformers (ViTs) have shown great potential of self-attention based models in ImageNet classification.
We introduce a novel outlook attention and present a simple and general architecture, termed Vision Outlooker (VOLO).
Unlike self-attention that focuses on global dependency modeling at a coarse level, the outlook attention efficiently encodes finer-level features and contexts into tokens.
Experiments show that our VOLO achieves 87.1% top-1 accuracy on ImageNet-1K classification, which is the first model exceeding 87% accuracy on this competitive benchmark.
arXiv Detail & Related papers (2021-06-24T15:46:54Z)
- Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study these properties via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.