A Comparative Study of Vision Transformers and CNNs for Few-Shot Rigid Transformation and Fundamental Matrix Estimation
- URL: http://arxiv.org/abs/2510.04794v1
- Date: Mon, 06 Oct 2025 13:18:27 GMT
- Title: A Comparative Study of Vision Transformers and CNNs for Few-Shot Rigid Transformation and Fundamental Matrix Estimation
- Authors: Alon Kaya, Igal Bilik, Inna Stainvas
- Abstract summary: Vision transformers (ViTs) and large-scale convolutional neural networks (CNNs) have reshaped computer vision through pretrained feature representations. This work considers two such tasks: 1) estimating 2D rigid transformations between pairs of images and 2) predicting the fundamental matrix for stereo image pairs. Empirical comparative analysis shows that, similar to training from scratch, ViTs outperform CNNs during refinement in large downstream-data scenarios.
- Score: 3.5684665108045377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers (ViTs) and large-scale convolutional neural networks (CNNs) have reshaped computer vision through pretrained feature representations that enable strong transfer learning for diverse tasks. However, their effectiveness as backbone architectures for geometric estimation tasks involving image deformations in low-data regimes remains an open question. This work considers two such tasks: 1) estimating 2D rigid transformations between pairs of images and 2) predicting the fundamental matrix for stereo image pairs, an important problem in applications such as autonomous mobility, robotics, and 3D scene reconstruction. To address this question, this work systematically compares large-scale CNNs (ResNet, EfficientNet, CLIP-ResNet) with ViT-based foundation models (CLIP-ViT variants and DINO) across various data-size settings, including few-shot scenarios. These pretrained models are optimized for classification or contrastive learning, which encourages them to focus mostly on high-level semantics. The considered tasks require balancing local and global features differently, challenging the straightforward adoption of these models as backbones. Empirical comparative analysis shows that, similar to training from scratch, ViTs outperform CNNs during refinement in large downstream-data scenarios. However, in small-data scenarios, the inductive bias and smaller capacity of CNNs improve their performance, allowing them to match that of ViTs. Moreover, ViTs exhibit stronger generalization in cross-domain evaluation where the data distribution changes. These results emphasize the importance of carefully selecting model architectures for refinement and motivate future research towards hybrid architectures that balance local and global representations.
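To make the two estimation targets concrete, the sketch below shows one minimal setup of the kind compared in the paper: a pretrained backbone (here torchvision's ResNet-50 standing in for the CNN candidates; a CLIP-ViT or DINO encoder could be substituted) encodes both images of a pair, and a small regression head predicts either the parameters of a 2D rigid transformation or the nine entries of a fundamental matrix. The head design, parameterization, and siamese feature concatenation are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class PairGeometryRegressor(nn.Module):
    """Pretrained backbone + small MLP head regressing pairwise geometry.
    out_dim=4 -> 2D rigid transform as (cos, sin, tx, ty);
    out_dim=9 -> fundamental matrix entries (reshaped to 3x3 downstream).
    Head design and parameterization are illustrative, not the paper's exact setup."""

    def __init__(self, out_dim: int = 4, freeze_backbone: bool = False):
        super().__init__()
        cnn = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
        self.backbone = nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier
        if freeze_backbone:
            # Optional for very small downstream datasets; the paper's
            # "refinement" setting also updates the backbone weights.
            for p in self.backbone.parameters():
                p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(2 * 2048, 512),
            nn.ReLU(),
            nn.Linear(512, out_dim),
        )

    def forward(self, img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
        feat_a = self.backbone(img_a).flatten(1)  # (B, 2048) pooled features
        feat_b = self.backbone(img_b).flatten(1)
        return self.head(torch.cat([feat_a, feat_b], dim=1))

model = PairGeometryRegressor(out_dim=4)
pred = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
print(pred.shape)  # torch.Size([2, 4])
```

For the fundamental-matrix variant (out_dim=9), the 3x3 prediction is typically normalized and projected to rank 2, and a common supervision signal is the epipolar constraint x'^T F x ≈ 0 evaluated over matched point pairs.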
Related papers
- Training a Custom CNN on Five Heterogeneous Image Datasets [1.4583375893645076]
This study investigates the effectiveness of CNN-based architectures across five datasets spanning agricultural and urban domains. These datasets introduce varying challenges, including differences in illumination, resolution, environmental complexity, and class imbalance. We evaluate a lightweight, task-specific custom CNN alongside established deep architectures, including ResNet-18 and VGG-16, trained both from scratch and using transfer learning.
arXiv Detail & Related papers (2026-01-08T08:44:17Z)
- Integrating ConvNeXt and Vision Transformers for Enhancing Facial Age Estimation [10.995974662579124]
We present a novel hybrid architecture that combines convolutional neural networks (CNNs) with Vision Transformers (ViTs). Our proposed ConvNeXt-ViT hybrid solution was thoroughly evaluated on benchmark age estimation datasets. The results show that the ConvNeXt-ViT hybrid not only outperforms traditional methods but also provides a robust foundation for future advances in age estimation. An illustrative fusion sketch follows this entry.
arXiv Detail & Related papers (2025-10-31T09:36:28Z)
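The hybrid design summarized in the entry above can be pictured as two parallel backbones whose pooled features are fused before a task head. The sketch below, using torchvision's convnext_tiny and vit_b_16 with simple feature concatenation and an age-regression head, is an illustrative assumption about such a fusion, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import (convnext_tiny, ConvNeXt_Tiny_Weights,
                                vit_b_16, ViT_B_16_Weights)

class ConvNeXtViTHybrid(nn.Module):
    """Illustrative CNN+ViT hybrid: pooled features from both branches are
    concatenated and fed to a small regression head (here, scalar age).
    This is a sketch of the general fusion idea, not the paper's model."""

    def __init__(self):
        super().__init__()
        self.cnn = convnext_tiny(weights=ConvNeXt_Tiny_Weights.IMAGENET1K_V1)
        self.cnn.classifier[2] = nn.Identity()   # keep 768-d pooled CNN features
        self.vit = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
        self.vit.heads = nn.Identity()           # keep 768-d class-token features
        self.head = nn.Sequential(
            nn.Linear(768 + 768, 256),
            nn.ReLU(),
            nn.Linear(256, 1),                   # scalar age prediction
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_feats = self.cnn(x)    # convolutional branch: local texture cues
        global_feats = self.vit(x)   # transformer branch: global context
        return self.head(torch.cat([local_feats, global_feats], dim=1))

model = ConvNeXtViTHybrid()
age = model(torch.randn(2, 3, 224, 224))  # shape (2, 1)
```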
- Heuristical Comparison of Vision Transformers Against Convolutional Neural Networks for Semantic Segmentation on Remote Sensing Imagery [0.0]
Vision Transformers (ViTs) have brought a new wave of research in the field of computer vision. This paper compares three key factors of using (or not using) ViTs for semantic segmentation of aerial images. We show that the novel combined weighted loss function significantly boosts the CNN model's performance compared to transfer learning with ViT.
arXiv Detail & Related papers (2024-11-14T00:18:04Z)
- Convolutional Neural Networks and Vision Transformers for Fashion MNIST Classification: A Literature Review [1.0937094979510213]
This review presents a comparative analysis of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in the domain of image classification.
Our goal is to determine the most appropriate architecture between ViT and CNN for classifying images in the Fashion MNIST dataset within the e-commerce industry.
arXiv Detail & Related papers (2024-06-05T17:32:22Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention-based vision transformers (ViTs) have emerged as a highly competitive architectural alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
- A Data-scalable Transformer for Medical Image Segmentation: Architecture, Model Efficiency, and Benchmark [45.543140413399506]
MedFormer is a data-scalable Transformer designed for generalizable 3D medical image segmentation.
Our approach incorporates three key elements: a desirable inductive bias, hierarchical modeling with linear-complexity attention, and multi-scale feature fusion.
arXiv Detail & Related papers (2022-02-28T22:59:42Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose ViTAE, a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) of convolutions.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers. A sketch of the layer-wise similarity measure commonly used for such comparisons follows this entry.
arXiv Detail & Related papers (2021-08-19T17:27:03Z)
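The layer-wise representation comparison summarized in the entry above is typically performed with centered kernel alignment (CKA), the similarity measure used in that study. Below is a minimal linear-CKA sketch, assuming activations are given as (samples x features) matrices; the study's exact pipeline (layer selection, pooling of ViT tokens) is not reproduced here.

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between two activation matrices of shape (n_samples, dim).
    Returns a scalar in [0, 1]; higher means more similar representations."""
    x = x - x.mean(dim=0, keepdim=True)  # center each feature
    y = y - y.mean(dim=0, keepdim=True)
    cross = torch.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = torch.linalg.norm(x.T @ x, ord="fro")
    norm_y = torch.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

# Hypothetical example: pooled ViT-layer tokens vs. a ResNet-stage's features
# computed on the same 512 images.
vit_acts = torch.randn(512, 768)
cnn_acts = torch.randn(512, 2048)
print(float(linear_cka(vit_acts, cnn_acts)))
```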
- How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers [74.06040005144382]
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study to better understand the interplay between the amount of training data, data augmentation and model regularization ("AugReg"), model size, and compute budget; an illustrative AugReg configuration sketch follows this entry.
We train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
arXiv Detail & Related papers (2021-06-18T17:58:20Z)
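The "AugReg" knobs studied in the entry above are combinations of data augmentation and model regularization. The snippet below sketches one such combination (RandAugment plus Mixup and weight decay); the specific operations and hyperparameter values are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torchvision.transforms as T

# Augmentation side of "AugReg": RandAugment on top of standard crops/flips.
train_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandAugment(num_ops=2, magnitude=10),
    T.ToTensor(),
])

def mixup(images: torch.Tensor, labels: torch.Tensor, alpha: float = 0.2):
    """Mixup regularization: convex combination of samples within a batch.
    The loss becomes lam * CE(pred, y_a) + (1 - lam) * CE(pred, y_b)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1.0 - lam) * images[perm]
    return mixed, labels, labels[perm], lam

# Regularization side: weight decay (plus dropout / stochastic depth in the ViT).
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
```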
- Efficient Training of Visual Transformers with Small-Size Datasets [64.60765211331697]
Visual Transformers (VTs) are emerging as an architectural alternative to convolutional neural networks (CNNs).
We show that, despite achieving comparable accuracy when trained on ImageNet, their performance on smaller datasets can differ substantially. We propose a self-supervised task which can extract additional information from images with only negligible computational overhead; a sketch of one plausible such auxiliary objective follows this entry.
arXiv Detail & Related papers (2021-06-07T16:14:06Z)
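One plausible form of such a low-overhead self-supervised task, in the spirit of the entry above, is a dense relative-localization objective: pairs of output token embeddings are asked to predict the 2D offset between their grid positions. The sketch below is a generic illustration under that assumption and may differ from the paper's exact formulation; the token dimensionality and grid size are hypothetical.

```python
import torch
import torch.nn as nn

def relative_localization_loss(tokens: torch.Tensor, grid: int, mlp: nn.Module,
                               n_pairs: int = 64) -> torch.Tensor:
    """tokens: (B, grid*grid, D) output patch embeddings of a ViT.
    For random token pairs, an MLP predicts the normalized (row, col) offset
    between their positions on the patch grid, trained with an L1 loss.
    Generic illustration; the referenced paper's exact task may differ."""
    B, N, D = tokens.shape
    i = torch.randint(N, (B, n_pairs))
    j = torch.randint(N, (B, n_pairs))
    t_i = torch.gather(tokens, 1, i.unsqueeze(-1).expand(-1, -1, D))
    t_j = torch.gather(tokens, 1, j.unsqueeze(-1).expand(-1, -1, D))
    pred = mlp(torch.cat([t_i, t_j], dim=-1))  # (B, n_pairs, 2)

    def grid_pos(idx: torch.Tensor) -> torch.Tensor:
        return torch.stack([idx // grid, idx % grid], dim=-1).float() / grid

    target = grid_pos(i) - grid_pos(j)         # normalized 2D offsets
    return (pred - target).abs().mean()

# Hypothetical dimensions: a 14x14 grid of 384-d tokens (e.g., a ViT-Small).
mlp = nn.Sequential(nn.Linear(2 * 384, 256), nn.ReLU(), nn.Linear(256, 2))
loss = relative_localization_loss(torch.randn(2, 14 * 14, 384), grid=14, mlp=mlp)
```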
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViTs) have demonstrated impressive performance across various machine vision problems. We systematically study their properties via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN). We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)