CNN and ViT Efficiency Study on Tiny ImageNet and DermaMNIST Datasets
- URL: http://arxiv.org/abs/2505.08259v1
- Date: Tue, 13 May 2025 06:17:18 GMT
- Title: CNN and ViT Efficiency Study on Tiny ImageNet and DermaMNIST Datasets
- Authors: Aidar Amangeldi, Angsar Taigonyrov, Muhammad Huzaid Jawad, Chinedu Emmanuel Mbonu
- Abstract summary: We introduce a fine-tuning strategy applied to four Vision Transformer variants (Tiny, Small, Base, Large) on DermaMNIST and Tiny ImageNet. We demonstrate that appropriately fine-tuned Vision Transformers can match or exceed the baseline's performance, achieve faster inference, and operate with fewer parameters.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study evaluates the trade-offs between convolutional and transformer-based architectures on both medical and general-purpose image classification benchmarks. We use ResNet-18 as our baseline and introduce a fine-tuning strategy applied to four Vision Transformer variants (Tiny, Small, Base, Large) on DermaMNIST and Tiny ImageNet. Our goal is to reduce inference latency and model complexity while keeping accuracy degradation acceptable. Through systematic hyperparameter variations, we demonstrate that appropriately fine-tuned Vision Transformers can match or exceed the baseline's performance, achieve faster inference, and operate with fewer parameters, highlighting their viability for deployment in resource-constrained environments.
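To make the latency/parameter trade-off concrete, here is a minimal sketch (not the authors' released code) of the kind of efficiency comparison the abstract describes. Model names follow the timm naming convention, and the 200-class output head for Tiny ImageNet is an assumption; timings are illustrative only.

```python
# Hypothetical efficiency comparison: parameter counts and CPU inference
# latency for a ResNet-18 baseline vs. the four ViT variants. This is a
# sketch of the evaluation setup, not the paper's actual benchmark code.
import time
import torch
import timm  # model names below follow timm's conventions (an assumption)

MODELS = [
    "resnet18",
    "vit_tiny_patch16_224",
    "vit_small_patch16_224",
    "vit_base_patch16_224",
    "vit_large_patch16_224",
]

@torch.no_grad()
def benchmark(name: str, runs: int = 20) -> None:
    # num_classes=200 assumes the Tiny ImageNet label space.
    model = timm.create_model(name, pretrained=False, num_classes=200).eval()
    params = sum(p.numel() for p in model.parameters()) / 1e6
    x = torch.randn(1, 3, 224, 224)
    model(x)  # warm-up pass
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    latency_ms = (time.perf_counter() - start) / runs * 1e3
    print(f"{name:24s} {params:7.1f}M params  {latency_ms:7.1f} ms/image")

for m in MODELS:
    benchmark(m)
```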
Related papers
- BHViT: Binarized Hybrid Vision Transformer
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNNs). We propose BHViT, a binarization-friendly hybrid ViT architecture, and its fully binarized model, guided by three important observations. Our proposed algorithm achieves SOTA performance among binary ViT methods.
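As a rough illustration of what "binarization" means here, the sketch below applies a classic BNN-style per-tensor scheme (sign of the weights times a scaling factor); it is not BHViT's method, which binarizes a hybrid ViT architecture.

```python
# Toy weight binarization in the classic BNN/XNOR style: replace
# full-precision weights with their sign, scaled by the mean absolute
# value. Illustrative only; BHViT's actual scheme is more involved.
import torch

def binarize(w: torch.Tensor) -> torch.Tensor:
    alpha = w.abs().mean()        # per-tensor scaling factor
    return alpha * torch.sign(w)  # weights collapse to {-alpha, +alpha}

w = torch.randn(64, 64)
print(binarize(w).unique())       # two distinct values
```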
arXiv Detail & Related papers (2025-03-04T08:35:01Z)

- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment
Task-oriented edge computing addresses the demands of real-time visual inference in IoVT systems by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework that jointly optimizes the neural network architecture and its edge deployment.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)

- CWT-Net: Super-resolution of Histopathology Images Using a Cross-scale Wavelet-based Transformer
Super-resolution (SR) aims to enhance the quality of low-resolution images and has been widely applied in medical imaging.
We propose a novel network called CWT-Net, which leverages cross-scale image wavelet transform and Transformer architecture.
Our model significantly outperforms state-of-the-art methods in both quantitative performance and visual quality evaluations.
arXiv Detail & Related papers (2024-09-11T08:26:28Z)

- GenFormer -- Generated Images are All You Need to Improve Robustness of Transformers on Small Datasets
We propose GenFormer, a data augmentation strategy utilizing generated images to improve transformer accuracy and robustness on small-scale image classification tasks.
In our comprehensive evaluation we propose Tiny ImageNetV2, -R, and -A as new test set variants of Tiny ImageNet.
We show the effectiveness of our approach under challenging conditions with limited training data, achieving significant improvements in both accuracy and robustness.
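The core recipe is simple to sketch: pool generated images with the small real training set before training. The snippet below uses random tensors as stand-ins for both; the datasets and generator are placeholders, not the paper's pipeline.

```python
# Hedged sketch of GenFormer-style augmentation: concatenate a small real
# training set with a larger pool of generated images. Random tensors
# stand in for real Tiny ImageNet data and for the generated images.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

real = TensorDataset(torch.randn(500, 3, 64, 64),
                     torch.randint(0, 200, (500,)))
generated = TensorDataset(torch.randn(2000, 3, 64, 64),
                          torch.randint(0, 200, (2000,)))

train_set = ConcatDataset([real, generated])  # real + synthetic samples
loader = DataLoader(train_set, batch_size=64, shuffle=True)
print(len(train_set))                         # 2500 training samples
```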
arXiv Detail & Related papers (2024-08-26T09:26:08Z)

- Efficient Visual State Space Model for Image Deblurring
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.
We propose a simple yet effective visual state space model (EVSSM) for image deblurring.
arXiv Detail & Related papers (2024-05-23T09:13:36Z)

- Distance Weighted Trans Network for Image Completion
We propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
arXiv Detail & Related papers (2023-10-11T12:46:11Z)

- A comparative study between vision transformers and CNNs in digital pathology
This work explores vision transformers for tumor detection in digital pathology whole slide images in four tissue types.
We compared the vision transformer DeiT-Tiny to the state-of-the-art convolutional neural network ResNet18.
The results show that the vision transformer performed slightly better than the ResNet18 for three of four tissue types for tumor detection while the ResNet18 performed slightly better for the remaining tasks.
arXiv Detail & Related papers (2022-06-01T10:41:11Z)

- Vision Transformer with Convolutions Architecture Search
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)

- Patch Similarity Aware Data-Free Quantization for Vision Transformers
We propose PSAQ-ViT, a Patch Similarity Aware data-free Quantization framework for Vision Transformers.
We analyze the self-attention module's properties and reveal a general difference (patch similarity) in its processing of Gaussian noise and real images.
Experiments and ablation studies are conducted on various benchmarks to validate the effectiveness of PSAQ-ViT.
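The "patch similarity" cue is easy to illustrate outside the quantization pipeline: patches of a structured image correlate with one another far more than patches of Gaussian noise. The toy function below measures this directly on pixels; PSAQ-ViT itself analyzes the self-attention module, so this is only an analogy.

```python
# Toy illustration of the patch-similarity gap between real (structured)
# images and Gaussian noise. Not PSAQ-ViT itself, which measures this
# inside the self-attention module rather than in pixel space.
import torch
import torch.nn.functional as F

def mean_patch_similarity(img: torch.Tensor, patch: int = 16) -> float:
    # Split a (3, H, W) image into flattened non-overlapping patches.
    p = img.unfold(1, patch, patch).unfold(2, patch, patch)
    p = p.reshape(3, -1, patch * patch).permute(1, 0, 2).flatten(1)
    # Mean pairwise cosine similarity across all patches.
    sim = F.cosine_similarity(p.unsqueeze(1), p.unsqueeze(0), dim=-1)
    return sim.mean().item()

noise = torch.randn(3, 224, 224)
smooth = torch.linspace(0, 1, 224).repeat(3, 224, 1)  # smooth gradient image
print(mean_patch_similarity(noise))   # near zero
print(mean_patch_similarity(smooth))  # much higher
```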
arXiv Detail & Related papers (2022-03-04T11:47:20Z)

- AdaViT: Adaptive Tokens for Efficient Vision Transformer
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
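A minimal version of the underlying idea, token reduction between blocks, can be sketched in a few lines: score each token, keep the top-k, and let subsequent blocks run on the shorter sequence. AdaViT learns this decision end to end; the scoring rule below is a placeholder.

```python
# Sketch of token reduction between transformer blocks: keep only the
# highest-scoring tokens so later blocks process a shorter sequence.
# AdaViT learns which tokens to keep; the norm-based score is a stand-in.
import torch

def keep_top_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep: int):
    # tokens: (batch, n_tokens, dim); scores: (batch, n_tokens)
    idx = scores.topk(keep, dim=1).indices                  # (batch, keep)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    return tokens.gather(1, idx)                            # (batch, keep, dim)

x = torch.randn(2, 197, 192)        # ViT-Tiny-sized token sequence
scores = x.norm(dim=-1)             # placeholder importance score
print(keep_top_tokens(x, scores, keep=98).shape)  # torch.Size([2, 98, 192])
```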
arXiv Detail & Related papers (2021-12-14T18:56:07Z)

- Understanding Robustness of Transformers for Image Classification
Vision Transformer (ViT) has surpassed ResNets for image classification.
Differences in the Transformer architecture lead one to wonder whether these networks are as robust as their convolutional counterparts.
We find that ViT models are at least as robust as the ResNet counterparts on a broad range of perturbations.
arXiv Detail & Related papers (2021-03-26T16:47:55Z)

- Scalable Visual Transformers with Hierarchical Pooling
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
This yields substantial benefits when scaling depth, width, resolution, and patch size, without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
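The pooling step itself is straightforward to sketch: treat the token sequence as a 1D signal and downsample it between stages. The snippet below halves a ViT-style token sequence with a max-pool; HVT's full architecture is, of course, more than this one operation.

```python
# Sketch of hierarchical token pooling: a 1D max-pool over the token axis
# halves the sequence length between transformer stages. Illustrative
# only; HVT integrates this into a multi-stage architecture.
import torch
import torch.nn as nn

pool = nn.MaxPool1d(kernel_size=2, stride=2)

tokens = torch.randn(2, 196, 384)                    # (batch, seq_len, dim)
pooled = pool(tokens.transpose(1, 2)).transpose(1, 2)
print(pooled.shape)                                  # torch.Size([2, 98, 384])
```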
arXiv Detail & Related papers (2021-03-19T03:55:58Z)

- Perceptually Optimizing Deep Image Compression
Mean squared error (MSE) and $\ell_p$ norms have largely dominated the measurement of loss in neural networks.
We propose a different proxy approach to optimize image analysis networks against quantitative perceptual models.
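The contrast the abstract draws can be sketched as two loss functions: a pixel-space MSE and a proxy loss computed in the feature space of some differentiable model. The feature extractor below is an untrained placeholder, not the quantitative perceptual models the paper optimizes against.

```python
# Pixel-space MSE vs. a feature-space ("perceptual") proxy loss. The
# feature extractor is a random placeholder standing in for a perceptual
# model; only the structure of the comparison is the point here.
import torch
import torch.nn as nn

feature_net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())

def mse_loss(pred, target):
    return ((pred - target) ** 2).mean()

def perceptual_proxy_loss(pred, target):
    return ((feature_net(pred) - feature_net(target)) ** 2).mean()

pred, target = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(mse_loss(pred, target).item(), perceptual_proxy_loss(pred, target).item())
```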
arXiv Detail & Related papers (2020-07-03T14:33:28Z)