A Transformer-in-Transformer Network Utilizing Knowledge Distillation for Image Recognition
- URL: http://arxiv.org/abs/2502.16762v1
- Date: Mon, 24 Feb 2025 00:41:46 GMT
- Title: A Transformer-in-Transformer Network Utilizing Knowledge Distillation for Image Recognition
- Authors: Dewan Tauhid Rahman, Yeahia Sarker, Antar Mazumder, Md. Shamim Anower,
- Abstract summary: We propose an inner-outer transformer-based architecture, which gives attention to the global and local aspects of the image. Our approach enhances learning efficiency and effectiveness. Remarkably, the proposed Transformer-in-Transformer Network (TITN) model achieves impressive milestones across various datasets.
- Score: 0.8196125054032961
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a novel knowledge distillation neural architecture leveraging efficient transformer networks for effective image classification. Natural images display intricate arrangements encompassing numerous extraneous elements. Vision transformers compute attention over localized patches; however, exclusive dependence on patch segmentation is inadequate for capturing the image as a whole. To address this issue, we propose an inner-outer transformer-based architecture that attends to both the global and local aspects of the image. Moreover, training transformer models poses significant challenges due to their substantial resource, time, and data requirements. To tackle this, we integrate knowledge distillation into the architecture, enabling efficient learning. Leveraging insights from a larger teacher model, our approach enhances learning efficiency and effectiveness. Significantly, the transformer-in-transformer network acquires lightweight characteristics through distillation conducted within the feature extraction layer. Our network's robustness is established through substantial experimentation on the MNIST, CIFAR-10, and CIFAR-100 datasets, demonstrating strong top-1 and top-5 accuracy. The ablative analysis validates the effectiveness of the chosen parameters and settings, showcasing their superiority over contemporary methodologies. Remarkably, the proposed Transformer-in-Transformer Network (TITN) model achieves impressive performance milestones across datasets: the highest top-1 accuracy of 74.71% and a top-5 accuracy of 92.28% on CIFAR-100, an unparalleled top-1 accuracy of 92.03% and top-5 accuracy of 99.80% on CIFAR-10, and an exceptional top-1 accuracy of 99.56% on MNIST.
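To make the mechanisms the abstract describes concrete, below is a minimal PyTorch sketch of an inner-outer attention block together with a soft-target distillation loss. All module names, dimensions, and the fusion step are illustrative assumptions rather than the authors' implementation; in particular, the paper distills within the feature extraction layer, while this sketch uses the standard logit-level formulation of Hinton et al. as a stand-in.

```python
# Minimal sketch of the inner-outer (Transformer-in-Transformer style) idea
# plus soft-target knowledge distillation. Shapes, names, and the fusion
# step are assumptions for illustration, not the TITN authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InnerOuterBlock(nn.Module):
    """Inner attention refines pixel tokens within each patch (local view);
    outer attention mixes patch tokens across the image (global view)."""

    def __init__(self, inner_dim=24, outer_dim=192, pixels_per_patch=16, heads=4):
        super().__init__()
        self.inner_attn = nn.MultiheadAttention(inner_dim, heads, batch_first=True)
        self.outer_attn = nn.MultiheadAttention(outer_dim, heads, batch_first=True)
        # Projects each patch's flattened pixel tokens into its outer token.
        self.fuse = nn.Linear(pixels_per_patch * inner_dim, outer_dim)
        self.inner_norm = nn.LayerNorm(inner_dim)
        self.outer_norm = nn.LayerNorm(outer_dim)

    def forward(self, inner, outer):
        # inner: (B * P, pixels_per_patch, inner_dim) -- local detail per patch
        # outer: (B, P, outer_dim)                    -- one token per patch
        x = self.inner_norm(inner)
        inner = inner + self.inner_attn(x, x, x, need_weights=False)[0]
        B, P, _ = outer.shape
        # Inject the refined local features into the global patch tokens.
        outer = outer + self.fuse(inner.reshape(B, P, -1))
        y = self.outer_norm(outer)
        outer = outer + self.outer_attn(y, y, y, need_weights=False)[0]
        return inner, outer

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Classic soft-target KD: cross-entropy on the ground truth blended with
    KL divergence to the teacher's temperature-softened predictions."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft

# Smoke test with toy shapes: 2 images, 4 patches, 16 pixel tokens per patch.
block = InnerOuterBlock()
inner, outer = block(torch.randn(2 * 4, 16, 24), torch.randn(2, 4, 192))
print(inner.shape, outer.shape)  # torch.Size([8, 16, 24]) torch.Size([2, 4, 192])
```

A feature-level variant, closer to the abstract's distillation "within the feature extraction layer", could replace the logits above with intermediate outer tokens from teacher and student after projecting them to a common width.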
Related papers
- AdaptoVision: A Multi-Resolution Image Recognition Model for Robust and Scalable Classification [0.0]
AdaptoVision is a novel convolutional neural network (CNN) architecture designed to efficiently balance computational complexity and classification accuracy.
By leveraging enhanced residual units, depth-wise separable convolutions, and hierarchical skip connections, AdaptoVision significantly reduces parameter count and computational requirements (a minimal sketch of a depth-wise separable convolution appears after this list).
It achieves state-of-the-art results on the BreakHis dataset and comparable accuracy elsewhere, notably 95.3% on CIFAR-10 and 85.77% on CIFAR-100, without relying on any pretrained weights.
arXiv Detail & Related papers (2025-04-17T05:23:07Z) - A Fusion Model for Artwork Identification Based on Convolutional Neural Networks and Transformers [6.57747694461617]
This paper proposes a fusion model combining CNNs and Transformers for artwork identification.
Experiments on Chinese-painting and oil-painting datasets show that the fusion model outperforms standalone CNN and Transformer models.
arXiv Detail & Related papers (2025-02-25T10:52:38Z) - CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting a decoupled learning scheme and fully exploiting complementarity across features, our method achieves both high efficiency and accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z) - ReViT: Enhancing Vision Transformers Feature Diversity with Attention Residual Connections [8.372189962601077]
The Vision Transformer (ViT) self-attention mechanism is characterized by feature collapse in deeper layers.
We propose a novel residual attention learning method for improving ViT-based architectures.
arXiv Detail & Related papers (2024-02-17T14:44:10Z) - LaCViT: A Label-aware Contrastive Fine-tuning Framework for Vision Transformers [18.76039338977432]
Vision Transformers (ViTs) have emerged as popular models in computer vision, demonstrating state-of-the-art performance across various tasks.
We introduce a novel Label-aware Contrastive Training framework, LaCViT, which significantly enhances the quality of embeddings in ViTs.
LaCViT statistically significantly enhances the performance of three evaluated ViTs by up to 10.78% in top-1 accuracy.
arXiv Detail & Related papers (2023-03-31T12:38:08Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) of convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% top-1 accuracy on the ImageNet validation set and the best 91.2% top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - AdaViT: Adaptive Vision Transformers for Efficient Image Recognition [78.07924262215181]
We introduce AdaViT, an adaptive framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use.
Our method obtains more than a 2x improvement in efficiency compared to state-of-the-art vision transformers, with only a 0.8% drop in accuracy.
arXiv Detail & Related papers (2021-11-30T18:57:02Z) - CMT: Convolutional Neural Networks Meet Vision Transformers [68.10025999594883]
Vision transformers have been successfully applied to image recognition tasks due to their ability to capture long-range dependencies within an image.
There are still gaps in both performance and computational cost between transformers and existing convolutional neural networks (CNNs).
We propose a new transformer-based hybrid network that takes advantage of transformers to capture long-range dependencies and of CNNs to model local features.
In particular, our CMT-S achieves 83.5% top-1 accuracy on ImageNet while requiring 14x and 2x fewer FLOPs than the existing DeiT and EfficientNet, respectively.
arXiv Detail & Related papers (2021-07-13T17:47:19Z) - Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z) - MViT: Mask Vision Transformer for Facial Expression Recognition in the wild [77.44854719772702]
Facial Expression Recognition (FER) in the wild is an extremely challenging task in computer vision.
In this work, we first propose a novel pure transformer-based mask vision transformer (MViT) for FER in the wild.
Our MViT outperforms state-of-the-art methods on RAF-DB (88.62%), FERPlus (89.22%), and AffectNet-7 (64.57%), and achieves a comparable result on AffectNet-8 (61.40%).
arXiv Detail & Related papers (2021-06-08T16:58:10Z)
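As referenced in the AdaptoVision entry above, here is a minimal sketch of a depth-wise separable convolution, the parameter-saving building block that summary mentions. The class name and channel sizes are hypothetical, chosen only to illustrate the factorization.

```python
# Sketch of a depth-wise separable convolution: a depth-wise conv (one
# filter per channel) followed by a 1x1 point-wise conv, which cuts
# parameters versus a full convolution. Names and sizes are illustrative.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        # groups=in_ch makes each depth-wise filter see only its own channel.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Weight count for 64 -> 128 channels with a 3x3 kernel (biases ignored):
# full conv: 64 * 128 * 3 * 3 = 73,728; separable: 64*3*3 + 64*128 = 8,768.
layer = DepthwiseSeparableConv(64, 128)
print(layer(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 128, 32, 32])
```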