EVCC: Enhanced Vision Transformer-ConvNeXt-CoAtNet Fusion for Classification
- URL: http://arxiv.org/abs/2511.18691v1
- Date: Mon, 24 Nov 2025 02:11:19 GMT
- Title: EVCC: Enhanced Vision Transformer-ConvNeXt-CoAtNet Fusion for Classification
- Authors: Kazi Reyazul Hasan, Md Nafiu Rahman, Wasif Jalal, Sadif Ahmed, Shahriar Raj, Mubasshira Musarrat, Muhammad Abdullah Adnan
- Abstract summary: Hybrid vision architectures combining Transformers and CNNs have significantly advanced image classification, but usually at substantial computational cost. We introduce EVCC, a novel multi-branch architecture integrating the Vision Transformer, lightweight ConvNeXt, and CoAtNet. Experiments across the CIFAR-100, Tobacco3482, CelebA, and Brain Cancer datasets demonstrate EVCC's superiority over powerful models.
- Score: 0.5394291557377919
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Hybrid vision architectures combining Transformers and CNNs have significantly advanced image classification, but usually at substantial computational cost. We introduce EVCC (Enhanced Vision Transformer-ConvNeXt-CoAtNet), a novel multi-branch architecture integrating the Vision Transformer, lightweight ConvNeXt, and CoAtNet through key innovations: (1) adaptive token pruning with information preservation, (2) gated bidirectional cross-attention for enhanced feature refinement, (3) auxiliary classification heads for multi-task learning, and (4) a dynamic router gate employing context-aware, confidence-driven weighting. Experiments across the CIFAR-100, Tobacco3482, CelebA, and Brain Cancer datasets demonstrate EVCC's superiority over powerful models like DeiT-Base, MaxViT-Base, and CrossViT-Base, consistently achieving state-of-the-art accuracy with improvements of up to 2 percentage points while reducing FLOPs by 25 to 35%. Our adaptive architecture adjusts computational demands to deployment needs by dynamically reducing token count, efficiently balancing the accuracy-efficiency trade-off while combining global context, local details, and hierarchical features for real-world applications. The source code of our implementation is available at https://anonymous.4open.science/r/EVCC.
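The abstract does not spell out the pruning rule, so the sketch below is only an illustrative assumption of what "adaptive token pruning with information preservation" could look like: keep the highest-scoring tokens and fuse the pruned ones into a single aggregate token instead of discarding them. The function name, the score-weighted fusion rule, and the keep ratio are all hypothetical, not the paper's exact method.

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the top-scoring tokens; fuse the rest into one aggregate token.

    tokens: (N, D) array of token embeddings
    scores: (N,) importance scores (e.g. attention received per token)
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    order = np.argsort(scores)[::-1]          # highest score first
    kept = tokens[order[:n_keep]]
    pruned = tokens[order[n_keep:]]
    if len(pruned) == 0:
        return kept
    # weight pruned tokens by their softmaxed scores before fusing,
    # so their information is summarized rather than dropped
    w = np.exp(scores[order[n_keep:]])
    w = w / w.sum()
    fused = (w[:, None] * pruned).sum(axis=0, keepdims=True)
    return np.concatenate([kept, fused], axis=0)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16))
scores = rng.standard_normal(8)
out = prune_tokens(tokens, scores, keep_ratio=0.5)  # 4 kept + 1 fused token
```

Downstream attention then runs on 5 tokens instead of 8, which is where the FLOP savings would come from.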
Related papers
- SpaRTAN: Spatial Reinforcement Token-based Aggregation Network for Visual Recognition [15.125734989910429]
SpaRTAN is a lightweight architectural design that enhances spatial and channel-wise information processing. SpaRTAN achieves remarkable efficiency while maintaining competitive performance.
arXiv Detail & Related papers (2025-07-15T05:34:56Z) - FTCFormer: Fuzzy Token Clustering Transformer for Image Classification [22.410199372985584]
Transformer-based deep neural networks have achieved remarkable success across various computer vision tasks. Most transformer architectures embed images into uniform, grid-based vision tokens, neglecting the underlying semantic meaning of image regions. We propose the Fuzzy Token Clustering Transformer (FTCFormer) to dynamically generate vision tokens based on semantic meaning rather than spatial position.
arXiv Detail & Related papers (2025-07-14T13:49:47Z) - S2AFormer: Strip Self-Attention for Efficient Vision Transformer [37.930090368513355]
Vision Transformers (ViTs) have made significant advancements in computer vision. Recent methods have combined the strengths of convolutions and self-attention to achieve better trade-offs. We propose S2AFormer, an efficient Vision Transformer architecture featuring a novel Strip Self-Attention (SSA) mechanism.
arXiv Detail & Related papers (2025-05-28T10:17:23Z) - ECViT: Efficient Convolutional Vision Transformer with Local-Attention and Multi-scale Stages [0.0]
Vision Transformers (ViTs) have revolutionized computer vision by leveraging self-attention to model long-range dependencies. We propose the Efficient Convolutional Vision Transformer (ECViT), a hybrid architecture that effectively combines the strengths of CNNs and Transformers.
arXiv Detail & Related papers (2025-04-21T03:00:17Z) - AdaptoVision: A Multi-Resolution Image Recognition Model for Robust and Scalable Classification [0.0]
AdaptoVision is a novel convolutional neural network (CNN) architecture designed to efficiently balance computational complexity and classification accuracy. By leveraging enhanced residual units, depth-wise separable convolutions, and hierarchical skip connections, AdaptoVision significantly reduces parameter count and computational requirements. It achieves state-of-the-art results on the BreakHis dataset and comparable accuracy elsewhere, notably 95.3% on CIFAR-10 and 85.77% on CIFAR-100, without relying on any pretrained weights.
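Depth-wise separable convolutions, which the AdaptoVision summary credits for its parameter savings, factor a standard convolution into a per-channel spatial filter plus a 1x1 pointwise mix. A minimal parameter-count comparison (the layer sizes below are arbitrary examples, not AdaptoVision's actual configuration):

```python
def conv_params(c_in, c_out, k):
    # standard convolution: every output channel filters every input channel
    return c_in * c_out * k * k

def dw_separable_params(c_in, c_out, k):
    # depthwise k x k filter per input channel, then 1x1 pointwise mixing
    return c_in * k * k + c_in * c_out

std = conv_params(64, 128, 3)          # 64 * 128 * 9  = 73728
sep = dw_separable_params(64, 128, 3)  # 64 * 9 + 64 * 128 = 8768
```

For this layer the separable form uses roughly 8x fewer parameters, which is the kind of reduction such architectures rely on.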
arXiv Detail & Related papers (2025-04-17T05:23:07Z) - BHViT: Binarized Hybrid Vision Transformer [53.38894971164072]
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNNs). We propose BHViT, a binarization-friendly hybrid ViT architecture, and its fully binarized model, guided by three important observations. Our proposed algorithm achieves SOTA performance among binary ViT methods.
arXiv Detail & Related papers (2025-03-04T08:35:01Z) - ContextFormer: Redefining Efficiency in Semantic Segmentation [48.81126061219231]
Convolutional methods, although capturing local dependencies well, struggle with long-range relationships. Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands. We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation.
arXiv Detail & Related papers (2025-01-31T16:11:04Z) - CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting a decoupled learning strategy and fully exploiting the complementarity across features, our method achieves both high efficiency and accuracy.
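Linear attention, which the CARE summary builds on, avoids the quadratic N x N attention map by applying a positive feature map to queries and keys and re-associating the matrix products. The sketch below is a generic kernelized linear attention (the elu+1-style feature map is a common choice, not necessarily CARE's), shown alongside the equivalent quadratic form:

```python
import numpy as np

def phi(x):
    # simple positive feature map (elu(x) + 1), a common choice
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # O(N * d^2): associate (phi(K)^T V) first, never forming the N x N map
    kv = phi(K).T @ V                  # (d, d_v)
    z = phi(Q) @ phi(K).sum(axis=0)    # (N,) normalizer
    return (phi(Q) @ kv) / z[:, None]

def quadratic_attention(Q, K, V):
    # O(N^2 * d): same kernelized attention with the explicit N x N map
    A = phi(Q) @ phi(K).T
    return (A / A.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
out_lin = linear_attention(Q, K, V)
out_quad = quadratic_attention(Q, K, V)  # numerically identical result
```

The two forms compute the same output; only the order of matrix multiplications, and hence the complexity in sequence length N, differs.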
arXiv Detail & Related papers (2024-11-25T07:56:13Z) - TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition [63.93802691275012]
We propose a lightweight Dual Dynamic Token Mixer (D-Mixer) to simultaneously learn global and local dynamics. We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network. On ImageNet-1K classification, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half the computational cost.
arXiv Detail & Related papers (2023-10-30T09:35:56Z) - Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z) - Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z) - A Data-scalable Transformer for Medical Image Segmentation: Architecture, Model Efficiency, and Benchmark [45.543140413399506]
MedFormer is a data-scalable Transformer designed for generalizable 3D medical image segmentation.
Our approach incorporates three key elements: a desirable inductive bias, hierarchical modeling with linear-complexity attention, and multi-scale feature fusion.
arXiv Detail & Related papers (2022-02-28T22:59:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.