CAViT -- Channel-Aware Vision Transformer for Dynamic Feature Fusion
- URL: http://arxiv.org/abs/2602.05598v1
- Date: Thu, 05 Feb 2026 12:33:09 GMT
- Title: CAViT -- Channel-Aware Vision Transformer for Dynamic Feature Fusion
- Authors: Aon Safdar, Mohamed Saadeldin
- Abstract summary: Vision Transformers (ViTs) have demonstrated strong performance across a range of computer vision tasks by modeling long-range interactions via self-attention. We introduce 'CAViT', a dual-attention architecture that replaces the static MLP with a dynamic, attention-based mechanism for feature interaction. We validate CAViT across five benchmark datasets spanning both natural and medical domains, where it outperforms the standard ViT baseline by up to +3.6% in accuracy, while reducing FLOPs by over 30%.
- Score: 0.3683202928838613
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) have demonstrated strong performance across a range of computer vision tasks by modeling long-range spatial interactions via self-attention. However, channel-wise mixing in ViTs remains static, relying on fixed multilayer perceptrons (MLPs) that lack adaptability to input content. We introduce 'CAViT', a dual-attention architecture that replaces the static MLP with a dynamic, attention-based mechanism for feature interaction. Each Transformer block in CAViT performs spatial self-attention followed by channel-wise self-attention, allowing the model to dynamically recalibrate feature representations based on global image context. This unified and content-aware token mixing strategy enhances representational expressiveness without increasing depth or complexity. We validate CAViT across five benchmark datasets spanning both natural and medical domains, where it outperforms the standard ViT baseline by up to +3.6% in accuracy, while reducing parameter count and FLOPs by over 30%. Qualitative attention maps reveal sharper and semantically meaningful activation patterns, validating the effectiveness of our attention-driven token mixing.
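The abstract describes each block as spatial self-attention followed by channel-wise self-attention in place of the usual MLP. As a rough illustration only, the following PyTorch sketch shows one way such a dual-attention block could be wired; the module names, dimensions, the fixed token count, and the use of nn.MultiheadAttention are assumptions for the example, not the authors' implementation.

```python
# Minimal sketch of a dual-attention block in the spirit of CAViT:
# spatial self-attention over tokens, then channel-wise self-attention
# obtained by transposing the token and channel axes. Illustrative
# reconstruction from the abstract, not the authors' code.
import torch
import torch.nn as nn


class DualAttentionBlock(nn.Module):
    def __init__(self, dim: int = 192, num_heads: int = 3, channel_heads: int = 1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Spatial self-attention: tokens attend to tokens (sequence length = N patches).
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Channel-wise self-attention: channels attend to channels. After transposing,
        # the "sequence" has length C and each element is an N-dimensional descriptor.
        # A fixed token count is assumed here (197 = 14x14 patches + CLS for 224x224, patch 16).
        self.num_tokens = 197
        self.channel_attn = nn.MultiheadAttention(self.num_tokens, channel_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) patch tokens
        h = self.norm1(x)
        x = x + self.spatial_attn(h, h, h, need_weights=False)[0]

        h = self.norm2(x).transpose(1, 2)   # (B, C, N): channels become the sequence
        h = self.channel_attn(h, h, h, need_weights=False)[0]
        x = x + h.transpose(1, 2)           # back to (B, N, C)
        return x


if __name__ == "__main__":
    block = DualAttentionBlock()
    tokens = torch.randn(2, 197, 192)       # batch of 2, 197 tokens, 192 channels
    print(block(tokens).shape)              # torch.Size([2, 197, 192])
```

The key point of the sketch is the transpose: after swapping the token and channel axes, the same attention machinery mixes channels conditioned on the whole image, which is the dynamic channel mixing the abstract contrasts with a fixed MLP.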
Related papers
- Feature Complementation Architecture for Visual Place Recognition [19.779780157790423]
Visual place recognition (VPR) plays a crucial role in robotic localization and navigation.
Existing methods typically adopt convolutional neural networks (CNNs) or vision Transformers (ViTs) as feature extractors.
We propose a local-global feature complementation network (LGCN) for VPR, which integrates a parallel CNN-ViT hybrid architecture with a dynamic feature fusion module (DFM).
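This entry names a parallel CNN-ViT hybrid joined by a dynamic feature fusion module (DFM) but does not describe the fusion itself. The sketch below shows a generic content-dependent gated fusion of two feature maps; it only illustrates what dynamic fusion commonly means and is not the LGCN/DFM design (GatedFusion and its gate are hypothetical names).

```python
# Generic content-dependent fusion of a CNN branch and a ViT branch:
# a gate predicted from both features weights their combination per position.
# Purely illustrative; the paper's DFM is not described in this summary.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * dim, dim, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, cnn_feat: torch.Tensor, vit_feat: torch.Tensor) -> torch.Tensor:
        # cnn_feat, vit_feat: (B, C, H, W) features from the two branches,
        # assumed already projected to the same shape.
        g = self.gate(torch.cat([cnn_feat, vit_feat], dim=1))  # (B, C, H, W) in [0, 1]
        return g * cnn_feat + (1.0 - g) * vit_feat


if __name__ == "__main__":
    fusion = GatedFusion(dim=256)
    a, b = torch.randn(2, 256, 28, 28), torch.randn(2, 256, 28, 28)
    print(fusion(a, b).shape)  # torch.Size([2, 256, 28, 28])
```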
arXiv Detail & Related papers (2025-06-14T08:32:55Z) - FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation [14.903360987684483]
We propose FEAT, a full-dimensional efficient attention Transformer for high-quality dynamic medical videos.
We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance.
arXiv Detail & Related papers (2025-06-05T12:31:02Z) - AdaNCA: Neural Cellular Automata As Adaptors For More Robust Vision Transformer [27.921949273217468]
Vision Transformers (ViTs) demonstrate remarkable performance in image classification through visual-token interaction learning.
We propose Adaptor Neural Cellular Automata (AdaNCA) for Vision Transformers, which uses NCA as plug-and-play adaptors between ViT layers.
With less than a 3% increase in parameters, AdaNCA contributes to more than 10% absolute improvement in accuracy under adversarial attacks.
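The summary states that NCA are inserted as plug-and-play adaptors between ViT layers. For illustration only, here is a generic Neural Cellular Automata update applied to a grid of patch tokens in PyTorch; the perception filters, update rule, step count, and placement are assumptions and do not reproduce the AdaNCA design.

```python
# Generic Neural Cellular Automata (NCA) update over a grid of ViT tokens,
# in the spirit of an NCA adaptor between layers: perceive via depthwise
# convolution, update via 1x1 convs, stochastic per-cell update, residual add.
import torch
import torch.nn as nn


class NCAAdaptor(nn.Module):
    def __init__(self, dim: int = 192, fire_rate: float = 0.5, steps: int = 2):
        super().__init__()
        self.fire_rate = fire_rate
        self.steps = steps
        # "Perceive": each cell gathers information from its 3x3 neighborhood.
        self.perceive = nn.Conv2d(dim, dim * 3, kernel_size=3, padding=1, groups=dim)
        # "Update": a small per-cell network producing a residual update.
        self.update = nn.Sequential(
            nn.Conv2d(dim * 3, dim * 2, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(dim * 2, dim, kernel_size=1),
        )

    def forward(self, tokens: torch.Tensor, grid_hw: tuple) -> torch.Tensor:
        # tokens: (B, N, C) patch tokens (no CLS token), N = H * W
        B, N, C = tokens.shape
        H, W = grid_hw
        x = tokens.transpose(1, 2).reshape(B, C, H, W)
        for _ in range(self.steps):
            dx = self.update(self.perceive(x))
            # Stochastic update: each cell fires independently with prob. fire_rate.
            mask = (torch.rand(B, 1, H, W, device=x.device) < self.fire_rate).to(x.dtype)
            x = x + dx * mask
        return x.reshape(B, C, N).transpose(1, 2)


if __name__ == "__main__":
    adaptor = NCAAdaptor(dim=192)
    out = adaptor(torch.randn(2, 196, 192), grid_hw=(14, 14))
    print(out.shape)  # torch.Size([2, 196, 192])
```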
arXiv Detail & Related papers (2024-06-12T14:59:12Z) - Accelerating Vision Transformers Based on Heterogeneous Attention Patterns [89.86293867174324]
Vision Transformers (ViTs) have attracted a lot of attention in the field of computer vision.
We propose an integrated compression pipeline based on observed heterogeneous attention patterns across layers.
Experimentally, the integrated compression pipeline of DGSSA and GLAD can accelerate run-time throughput by up to 121%.
arXiv Detail & Related papers (2023-10-11T17:09:19Z) - DAT++: Spatially Dynamic Vision Transformer with Deformable Attention [87.41016963608067]
We present the Deformable Attention Transformer (DAT++), an efficient and effective vision backbone for visual recognition.
DAT++ achieves state-of-the-art results on various visual recognition benchmarks: 85.9% accuracy on ImageNet, 54.5 and 47.0 mAP on MS-COCO instance segmentation, and 51.5 mIoU on ADE20K semantic segmentation.
arXiv Detail & Related papers (2023-09-04T08:26:47Z) - A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With less than 14M parameters, our FCViT-S12 outperforms the related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
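A "query-irrelevant global context" can be read as attention whose weights depend only on the input positions rather than on per-query scores, so one shared context vector is aggregated and redistributed to every location. The sketch below illustrates that general idea in PyTorch; it is not the FCViT implementation, and GlobalContextMixer is a hypothetical name.

```python
# Generic "query-irrelevant" global context: attention weights are computed
# from the input alone (no per-query scores), yielding one global context
# vector that is transformed and added back to every position.
import torch
import torch.nn as nn


class GlobalContextMixer(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.weight = nn.Conv2d(dim, 1, kernel_size=1)  # per-position attention logit
        self.transform = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map
        B, C, H, W = x.shape
        attn = self.weight(x).flatten(2).softmax(dim=-1)          # (B, 1, H*W)
        context = torch.bmm(attn, x.flatten(2).transpose(1, 2))   # (B, 1, C)
        context = context.transpose(1, 2).unsqueeze(-1)           # (B, C, 1, 1)
        return x + self.transform(context)                        # broadcast to all positions


if __name__ == "__main__":
    mixer = GlobalContextMixer(dim=64)
    print(mixer(torch.randn(2, 64, 14, 14)).shape)  # torch.Size([2, 64, 14, 14])
```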
arXiv Detail & Related papers (2022-12-23T19:13:43Z) - Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when trained from scratch on small datasets.
We propose the Dynamic Hybrid Vision Transformer (DHVT) as a solution that enhances the two inductive biases.
DHVT achieves state-of-the-art performance with lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z) - Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study the properties underlying this performance via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)