A Lightweight Convolution and Vision Transformer integrated model with Multi-scale Self-attention Mechanism
- URL: http://arxiv.org/abs/2508.16884v2
- Date: Thu, 11 Sep 2025 14:34:08 GMT
- Title: A Lightweight Convolution and Vision Transformer integrated model with Multi-scale Self-attention Mechanism
- Authors: Yi Zhang, Lingxiao Wei, Bowei Zhang, Ziwei Liu, Kai Yi, Shu Hu
- Abstract summary: Vision Transformer (ViT) has prevailed in computer vision tasks due to its strong long-range dependency modeling ability.
We propose an efficient ViT model with sparse attention (dubbed SAEViT) and convolution blocks.
Experiments on mainstream datasets show that SAEViT achieves Top-1 accuracies of 76.3% and 79.6% on the ImageNet-1K classification task.
- Score: 41.02402160100821
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer (ViT) has prevailed in computer vision tasks due to its strong long-range dependency modeling ability. However, its large model size and weak local feature modeling ability hinder its application in real scenarios. To balance computational efficiency and performance in downstream vision tasks, we propose an efficient ViT model with sparse attention (dubbed SAEViT) and convolution blocks. Specifically, a Sparsely Aggregated Attention (SAA) module is proposed to perform adaptive sparse sampling and recover the feature map via a deconvolution operation, which significantly reduces the computational complexity of attention operations. In addition, a Channel-Interactive Feed-Forward Network (CIFFN) layer is developed to enhance inter-channel information exchange through feature decomposition and redistribution, which mitigates the redundancy in traditional feed-forward networks (FFNs). Finally, a hierarchical pyramid structure with embedded depth-wise separable convolutional blocks (DWSConv) is devised to further strengthen convolutional features. Extensive experiments on mainstream datasets show that SAEViT achieves Top-1 accuracies of 76.3% and 79.6% on the ImageNet-1K classification task with only 0.8 GFLOPs and 1.3 GFLOPs, respectively, demonstrating a lightweight solution for fundamental vision tasks.
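The abstract names its components without giving code, so here is a minimal PyTorch sketch of how the SAA and CIFFN blocks might be realized. It assumes a strided depth-wise convolution for the sparse sampling step, a transposed convolution for recovery, and a channel shuffle for the "decomposition and redistribution" step; none of these choices are confirmed by the paper.

```python
# Hedged sketch of the SAA and CIFFN blocks described in the abstract. The
# sampling stride, kernel sizes, and channel-shuffle realization are
# assumptions inferred from the abstract, not the authors' actual design.
import torch
import torch.nn as nn

class SparselyAggregatedAttention(nn.Module):
    """Attend on a sparsely sampled feature map, then recover resolution by deconvolution."""
    def __init__(self, dim, num_heads=4, stride=2):
        super().__init__()
        # Strided depth-wise conv as a stand-in for "adaptive sparse sampling".
        self.sample = nn.Conv2d(dim, dim, kernel_size=stride, stride=stride, groups=dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Transposed conv ("deconvolution") restores the original spatial size.
        self.recover = nn.ConvTranspose2d(dim, dim, kernel_size=stride, stride=stride)

    def forward(self, x):                     # x: (B, C, H, W), H and W divisible by stride
        b, c, _, _ = x.shape
        s = self.sample(x)                    # (B, C, H/stride, W/stride)
        t = s.flatten(2).transpose(1, 2)      # (B, N', C) with N' = HW / stride^2
        a, _ = self.attn(t, t, t)             # quadratic cost now paid on far fewer tokens
        a = a.transpose(1, 2).reshape(b, c, *s.shape[2:])
        return x + self.recover(a)            # residual connection after recovery

class ChannelInteractiveFFN(nn.Module):
    """FFN whose hidden features are split into groups and shuffled back together."""
    def __init__(self, dim, expansion=4, groups=4):
        super().__init__()
        hidden = dim * expansion
        self.groups = groups
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):                     # x: (B, N, C)
        h = self.act(self.fc1(x))
        b, n, c = h.shape                     # decompose channels into groups ...
        h = h.view(b, n, self.groups, c // self.groups).transpose(2, 3).reshape(b, n, c)
        return self.fc2(h)                    # ... and redistribute before projecting back

x = torch.randn(2, 64, 32, 32)
print(SparselyAggregatedAttention(64)(x).shape)   # torch.Size([2, 64, 32, 32])
```

The point of the sketch is the complexity argument: attention runs on HW/stride^2 tokens, so its quadratic term shrinks by a factor of stride^4 relative to full-resolution attention.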
Related papers
- ReGLA: Efficient Receptive-Field Modeling with Gated Linear Attention Network [14.912003445763688]
ReGLA integrates efficient convolutions for local feature extraction with ReLU-based gated linear attention for global modeling.
ReGLA outperforms similarly scaled iFormer models in downstream tasks, achieving gains of 3.1% AP on object detection and 3.6% mIoU on ADE20K semantic segmentation; a minimal sketch of the gated linear attention idea follows.
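For readers unfamiliar with the mechanism, here is a minimal, non-causal PyTorch sketch of ReLU-based gated linear attention; the sigmoid output gate and the normalization term are generic choices, not ReGLA's exact formulation.

```python
# Hedged sketch of ReLU-based gated linear attention (non-causal); the gate
# and normalizer below are generic choices, not ReGLA's actual design.
import torch
import torch.nn as nn

class ReLUGatedLinearAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.h = heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.gate = nn.Linear(dim, dim)                # learned output gate
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                              # x: (B, N, C)
        b, n, c = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, n, self.h, c // self.h)
        q = torch.relu(q).view(shape).transpose(1, 2)  # ReLU keeps scores nonnegative
        k = torch.relu(k).view(shape).transpose(1, 2)
        v = v.view(shape).transpose(1, 2)              # all: (B, H, N, d)
        kv = k.transpose(-2, -1) @ v                   # (B, H, d, d): linear in N
        z = q @ k.sum(dim=2, keepdim=True).transpose(-2, -1)  # (B, H, N, 1) normalizer
        out = (q @ kv) / (z + 1e-6)
        out = out.transpose(1, 2).reshape(b, n, c)
        return self.proj(out * torch.sigmoid(self.gate(x)))   # gate, then project
```

Because keys are aggregated into a d-by-d state before meeting queries, the cost grows linearly with sequence length instead of quadratically.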
arXiv Detail & Related papers (2026-02-05T03:43:29Z)
- MVNet: Hyperspectral Remote Sensing Image Classification Based on Hybrid Mamba-Transformer Vision Backbone Architecture [12.168520751389622]
Hyperspectral image (HSI) classification faces challenges such as high-dimensional data, limited training samples, and spectral redundancy.
This paper proposes a novel MVNet architecture that integrates 3D-CNN's local feature extraction, the Transformer's global modeling, and Mamba's linear sequence modeling capabilities.
On the IN, UP, and KSC datasets, MVNet outperforms mainstream hyperspectral image classification methods in both classification accuracy and computational efficiency.
arXiv Detail & Related papers (2025-07-06T14:52:26Z)
- ECViT: Efficient Convolutional Vision Transformer with Local-Attention and Multi-scale Stages [0.0]
Vision Transformers (ViTs) have revolutionized computer vision by leveraging self-attention to model long-range dependencies.
We propose the Efficient Convolutional Vision Transformer (ECViT), a hybrid architecture that effectively combines the strengths of CNNs and Transformers.
arXiv Detail & Related papers (2025-04-21T03:00:17Z)
- Vision-Centric Representation-Efficient Fine-Tuning for Robust Universal Foreground Segmentation [5.326302374594885]
Foreground segmentation is crucial for scene understanding, yet parameter-efficient fine-tuning (PEFT) of vision foundation models (VFMs) often fails in complex scenarios.
We propose Ladder Shape-bias Representation Side-tuning (LSR-ST), a lightweight PEFT framework that enhances model robustness by introducing shape-biased inductive priors.
arXiv Detail & Related papers (2025-04-20T04:12:38Z)
- LSNet: See Large, Focus Small [67.05569159984691]
We introduce LS (Large-Small) convolution, which combines large-kernel perception and small-kernel aggregation.
LSNet achieves superior performance and efficiency over existing lightweight networks in various vision tasks; a minimal sketch of the large-small pairing follows.
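A minimal PyTorch sketch of such a pairing; the kernel sizes, the sequential composition, and the residual are illustrative assumptions rather than LSNet's actual operator.

```python
# Hedged sketch of large-kernel "perception" followed by small-kernel
# "aggregation"; sizes and composition are assumptions, not LSNet's design.
import torch.nn as nn

class LargeSmallConv(nn.Module):
    def __init__(self, dim, large_k=7, small_k=3):
        super().__init__()
        # Depth-wise large kernel: cheap wide receptive field ("see large").
        self.perceive = nn.Conv2d(dim, dim, large_k, padding=large_k // 2, groups=dim)
        # Dense small kernel mixes channels locally ("focus small").
        self.aggregate = nn.Conv2d(dim, dim, small_k, padding=small_k // 2)

    def forward(self, x):                            # x: (B, C, H, W)
        return x + self.aggregate(self.perceive(x))  # residual keeps it drop-in
```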
arXiv Detail & Related papers (2025-03-29T16:00:54Z)
- BHViT: Binarized Hybrid Vision Transformer [53.38894971164072]
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNNs).
We propose BHViT, a binarization-friendly hybrid ViT architecture, and its fully binarized model, guided by three important observations.
Our proposed algorithm achieves SOTA performance among binary ViT methods; a generic sketch of weight binarization follows.
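As background on what binarization means operationally, here is a textbook 1-bit weight binarization with a straight-through estimator in PyTorch; it illustrates the general technique, not BHViT's specific scheme.

```python
# Generic 1-bit weight binarization with a straight-through estimator (STE);
# the per-tensor scale is a simplification of common per-filter variants.
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return w.sign() * w.abs().mean()   # weights collapse to {-a, +a}

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # STE: pass gradients through, but only where |w| <= 1.
        return grad_out * (w.abs() <= 1).to(grad_out.dtype)

w = torch.randn(8, 8, requires_grad=True)
BinarizeSTE.apply(w).sum().backward()      # gradients flow despite sign()
```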
arXiv Detail & Related papers (2025-03-04T08:35:01Z)
- LaVin-DiT: Large Vision Diffusion Transformer [99.98106406059333]
LaVin-DiT is a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework.
We introduce key innovations to optimize generative performance for vision tasks.
The model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks.
arXiv Detail & Related papers (2024-11-18T12:05:27Z)
- Unifying Dimensions: A Linear Adaptive Approach to Lightweight Image Super-Resolution [6.857919231112562]
Window-based transformers have demonstrated outstanding performance in super-resolution tasks.
They exhibit higher computational complexity and inference latency than convolutional neural networks.
We construct a convolution-based Transformer framework named the linear adaptive mixer network (LAMNet).
arXiv Detail & Related papers (2024-09-26T07:24:09Z)
- iiANET: Inception Inspired Attention Hybrid Network for efficient Long-Range Dependency [0.5497663232622965]
We present iiANET, an efficient hybrid visual backbone designed to improve the modeling of long-range dependencies.
The core innovation of iiANET is the iiABlock, a unified building block that integrates global r-MHSA (Multi-Head Self-Attention) and convolutional layers in parallel; a minimal sketch of such a parallel block follows.
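A minimal PyTorch sketch of attention and convolution branches running in parallel; the additive fusion and branch details are assumptions, not the actual iiABlock.

```python
# Hedged sketch of parallel global-attention and convolution branches fused
# by addition; fusion rule and branch details are assumptions, not iiABlock.
import torch
import torch.nn as nn

class ParallelAttnConv(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # global branch
        self.conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)        # local branch

    def forward(self, x):                            # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = self.norm(x.flatten(2).transpose(1, 2))  # (B, HW, C) tokens for attention
        a, _ = self.attn(t, t, t)
        a = a.transpose(1, 2).reshape(b, c, h, w)
        return x + a + self.conv(x)                  # branches run in parallel, summed to fuse
```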
arXiv Detail & Related papers (2024-07-10T12:39:02Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet real validation set; a minimal sketch of a pyramid-reduction stem follows.
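A minimal PyTorch sketch of a spatial-pyramid-reduction stem built from parallel dilated convolutions; the dilation rates and 1x1 fusion are illustrative assumptions, not ViTAE's exact module.

```python
# Hedged sketch of a spatial-pyramid-reduction stem: parallel dilated convs
# downsample while gathering multi-scale context; rates/fusion are assumptions.
import torch
import torch.nn as nn

class PyramidReduction(nn.Module):
    def __init__(self, in_ch, dim, rates=(1, 2, 3)):
        super().__init__()
        # Each branch halves the resolution; padding = dilation keeps sizes equal.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=r, dilation=r) for r in rates
        )
        self.fuse = nn.Conv2d(dim * len(rates), dim, 1)  # 1x1 conv merges the scales

    def forward(self, x):                                # x: (B, in_ch, H, W)
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

tokens = PyramidReduction(3, 64)(torch.randn(1, 3, 224, 224))
print(tokens.shape)                                      # torch.Size([1, 64, 112, 112])
```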
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.