ReGLA: Efficient Receptive-Field Modeling with Gated Linear Attention Network
- URL: http://arxiv.org/abs/2602.05262v1
- Date: Thu, 05 Feb 2026 03:43:29 GMT
- Title: ReGLA: Efficient Receptive-Field Modeling with Gated Linear Attention Network
- Authors: Junzhou Li, Manqi Zhao, Yilin Gao, Zhiheng Yu, Yin Li, Dongsheng Jiang, Li Xiao
- Abstract summary: ReGLA integrates efficient convolutions for local feature extraction with ReLU-based gated linear attention for global modeling.
ReGLA outperforms similarly scaled iFormer models in downstream tasks, achieving gains of 3.1% AP on object detection and 3.6% mIoU on ADE20K semantic segmentation.
- Score: 14.912003445763688
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Balancing accuracy and latency on high-resolution images is a critical challenge for lightweight models, particularly for Transformer-based architectures that often suffer from excessive latency. To address this issue, we introduce \textbf{ReGLA}, a series of lightweight hybrid networks, which integrates efficient convolutions for local feature extraction with ReLU-based gated linear attention for global modeling. The design incorporates three key innovations: the Efficient Large Receptive Field (ELRF) module for enhancing convolutional efficiency while preserving a large receptive field; the ReLU Gated Modulated Attention (RGMA) module for maintaining linear complexity while enhancing local feature representation; and a multi-teacher distillation strategy to boost performance on downstream tasks. Extensive experiments validate the superiority of ReGLA; particularly the ReGLA-M achieves \textbf{80.85\%} Top-1 accuracy on ImageNet-1K at $224px$, with only \textbf{4.98 ms} latency at $512px$. Furthermore, ReGLA outperforms similarly scaled iFormer models in downstream tasks, achieving gains of \textbf{3.1\%} AP on COCO object detection and \textbf{3.6\%} mIoU on ADE20K semantic segmentation, establishing it as a state-of-the-art solution for high-resolution visual applications.
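The core idea behind the paper's ReLU-based gated linear attention can be illustrated with a minimal sketch. This is a generic ReLU linear-attention formulation with a sigmoid output gate, not the authors' exact RGMA module; all function and variable names are illustrative:

```python
import numpy as np

def relu_gated_linear_attention(q, k, v, gate):
    """Generic ReLU linear attention with an output gate (illustrative).

    Softmax attention costs O(N^2 * d). Replacing the softmax with a ReLU
    feature map lets the matmuls be reordered as phi(Q) (phi(K)^T V),
    which costs O(N * d^2) -- linear in the number of tokens N.
    """
    phi_q = np.maximum(q, 0.0)            # ReLU feature map on queries
    phi_k = np.maximum(k, 0.0)            # ReLU feature map on keys
    kv = phi_k.T @ v                      # (d, d) summary, independent of N
    z = phi_k.sum(axis=0)                 # normalizer accumulator, shape (d,)
    out = (phi_q @ kv) / ((phi_q @ z) + 1e-6)[:, None]  # normalized linear attention
    return out * (1.0 / (1.0 + np.exp(-gate)))          # sigmoid gate modulates output

# Toy usage: N=8 tokens, d=4 channels
rng = np.random.default_rng(0)
N, d = 8, 4
q, k, v, g = (rng.standard_normal((N, d)) for _ in range(4))
y = relu_gated_linear_attention(q, k, v, g)
print(y.shape)  # (8, 4)
```

Because the (d, d) key-value summary does not grow with the token count, this style of attention keeps latency flat as image resolution increases, which is what makes it attractive for the 512px setting quoted above.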
Related papers
- A Lightweight Convolution and Vision Transformer integrated model with Multi-scale Self-attention Mechanism [41.02402160100821]
Vision Transformer (ViT) has prevailed in computer vision tasks due to its strong long-range dependency modelling ability.
We propose an efficient ViT model with sparse attention (dubbed SAEViT) and convolution blocks.
Experiments on mainstream datasets show that SAEViT achieves Top-1 accuracies of 76.3% and 79.6% on the ImageNet-1K classification task.
arXiv Detail & Related papers (2025-08-23T03:05:34Z)
- Residual Prior-driven Frequency-aware Network for Image Fusion [6.90874640835234]
Image fusion aims to integrate complementary information across modalities to generate high-quality fused images.
We propose a Residual Prior-driven Frequency-aware Network, termed RPFNet.
arXiv Detail & Related papers (2025-07-09T10:48:00Z)
- LoLA-SpecViT: Local Attention SwiGLU Vision Transformer with LoRA for Hyperspectral Imaging [6.360399841791849]
We propose LoLA-SpecViT (Low-rank adaptation Local Attention Spectral Vision Transformer), a lightweight spectral vision transformer.
Our model combines a 3D convolutional spectral front-end with local window-based self-attention, enhancing both spectral feature extraction and spatial consistency.
Our framework provides a scalable and generalizable solution for real-world HSI applications in agriculture, environmental monitoring, and remote sensing analytics.
arXiv Detail & Related papers (2025-06-21T16:46:00Z)
- LGM-Pose: A Lightweight Global Modeling Network for Real-time Human Pose Estimation [14.20517652381698]
A single-branch lightweight global modeling network (LGM-Pose) is proposed to address these challenges.
In the network, a lightweight MobileViM Block is designed with a proposed Lightweight Attentional Representation Module (LARM).
arXiv Detail & Related papers (2025-06-05T02:29:04Z)
- Breaking Complexity Barriers: High-Resolution Image Restoration with Rank Enhanced Linear Attention [54.42902794496325]
Linear attention, a variant of softmax attention, demonstrates promise in global context modeling.
We propose Rank Enhanced Linear Attention (RELA), a simple yet effective method that enriches feature representations by integrating a lightweight depthwise convolution.
Building upon RELA, we propose an efficient and effective image restoration Transformer, named LAformer.
arXiv Detail & Related papers (2025-05-22T02:57:23Z)
- VRS-UIE: Value-Driven Reordering Scanning for Underwater Image Enhancement [104.78586859995333]
State Space Models (SSMs) have emerged as a promising backbone for vision tasks due to their linear complexity and global receptive field.
However, the predominance of large, homogeneous, but uninformative oceanic backgrounds can dilute the feature responses of sparse yet valuable targets.
We propose a novel Value-Driven Reordering Scanning framework for Underwater Image Enhancement (UIE).
Our framework sets a new state of the art, delivering superior enhancement performance (surpassing WMamba by 0.89 dB on average) by effectively suppressing water bias and preserving structural and color fidelity.
arXiv Detail & Related papers (2025-05-02T12:21:44Z)
- Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation [158.37640586809187]
Restoring any degraded image efficiently with just one model has become increasingly significant.
Our approach, termed AnyIR, takes a unified path that leverages the inherent similarity across various degradations.
To fuse degradation awareness with contextualized attention, a spatial-frequency parallel fusion strategy is proposed.
arXiv Detail & Related papers (2025-04-19T09:54:46Z)
- Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR).
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z)
- Low-Resolution Self-Attention for Semantic Segmentation [93.30597515880079]
We introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost.
Our approach involves computing self-attention in a fixed low-resolution space regardless of the input image's resolution.
We demonstrate the effectiveness of our LRSA approach by building LRFormer, a vision transformer with an encoder-decoder structure.
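The fixed low-resolution attention idea above can be sketched as a generic downsample-attend-upsample pipeline. This is not the authors' implementation; the pooling grid size and all names are illustrative, and the input dimensions are assumed divisible by the grid size:

```python
import numpy as np

def low_res_self_attention(x, pool=16):
    """Self-attention computed in a fixed low-resolution space (illustrative).

    x: feature map of shape (H, W, C). The map is average-pooled to a
    (pool, pool) grid, so attention cost is constant regardless of H and W.
    Assumes H and W are divisible by pool.
    """
    H, W, C = x.shape
    # Average-pool each (H//pool, W//pool) block down to one token.
    xs = x.reshape(pool, H // pool, pool, W // pool, C).mean(axis=(1, 3))
    tokens = xs.reshape(pool * pool, C)             # (pool^2, C) tokens
    attn = tokens @ tokens.T / np.sqrt(C)           # scaled dot-product scores
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)        # softmax over tokens
    out = (attn @ tokens).reshape(pool, pool, C)
    # Nearest-neighbor upsample back to the input resolution.
    return np.repeat(np.repeat(out, H // pool, axis=0), W // pool, axis=1)

x = np.random.default_rng(1).standard_normal((64, 64, 8))
y = low_res_self_attention(x, pool=16)
print(y.shape)  # (64, 64, 8)
```

Since the attention always operates on pool*pool tokens, the quadratic term in the cost is fixed, and only the pooling and upsampling scale with the input resolution.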
arXiv Detail & Related papers (2023-10-08T06:10:09Z)
- FasterPose: A Faster Simple Baseline for Human Pose Estimation [65.8413964785972]
We propose a design paradigm for cost-effective network with LR representation for efficient pose estimation, named FasterPose.
We study the training behavior of FasterPose, and formulate a novel regressive cross-entropy (RCE) loss function for accelerating the convergence.
Compared with the previously dominant pose estimation networks, our method reduces FLOPs by 58% while simultaneously improving accuracy by 1.3%.
arXiv Detail & Related papers (2021-07-07T13:39:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.