ReGLA: Efficient Receptive-Field Modeling with Gated Linear Attention Network
- URL: http://arxiv.org/abs/2602.05262v1
- Date: Thu, 05 Feb 2026 03:43:29 GMT
- Title: ReGLA: Efficient Receptive-Field Modeling with Gated Linear Attention Network
- Authors: Junzhou Li, Manqi Zhao, Yilin Gao, Zhiheng Yu, Yin Li, Dongsheng Jiang, Li Xiao
- Abstract summary: ReGLA integrates efficient convolutions for local feature extraction with ReLU-based gated linear attention for global modeling.
ReGLA outperforms similarly scaled iFormer models in downstream tasks, achieving gains of 3.1% AP on object detection and 3.6% mIoU on ADE20K semantic segmentation.
- Score: 14.912003445763688
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Balancing accuracy and latency on high-resolution images is a critical challenge for lightweight models, particularly for Transformer-based architectures that often suffer from excessive latency. To address this issue, we introduce \textbf{ReGLA}, a series of lightweight hybrid networks, which integrates efficient convolutions for local feature extraction with ReLU-based gated linear attention for global modeling. The design incorporates three key innovations: the Efficient Large Receptive Field (ELRF) module for enhancing convolutional efficiency while preserving a large receptive field; the ReLU Gated Modulated Attention (RGMA) module for maintaining linear complexity while enhancing local feature representation; and a multi-teacher distillation strategy to boost performance on downstream tasks. Extensive experiments validate the superiority of ReGLA; particularly the ReGLA-M achieves \textbf{80.85\%} Top-1 accuracy on ImageNet-1K at $224px$, with only \textbf{4.98 ms} latency at $512px$. Furthermore, ReGLA outperforms similarly scaled iFormer models in downstream tasks, achieving gains of \textbf{3.1\%} AP on COCO object detection and \textbf{3.6\%} mIoU on ADE20K semantic segmentation, establishing it as a state-of-the-art solution for high-resolution visual applications.
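The core idea behind the paper's ReLU-based gated linear attention can be illustrated with a minimal sketch. This is a generic ReLU linear-attention formulation with a sigmoid output gate, not the authors' exact RGMA module; all function and variable names are illustrative:

```python
import numpy as np

def relu_gated_linear_attention(q, k, v, gate):
    """Generic ReLU linear attention with an output gate (illustrative).

    Softmax attention costs O(N^2 * d). Replacing the softmax with a ReLU
    feature map lets the matmuls be reordered as phi(Q) (phi(K)^T V),
    which costs O(N * d^2) -- linear in the number of tokens N.
    """
    phi_q = np.maximum(q, 0.0)            # ReLU feature map on queries
    phi_k = np.maximum(k, 0.0)            # ReLU feature map on keys
    kv = phi_k.T @ v                      # (d, d) summary, independent of N
    z = phi_k.sum(axis=0)                 # normalizer accumulator, shape (d,)
    out = (phi_q @ kv) / ((phi_q @ z) + 1e-6)[:, None]  # normalized linear attention
    return out * (1.0 / (1.0 + np.exp(-gate)))          # sigmoid gate modulates output

# Toy usage: N=8 tokens, d=4 channels
rng = np.random.default_rng(0)
N, d = 8, 4
q, k, v, g = (rng.standard_normal((N, d)) for _ in range(4))
y = relu_gated_linear_attention(q, k, v, g)
print(y.shape)  # (8, 4)
```

Because the (d, d) key-value summary does not grow with the token count, this style of attention keeps latency flat as image resolution increases, which is what makes it attractive for the 512px setting quoted above.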
Related papers
- A Lightweight Convolution and Vision Transformer integrated model with Multi-scale Self-attention Mechanism [41.02402160100821]
Vision Transformer (ViT) has prevailed in computer vision tasks due to its strong long-range dependency modelling ability.
We propose an efficient ViT model with sparse attention (dubbed SAEViT) and convolution blocks.
Experiments on mainstream datasets show that SAEViT achieves Top-1 accuracies of 76.3% and 79.6% on the ImageNet-1K classification task.
arXiv Detail & Related papers (2025-08-23T03:05:34Z)
- Residual Prior-driven Frequency-aware Network for Image Fusion [6.90874640835234]
Image fusion aims to integrate complementary information across modalities to generate high-quality fused images.
We propose a Residual Prior-driven Frequency-aware Network, termed RPFNet.
arXiv Detail & Related papers (2025-07-09T10:48:00Z)
- LoLA-SpecViT: Local Attention SwiGLU Vision Transformer with LoRA for Hyperspectral Imaging [6.360399841791849]
We propose LoLA-SpecViT (Low-rank adaptation Local Attention Spectral Vision Transformer), a lightweight spectral vision transformer.
Our model combines a 3D convolutional spectral front-end with local window-based self-attention, enhancing both spectral feature extraction and spatial consistency.
Our framework provides a scalable and generalizable solution for real-world HSI applications in agriculture, environmental monitoring, and remote sensing analytics.
arXiv Detail & Related papers (2025-06-21T16:46:00Z)
- LGM-Pose: A Lightweight Global Modeling Network for Real-time Human Pose Estimation [14.20517652381698]
A single-branch lightweight global modeling network (LGM-Pose) is proposed to address these challenges.
In the network, a lightweight MobileViM Block is designed with a proposed Lightweight Attentional Representation Module (LARM).
arXiv Detail & Related papers (2025-06-05T02:29:04Z)
- Breaking Complexity Barriers: High-Resolution Image Restoration with Rank Enhanced Linear Attention [54.42902794496325]
Linear attention, a variant of softmax attention, demonstrates promise in global context modeling.
We propose Rank Enhanced Linear Attention (RELA), a simple yet effective method that enriches feature representations by integrating a lightweight depthwise convolution.
Building upon RELA, we propose an efficient and effective image restoration Transformer, named LAformer.
arXiv Detail & Related papers (2025-05-22T02:57:23Z)
- VRS-UIE: Value-Driven Reordering Scanning for Underwater Image Enhancement [104.78586859995333]
State Space Models (SSMs) have emerged as a promising backbone for vision tasks due to their linear complexity and global receptive field.
However, the predominance of large, homogeneous, but uninformative oceanic backgrounds can dilute the feature responses of sparse yet valuable targets.
We propose a novel Value-Driven Reordering Scanning framework for Underwater Image Enhancement (UIE).
Our framework sets a new state of the art, delivering superior enhancement performance (surpassing WMamba by 0.89 dB on average) by effectively suppressing water bias and preserving structural and color fidelity.
arXiv Detail & Related papers (2025-05-02T12:21:44Z)
- Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation [158.37640586809187]
Restoring any degraded image efficiently with just one model has become increasingly significant.
Our approach, termed AnyIR, takes a unified path that leverages the inherent similarity across various degradations.
To fuse degradation awareness with contextualized attention, a spatial-frequency parallel fusion strategy is proposed.
arXiv Detail & Related papers (2025-04-19T09:54:46Z)
- Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR).
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z)
- Low-Resolution Self-Attention for Semantic Segmentation [93.30597515880079]
We introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost.
Our approach involves computing self-attention in a fixed low-resolution space regardless of the input image's resolution.
We demonstrate the effectiveness of our LRSA approach by building LRFormer, a vision transformer with an encoder-decoder structure.
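The fixed low-resolution attention idea above can be sketched as a generic downsample-attend-upsample pipeline. This is not the authors' implementation; the pooling grid size and all names are illustrative, and the input dimensions are assumed divisible by the grid size:

```python
import numpy as np

def low_res_self_attention(x, pool=16):
    """Self-attention computed in a fixed low-resolution space (illustrative).

    x: feature map of shape (H, W, C). The map is average-pooled to a
    (pool, pool) grid, so attention cost is constant regardless of H and W.
    Assumes H and W are divisible by pool.
    """
    H, W, C = x.shape
    # Average-pool each (H//pool, W//pool) block down to one token.
    xs = x.reshape(pool, H // pool, pool, W // pool, C).mean(axis=(1, 3))
    tokens = xs.reshape(pool * pool, C)             # (pool^2, C) tokens
    attn = tokens @ tokens.T / np.sqrt(C)           # scaled dot-product scores
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)        # softmax over tokens
    out = (attn @ tokens).reshape(pool, pool, C)
    # Nearest-neighbor upsample back to the input resolution.
    return np.repeat(np.repeat(out, H // pool, axis=0), W // pool, axis=1)

x = np.random.default_rng(1).standard_normal((64, 64, 8))
y = low_res_self_attention(x, pool=16)
print(y.shape)  # (64, 64, 8)
```

Since the attention always operates on pool*pool tokens, the quadratic term in the cost is fixed, and only the pooling and upsampling scale with the input resolution.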
arXiv Detail & Related papers (2023-10-08T06:10:09Z)
- FasterPose: A Faster Simple Baseline for Human Pose Estimation [65.8413964785972]
We propose a design paradigm for cost-effective network with LR representation for efficient pose estimation, named FasterPose.
We study the training behavior of FasterPose, and formulate a novel regressive cross-entropy (RCE) loss function for accelerating the convergence.
Compared with the previously dominant pose estimation networks, our method reduces FLOPs by 58% while simultaneously improving accuracy by 1.3%.
arXiv Detail & Related papers (2021-07-07T13:39:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.