Gated Differential Linear Attention: A Linear-Time Decoder for High-Fidelity Medical Segmentation
- URL: http://arxiv.org/abs/2603.02727v2
- Date: Thu, 05 Mar 2026 00:01:22 GMT
- Title: Gated Differential Linear Attention: A Linear-Time Decoder for High-Fidelity Medical Segmentation
- Authors: Hongbo Zheng, Afshin Bozorgpour, Dorit Merhof, Minjia Zhang,
- Abstract summary: PVT-GDLA is a decoder-centric Transformer that restores sharp, long-range dependencies in linear time. It achieves state-of-the-art accuracy across CT, MRI, ultrasound, and dermoscopy benchmarks under equal training budgets.
- Score: 15.30336007288786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical image segmentation requires models that preserve fine anatomical boundaries while remaining efficient for clinical deployment. While transformers capture long-range dependencies, they suffer from quadratic attention cost and large data requirements, whereas CNNs are compute-friendly yet struggle with global reasoning. Linear attention offers $\mathcal{O}(N)$ scaling, but often exhibits training instability and attention dilution, yielding diffuse maps. We introduce PVT-GDLA, a decoder-centric Transformer that restores sharp, long-range dependencies in linear time. Its core, Gated Differential Linear Attention (GDLA), computes two kernelized attention paths on complementary query/key subspaces and subtracts them with a learnable, channel-wise scale to cancel common-mode noise and amplify relevant context. A lightweight, head-specific gate injects nonlinearity and input-adaptive sparsity to mitigate attention sink, and a parallel local token-mixing branch with depthwise convolution strengthens neighboring-token interactions to improve boundary fidelity, all while retaining $\mathcal{O}(N)$ complexity and low parameter overhead. Coupled with a pretrained Pyramid Vision Transformer (PVT) encoder, PVT-GDLA achieves state-of-the-art accuracy across CT, MRI, ultrasound, and dermoscopy benchmarks under equal training budgets, with comparable parameters but lower FLOPs than CNN-, Transformer-, hybrid-, and linear-attention baselines. PVT-GDLA provides a practical path to fast, scalable, high-fidelity medical segmentation in clinical environments and other resource-constrained settings.
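The snippet below is a minimal PyTorch sketch of the GDLA idea as described in the abstract: two kernelized linear-attention paths on complementary query/key subspaces, subtracted with a learnable channel-wise scale, modulated by a lightweight gate, and combined with a depthwise-convolution local token-mixing branch. It is an illustrative sketch, not the authors' implementation; all module and parameter names (`GatedDifferentialLinearAttention`, `lam`, `gate`, `local`) are assumptions, and the paper's exact kernel, normalization, and subspace split may differ.

```python
# Hypothetical sketch of Gated Differential Linear Attention (GDLA), based on the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedDifferentialLinearAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        assert dim % (2 * heads) == 0
        self.heads = heads
        self.sub = dim // (2 * heads)          # per-head subspace size
        # One projection each for Q/K/V; Q and K are later split into two complementary subspaces.
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        # Learnable per-head scale for the subtracted (differential) path.
        self.lam = nn.Parameter(torch.full((heads, 1, 1), 0.5))
        # Lightweight gate for input-adaptive sparsity (assumed sigmoid gating).
        self.gate = nn.Linear(dim, heads * self.sub)
        # Parallel local token-mixing branch: depthwise conv along the token axis.
        self.local = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.proj = nn.Linear(heads * self.sub, dim)

    @staticmethod
    def _linear_attn(q, k, v, eps=1e-6):
        # Kernelized linear attention: contract keys with values first,
        # so the cost is O(N) in the number of tokens instead of O(N^2).
        q, k = F.elu(q) + 1, F.elu(k) + 1
        kv = torch.einsum("bhnd,bhne->bhde", k, v)
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
        return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

    def forward(self, x):                      # x: (B, N, C) tokens
        B, N, _ = x.shape
        h, d = self.heads, self.sub
        # Split Q/K/V into two complementary per-head subspaces.
        q = self.to_q(x).view(B, N, h, 2, d).permute(0, 2, 3, 1, 4)
        k = self.to_k(x).view(B, N, h, 2, d).permute(0, 2, 3, 1, 4)
        v = self.to_v(x).view(B, N, h, 2, d).permute(0, 2, 3, 1, 4)
        a1 = self._linear_attn(q[:, :, 0], k[:, :, 0], v[:, :, 0])
        a2 = self._linear_attn(q[:, :, 1], k[:, :, 1], v[:, :, 1])
        # Differential combination: subtract the second path to cancel common-mode noise.
        out = a1 - self.lam * a2               # (B, h, N, d)
        out = out.permute(0, 2, 1, 3).reshape(B, N, h * d)
        out = out * torch.sigmoid(self.gate(x))           # gated, input-adaptive sparsity
        out = self.proj(out)
        # Add the local depthwise-conv branch for neighboring-token interactions.
        return out + self.local(x.transpose(1, 2)).transpose(1, 2)


# Usage example on random decoder tokens.
tokens = torch.randn(2, 196, 64)
print(GatedDifferentialLinearAttention(dim=64, heads=4)(tokens).shape)  # torch.Size([2, 196, 64])
```

The key property illustrated is the reordering inside `_linear_attn`: keys and values are contracted before the queries are applied, so memory and compute grow linearly with the token count, which is what allows the differential and gated paths to be added without giving up the $\mathcal{O}(N)$ budget claimed in the abstract.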
Related papers
- Decentralized Attention Fails Centralized Signals: Rethinking Transformers for Medical Time Series [15.981619117274667]
Accurate analysis of medical time series (MedTS) data, such as electroencephalography (EEG) and electrocardiography (ECG), plays a pivotal role in healthcare applications. Recent advances in deep learning have leveraged Transformer-based models to capture temporal dependencies, yet these models can fall short on MedTS. This limitation stems from a structural mismatch: MedTS signals are inherently centralized, whereas the Transformer's attention mechanism is decentralized. We propose CoTAR, a centralization-based module designed to replace decentralized attention.
arXiv Detail & Related papers (2026-02-09T04:39:22Z) - Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping [61.459927600301654]
Multi-condition control is bottlenecked by the conventional "concatenate-and-attend" strategy. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. We propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies.
arXiv Detail & Related papers (2026-02-06T16:39:10Z) - LINA: Linear Autoregressive Image Generative Models with Continuous Tokens [56.80443965097921]
Autoregressive models with continuous tokens form a promising paradigm for visual generation, especially for text-to-image (T2I) synthesis. We study how to design compute-efficient linear attention within this framework. We present LINA, a simple and compute-efficient T2I model built entirely on linear attention, capable of generating high-fidelity 1024x1024 images from user instructions.
arXiv Detail & Related papers (2026-01-30T06:44:33Z) - MedLiteNet: Lightweight Hybrid Medical Image Segmentation Model [17.73370811236741]
We introduce MedLiteNet, a lightweight CNN-Transformer hybrid tailored for dermoscopic segmentation. The encoder stacks depth-wise Mobile Inverted Bottleneck blocks to curb computation, inserts a bottleneck-level cross-scale token-mixing unit to exchange information between resolutions, and embeds a boundary-aware self-attention module to sharpen lesion contours.
arXiv Detail & Related papers (2025-09-03T05:59:13Z) - U-R-VEDA: Integrating UNET, Residual Links, Edge and Dual Attention, and Vision Transformer for Accurate Semantic Segmentation of CMRs [0.0]
We propose a deep-learning-based enhanced UNet model, U-R-Veda, which integrates convolution transformations, vision transformer, residual links, channel attention, and spatial attention. The model significantly improves the semantic segmentation of cardiac magnetic resonance (CMR) images. Performance results show that U-R-Veda achieves an average accuracy of 95.2% based on the Dice similarity coefficient (DSC).
arXiv Detail & Related papers (2025-06-25T04:10:09Z) - Rethinking Transformer for Long Contextual Histopathology Whole Slide Image Analysis [9.090504201460817]
Histopathology Whole Slide Image (WSI) analysis serves as the gold standard for clinical cancer diagnosis in the daily routines of doctors.
Previous methods typically employ Multiple Instance Learning to enable slide-level prediction given only slide-level labels.
To alleviate the computational complexity of long sequences in large WSIs, methods like HIPT use region-slicing, and TransMIL employs approximation of full self-attention.
arXiv Detail & Related papers (2024-10-18T06:12:36Z) - Prototype Learning Guided Hybrid Network for Breast Tumor Segmentation in DCE-MRI [58.809276442508256]
We propose a hybrid network that combines convolutional neural network (CNN) and transformer layers.
The experimental results on private and public DCE-MRI datasets demonstrate that the proposed hybrid network achieves superior performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2024-08-11T15:46:00Z) - Unlocking Fine-Grained Details with Wavelet-based High-Frequency Enhancement in Transformers [4.208461204572879]
Medical image segmentation is a critical task that plays a vital role in diagnosis, treatment planning, and disease monitoring.
We address the local feature deficiency of the Transformer model by carefully re-designing the self-attention map.
We propose a multi-scale context enhancement block within skip connections to adaptively model inter-scale dependencies.
arXiv Detail & Related papers (2023-08-25T15:42:19Z) - UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BraTS, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z) - Fuzzy Attention Neural Network to Tackle Discontinuity in Airway Segmentation [67.19443246236048]
Airway segmentation is crucial for the examination, diagnosis, and prognosis of lung diseases.
Some small-sized airway branches (e.g., bronchus and terminal bronchioles) significantly aggravate the difficulty of automatic segmentation.
This paper presents an efficient method for airway segmentation, comprising a novel fuzzy attention neural network and a comprehensive loss function.
arXiv Detail & Related papers (2022-09-05T16:38:13Z) - Weakly-supervised Learning For Catheter Segmentation in 3D Frustum Ultrasound [74.22397862400177]
We propose a novel Frustum ultrasound-based catheter segmentation method.
The proposed method achieved state-of-the-art performance with an efficiency of 0.25 seconds per volume.
arXiv Detail & Related papers (2020-10-19T13:56:22Z)