Hadamard Attention Recurrent Transformer: A Strong Baseline for Stereo Matching Transformer
- URL: http://arxiv.org/abs/2501.01023v2
- Date: Tue, 18 Mar 2025 15:30:22 GMT
- Title: Hadamard Attention Recurrent Transformer: A Strong Baseline for Stereo Matching Transformer
- Authors: Ziyang Chen, Yongjun Zhang, Wenting Li, Bingshu Wang, Yabo Wu, Yong Zhao, C. L. Philip Chen
- Abstract summary: We present the Hadamard Attention Recurrent Stereo Transformer (HART), which incorporates the following components. For faster inference, we present a Hadamard product paradigm for the attention mechanism, achieving linear computational complexity. We designed a Dense Attention Kernel (DAK) to amplify the differences between relevant and irrelevant feature responses. In reflective areas, HART ranked 1st on the KITTI 2012 benchmark among all published methods at the time of submission.
- Score: 54.97718043685824
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In light of the advancements in transformer technology, extant research posits the construction of stereo transformers as a potential solution to the binocular stereo matching challenge. However, constrained by the low-rank bottleneck and quadratic complexity of attention mechanisms, stereo transformers still fail to demonstrate sufficient nonlinear expressiveness within a reasonable inference time. The lack of focus on key homonymous points renders the representations of such methods vulnerable to challenging conditions, including reflections and weak textures. Furthermore, slow computation is not conducive to practical application. To overcome these difficulties, we present the Hadamard Attention Recurrent Stereo Transformer (HART), which incorporates the following components: 1) For faster inference, we present a Hadamard product paradigm for the attention mechanism, achieving linear computational complexity. 2) We designed a Dense Attention Kernel (DAK) to amplify the differences between relevant and irrelevant feature responses. This allows HART to focus on important details. DAK also converts zero elements to non-zero elements to mitigate the reduced expressiveness caused by the low-rank bottleneck. 3) To compensate for the spatial and channel interaction missing in the Hadamard product, we propose MKOI to capture both global and local information through the interleaving of large and small kernel convolutions. Experimental results demonstrate the effectiveness of our HART. In reflective areas, HART ranked 1st on the KITTI 2012 benchmark among all published methods at the time of submission. Code is available at https://github.com/ZYangChen/HART.
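For illustration, the sketch below shows one way an element-wise (Hadamard) attention with a dense, strictly positive kernel can be written in PyTorch. The module name, tensor shapes, projections, and the ELU-based kernel are assumptions for exposition, not the actual HART implementation (see the linked repository for the real code).

```python
# Illustrative sketch only: element-wise (Hadamard) attention with a dense,
# strictly positive kernel. Shapes, projections, and the ELU-based kernel are
# assumptions, not the HART implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HadamardAttentionSketch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Element-wise product instead of the N x N map Q @ K^T, so the cost
        # grows linearly with the number of tokens.
        scores = q * k
        # Dense-kernel idea: a strictly positive activation keeps every
        # response non-zero while still separating strong and weak responses.
        scores = F.elu(scores) + 1.0
        return self.out_proj(scores * v)

x = torch.randn(2, 1024, 64)
print(HadamardAttentionSketch(64)(x).shape)  # torch.Size([2, 1024, 64])
```

Because the interaction is element-wise, the cost of the attention step scales with the number of tokens rather than with its square, which is the linear-complexity claim made in the abstract.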
Related papers
- Transformer-Based Person Search with High-Frequency Augmentation and Multi-Wave Mixing [18.871765626140782]
We propose a novel High-frequency Augmentation and Multi-Wave mixing (HAMW) method for person search. HAMW is designed to enhance the discriminative feature extraction capabilities of transformers. HAMW achieves state-of-the-art performance on both the CUHK-SYSU and PRW datasets.
arXiv Detail & Related papers (2025-06-29T12:08:26Z) - FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation [14.903360987684483]
We propose FEAT, a full-dimensional efficient attention Transformer for high-quality dynamic medical videos. We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance.
arXiv Detail & Related papers (2025-06-05T12:31:02Z) - Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions [94.21989689001848]
We propose ΔConvFusion to replace conventional self-attention modules with Pyramid Convolution Blocks (ΔConvBlocks). By distilling attention patterns into localized convolutional operations while keeping other components frozen, ΔConvFusion achieves performance comparable to transformer-based counterparts while reducing computational cost by 6929× and surpassing LinFusion by 5.42× in efficiency, all without compromising generative fidelity.
arXiv Detail & Related papers (2025-04-30T03:57:28Z) - Neural Attention: A Novel Mechanism for Enhanced Expressive Power in Transformer Models [0.0]
We propose a technique that replaces dot products with feed-forward networks, enabling a more expressive representation of relationships between tokens. This work establishes Neural Attention as an effective means of enhancing the predictive capabilities of transformer models across a variety of applications.
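As a rough illustration of the idea summarized above (scoring each query-key pair with a small feed-forward network instead of a dot product), a minimal PyTorch sketch might look as follows; the shapes and the scoring MLP are assumptions, not the paper's implementation.

```python
# Illustrative sketch only: score each (query, key) pair with a small MLP
# instead of a dot product. Shapes and the scoring network are assumptions.
import torch
import torch.nn as nn

class PairwiseMLPAttention(nn.Module):
    def __init__(self, dim: int, hidden: int = 32):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, q, k, v):
        # q, k, v: (batch, tokens, dim)
        n = q.size(1)
        # All (query, key) pairs: (batch, n, n, 2 * dim).
        pairs = torch.cat([q.unsqueeze(2).expand(-1, -1, n, -1),
                           k.unsqueeze(1).expand(-1, n, -1, -1)], dim=-1)
        attn = self.score(pairs).squeeze(-1).softmax(dim=-1)  # (batch, n, n)
        return attn @ v
```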
arXiv Detail & Related papers (2025-02-24T14:39:40Z) - Boosting ViT-based MRI Reconstruction from the Perspectives of Frequency Modulation, Spatial Purification, and Scale Diversification [6.341065683872316]
ViTs struggle to capture high-frequency components of images, limiting their ability to detect local textures and edge information. Previous methods calculate multi-head self-attention (MSA) among both related and unrelated tokens in content. The naive feed-forward network in ViTs cannot model the multi-scale information that is important for image restoration.
arXiv Detail & Related papers (2024-12-14T10:03:08Z) - HAAT: Hybrid Attention Aggregation Transformer for Image Super-Resolution [6.583111551092333]
This paper introduces a novel model, the Hybrid Attention Aggregation Transformer (HAAT). It is constructed by integrating Swin-Dense-Residual-Connected Blocks (SDRCB) with Hybrid Grid Attention Blocks (HGAB). HGAB incorporates channel attention, sparse attention, and window attention to improve nonlocal feature fusion and achieve more visually compelling results.
arXiv Detail & Related papers (2024-11-27T02:47:17Z) - CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that separates the learning of local inductive bias from that of long-range dependencies.
By adopting this decoupled learning approach and fully exploiting the complementarity across features, our method achieves both high efficiency and accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z) - Breaking the Low-Rank Dilemma of Linear Attention [61.55583836370135]
Linear attention provides a far more efficient solution by reducing the complexity to linear levels. Our experiments indicate that linear attention's performance drop relative to Softmax attention is due to the low-rank nature of its feature map. We introduce Rank-Augmented Linear Attention (RALA), which rivals the performance of Softmax attention while maintaining linear complexity and high efficiency.
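For context on why linear attention is linear, the sketch below shows the generic kernelized form (compute phi(K)^T V once, then apply phi(Q)). This is the standard formulation the entry refers to, not RALA itself, and the ReLU feature map is an illustrative assumption.

```python
# Generic kernelized linear attention (illustration, not RALA): with a
# positive feature map phi, softmax attention is approximated as
# phi(Q) @ (phi(K)^T @ V), which avoids forming the N x N attention matrix.
import torch

def linear_attention(q, k, v, eps: float = 1e-6):
    # q, k, v: (batch, tokens, dim)
    phi_q, phi_k = q.relu() + eps, k.relu() + eps   # illustrative feature map
    kv = phi_k.transpose(1, 2) @ v                  # (batch, dim, dim)
    norm = phi_q @ phi_k.sum(dim=1).unsqueeze(-1)   # (batch, tokens, 1)
    # Note: kv has at most rank `dim`, which is the low-rank bottleneck the
    # entry above refers to.
    return (phi_q @ kv) / norm
```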
arXiv Detail & Related papers (2024-11-12T08:30:59Z) - Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
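A minimal sketch of the differential-attention idea summarized above: subtract a second softmax attention map, scaled by a factor lambda, so that attention noise common to both maps cancels. The single-head form, fixed lambda, and projections are simplifications; see the paper for the exact parameterization.

```python
# Illustrative sketch of differential attention: the difference of two
# softmax attention maps. Not the paper's exact multi-head formulation.
import torch

def differential_attention(q1, k1, q2, k2, v, lam: float = 0.5):
    # q*, k*: (batch, tokens, dim); v: (batch, tokens, dim_v)
    d = q1.size(-1)
    a1 = torch.softmax(q1 @ k1.transpose(1, 2) / d ** 0.5, dim=-1)
    a2 = torch.softmax(q2 @ k2.transpose(1, 2) / d ** 0.5, dim=-1)
    return (a1 - lam * a2) @ v
```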
arXiv Detail & Related papers (2024-10-07T17:57:38Z) - Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators [83.48423407316713]
We present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately.
Our model initiates the denoising process with a precise, non-ambiguous stage and gradually transitions to a phase enriched with detail.
Our method achieves a state-of-the-art FID score of 2.01 when integrated with the recent work SiT.
arXiv Detail & Related papers (2024-08-11T07:01:39Z) - FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z) - Convolution and Attention Mixer for Synthetic Aperture Radar Image Change Detection [41.38587746899477]
Synthetic aperture radar (SAR) image change detection is a critical task and has received increasing attention in the remote sensing community.
Existing SAR change detection methods are mainly based on convolutional neural networks (CNNs).
We propose a convolution and attention mixer (CAMixer) to incorporate global attention.
arXiv Detail & Related papers (2023-09-21T12:28:23Z) - Exploring Frequency-Inspired Optimization in Transformer for Efficient Single Image Super-Resolution [32.29219284419944]
We propose a cross-refinement adaptive feature modulation transformer (CRAFT). We introduce a frequency-guided post-training quantization (PTQ) method aimed at enhancing CRAFT's efficiency. Our experimental findings showcase CRAFT's superiority over current state-of-the-art methods.
arXiv Detail & Related papers (2023-08-09T15:38:36Z) - FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z) - Reciprocal Attention Mixing Transformer for Lightweight Image Restoration [6.3159191692241095]
We propose a lightweight image restoration network, the Reciprocal Attention Mixing Transformer (RAMiT).
It employs bi-dimensional (spatial and channel) self-attention in parallel with different numbers of heads.
It achieves state-of-the-art performance on multiple lightweight IR tasks, including super-resolution, color denoising, grayscale denoising, low-light enhancement, and deraining.
arXiv Detail & Related papers (2023-05-19T06:55:04Z) - Spectral Enhanced Rectangle Transformer for Hyperspectral Image Denoising [64.11157141177208]
We propose a spectral enhanced rectangle Transformer to model the spatial and spectral correlation in hyperspectral images.
For the former, we exploit the rectangle self-attention horizontally and vertically to capture the non-local similarity in the spatial domain.
For the latter, we design a spectral enhancement module that is capable of extracting global underlying low-rank property of spatial-spectral cubes to suppress noise.
arXiv Detail & Related papers (2023-04-03T09:42:13Z) - Stabilizing Transformer Training by Preventing Attention Entropy Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that σReparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with σReparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
arXiv Detail & Related papers (2023-03-11T03:30:47Z) - Hyperbolic Cosine Transformer for LiDAR 3D Object Detection [6.2216654973540795]
We propose a two-stage hyperbolic cosine transformer (ChTR3D) for 3D object detection from LiDAR point clouds.
The proposed ChTR3D refines proposals by applying cosh-attention in linear complexity to encode rich contextual relationships among points.
Experiments on the widely used KITTI dataset demonstrate that, compared with vanilla attention, the cosh-attention significantly improves the inference speed with competitive performance.
arXiv Detail & Related papers (2022-11-10T13:54:49Z) - The Devil in Linear Transformer [42.232886799710215]
Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers.
They usually suffer from degraded performance on various tasks and corpora.
In this paper, we identify two key issues that lead to such performance gaps.
arXiv Detail & Related papers (2022-10-19T07:15:35Z) - Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice [111.47461527901318]
Vision Transformer (ViT) has recently demonstrated promise in computer vision problems.
ViT performance saturates quickly as depth increases, due to the observed attention collapse or patch uniformity.
We propose two techniques to mitigate the undesirable low-pass limitation.
arXiv Detail & Related papers (2022-03-09T23:55:24Z) - The KFIoU Loss for Rotated Object Detection [115.334070064346]
In this paper, we argue that one effective alternative is to devise an approximate loss that can achieve trend-level alignment with the SkewIoU loss.
Specifically, we model the objects as Gaussian distributions and adopt a Kalman filter to inherently mimic the mechanism of SkewIoU.
The resulting new loss, called KFIoU, is easier to implement and works better than the exact SkewIoU.
arXiv Detail & Related papers (2022-01-29T10:54:57Z) - Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than the ones based on competing architectures for a large set of tasks.
A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances.
We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
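One simple way to realize a doubly-normalized attention map is sketched below for illustration: normalize the exponentiated scores over the query axis first and then over the key axis, so weights are normalized over both dimensions rather than only over keys. The paper's exact normalization scheme may differ.

```python
# Illustrative doubly-normalized attention: normalize over queries, then over
# keys. One simple realization of double normalization; not necessarily the
# paper's exact scheme.
import torch

def doubly_normalized_attention(q, k, v, eps: float = 1e-6):
    # q, k, v: (batch, tokens, dim)
    logits = q @ k.transpose(1, 2) / q.size(-1) ** 0.5
    scores = torch.exp(logits - logits.amax(dim=(1, 2), keepdim=True))  # stabilized exp
    scores = scores / (scores.sum(dim=1, keepdim=True) + eps)  # normalize over queries
    attn = scores / (scores.sum(dim=2, keepdim=True) + eps)    # then over keys
    return attn @ v
```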
arXiv Detail & Related papers (2020-09-29T21:05:39Z)