Efficient Linear Attention for Fast and Accurate Keypoint Matching
- URL: http://arxiv.org/abs/2204.07731v1
- Date: Sat, 16 Apr 2022 06:17:36 GMT
- Title: Efficient Linear Attention for Fast and Accurate Keypoint Matching
- Authors: Suwichaya Suwanwimolkul and Satoshi Komorita
- Abstract summary: Recently, Transformers have provided state-of-the-art performance in sparse matching, crucial to realizing high-performance 3D vision applications.
Yet, these Transformers lack efficiency due to the quadratic computational complexity of their attention mechanism.
We propose a new attentional aggregation that achieves high accuracy by aggregating both the global and local information from sparse keypoints.
- Score: 0.9699586426043882
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, Transformers have provided state-of-the-art performance in sparse
matching, crucial to realizing high-performance 3D vision applications. Yet,
these Transformers lack efficiency due to the quadratic computational
complexity of their attention mechanism. To solve this problem, we employ an
efficient linear attention to achieve linear computational complexity. Then, we
propose a new attentional aggregation that achieves high accuracy by
aggregating both the global and local information from sparse keypoints. To
further improve the efficiency, we propose the joint learning of feature
matching and description. Our learning enables simpler and faster matching than
Sinkhorn, often used in matching the learned descriptors from Transformers. Our
method achieves competitive performance with only 0.84M learnable parameters
against the bigger SOTAs, SuperGlue (12M parameters) and SGMNet (30M
parameters), on three benchmarks, HPatch, ETH, and Aachen Day-Night.
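The complexity reduction described in the abstract hinges on reordering the attention product: instead of forming the N×N matrix softmax(QKᵀ) and multiplying it by V, a positive feature map φ lets one compute φ(Q)(φ(K)ᵀV), which is linear in the number of keypoints N. A minimal NumPy sketch of this standard linearization follows; the feature map shown (a relu+1 variant) is an illustrative assumption, and the paper's exact choice of feature map and aggregation may differ:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: O(N^2) in the number of keypoints N."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, feature_map=lambda x: np.maximum(x, 0) + 1.0):
    """Linearized attention: O(N) by reordering (phi(Q) phi(K)^T) V
    into phi(Q) (phi(K)^T V). feature_map must be positive so the
    normalization below is well defined; relu+1 here is an assumption."""
    Qf, Kf = feature_map(Q), feature_map(K)
    KV = Kf.T @ V            # (d, d_v): keys/values aggregated once
    Z = Qf @ Kf.sum(axis=0)  # (N,): per-query normalization
    return (Qf @ KV) / Z[:, None]
```

Because the feature map is positive, each output row is a convex combination of the rows of V, just as in softmax attention, while the N×N score matrix is never materialized.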
Related papers
- CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting a decoupled learning way and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z) - Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed [42.861344584752]
Previous detector-free matcher LoFTR has shown remarkable matching capability in handling large-viewpoint change and texture-poor scenarios.
We revisit its design choices and derive multiple improvements for both efficiency and accuracy.
Our method can achieve higher accuracy compared with competitive semi-dense matchers.
arXiv Detail & Related papers (2024-03-07T18:58:40Z) - Point Transformer V3: Simpler, Faster, Stronger [88.80496333515325]
This paper focuses on overcoming the existing trade-offs between accuracy and efficiency within the context of point cloud processing.
We present Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of certain mechanisms.
PTv3 attains state-of-the-art results on over 20 downstream tasks that span both indoor and outdoor scenarios.
arXiv Detail & Related papers (2023-12-15T18:59:59Z) - UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z) - Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences [16.066338004414092]
Diffuser is a new efficient Transformer for sequence-to-sequence modeling.
It incorporates all token interactions within one attention layer while maintaining low computation and memory costs.
We show its ability to approximate full-attention by analyzing the graph expander property from the spectral perspective.
arXiv Detail & Related papers (2022-10-21T08:13:34Z) - Linear Video Transformer with Feature Fixation [34.324346469406926]
Vision Transformers have achieved impressive performance in video classification, while suffering from the quadratic complexity caused by the Softmax attention mechanism.
We propose a feature fixation module to reweight the feature importance of the query and key before computing linear attention.
We achieve state-of-the-art performance among linear video Transformers on three popular video classification benchmarks.
arXiv Detail & Related papers (2022-10-15T02:20:50Z) - Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation [6.303594714446706]
The self-attention mechanism gauges pairwise correlations across the entire input sequence.
Despite its favorable performance, calculating these pairwise correlations is prohibitively costly.
This work addresses these constraints by architecting an accelerator, called SPRINT, which computes attention scores in an approximate manner.
arXiv Detail & Related papers (2022-09-01T17:18:19Z) - CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We adopt Transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z) - DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition [140.66371549815034]
We propose a new transformer architecture, termed DualFormer, which can effectively and efficiently perform space-time attention for video recognition.
We show that DualFormer sets new state-of-the-art 82.9%/85.2% top-1 accuracy on Kinetics-400/600 with around 1000G inference FLOPs which is at least 3.2 times fewer than existing methods with similar performances.
arXiv Detail & Related papers (2021-12-09T03:05:19Z) - Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
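The Toeplitz observation above admits a classic O(N log N) trick: a Toeplitz matrix embeds in a circulant matrix, whose matrix-vector product is a circular convolution computable with the FFT. A NumPy sketch of just this building block follows; the function name is illustrative, and the paper's full kernelized-attention pipeline is more involved than this single multiplication:

```python
import numpy as np

def toeplitz_matvec(first_col, first_row, v):
    """Multiply a Toeplitz matrix T (defined by its first column and
    first row, with first_col[0] == first_row[0]) by a vector v in
    O(N log N) via circulant embedding and the FFT."""
    n = len(v)
    # Embed T in a (2n-1)-circulant whose first column is
    # [t_0, ..., t_{n-1}, t_{-(n-1)}, ..., t_{-1}].
    c = np.concatenate([first_col, first_row[:0:-1]])
    # Circular convolution via FFT; the first n entries equal T @ v.
    fv = np.fft.fft(np.concatenate([v, np.zeros(n - 1)]))
    return np.fft.ifft(np.fft.fft(c) * fv)[:n].real
```

The dense product costs O(N²); here the Toeplitz structure (each entry depending only on i−j, exactly the form relative positional encodings take) reduces it to three FFTs of length 2N−1.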
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.