VMatcher: State-Space Semi-Dense Local Feature Matching
- URL: http://arxiv.org/abs/2507.23371v1
- Date: Thu, 31 Jul 2025 09:39:16 GMT
- Title: VMatcher: State-Space Semi-Dense Local Feature Matching
- Authors: Ali Youssef
- Abstract summary: VMatcher is a hybrid Mamba-Transformer network for semi-dense feature matching between image pairs. VMatcher integrates Mamba's highly efficient long-sequence processing with the Transformer's attention mechanism.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces VMatcher, a hybrid Mamba-Transformer network for semi-dense feature matching between image pairs. Learning-based feature matching methods, whether detector-based or detector-free, achieve state-of-the-art performance but depend heavily on the Transformer's attention mechanism, which, while effective, incurs high computational costs due to its quadratic complexity. In contrast, Mamba introduces a Selective State-Space Model (SSM) that achieves comparable or superior performance with linear complexity, offering significant efficiency gains. VMatcher leverages a hybrid approach, integrating Mamba's highly efficient long-sequence processing with the Transformer's attention mechanism. Multiple VMatcher configurations are proposed, including hierarchical architectures, demonstrating their effectiveness in setting new benchmarks efficiently while ensuring robustness and practicality for real-time applications where rapid inference is crucial. Source Code is available at: https://github.com/ayoussf/VMatcher
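VMatcher's core idea is to interleave linear-complexity Mamba (SSM) layers with a smaller number of attention layers over flattened coarse feature tokens. The following is a minimal sketch of such an interleaved stack, assuming the `mamba_ssm` package's `Mamba` block and standard PyTorch attention; the layer ratio, dimensions, and class names are illustrative assumptions, not VMatcher's actual implementation.

```python
# Minimal sketch of a hybrid Mamba-Transformer stack over flattened image
# tokens. Layer ratio, dims, and class names are illustrative assumptions,
# not VMatcher's released code.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm

class AttentionBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, N, C) coarse tokens
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out                         # residual connection

class MambaBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mamba = Mamba(d_model=dim)        # linear-time selective SSM

    def forward(self, x):                      # x: (B, N, C)
        return x + self.mamba(self.norm(x))

class HybridStack(nn.Module):
    """Mostly-Mamba stack with attention interleaved every `every` layers."""
    def __init__(self, dim: int, depth: int = 8, every: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            AttentionBlock(dim) if (i + 1) % every == 0 else MambaBlock(dim)
            for i in range(depth)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
```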
Related papers
- JamMa: Ultra-lightweight Local Feature Matching with Joint Mamba [8.878053726388075]
We propose an ultra-lightweight Mamba-based matcher, named JamMa, which converges on a single GPU and achieves an impressive performance-efficiency balance in inference. To unlock the potential of Mamba for feature matching, we propose Joint Mamba with a scan-merge strategy named JEGO, which enables: (1) joint scan of two images to achieve high-frequency mutual interaction, (2) efficient scan with skip steps to reduce sequence length, (3) a global receptive field, and (4) omnidirectional feature representation.
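A toy sketch of the joint-scan idea: interleaving the two images' token sequences lets a single linear-time scan mix them, while a skip stride shortens the sequence. The function below is an illustrative assumption, not JamMa's JEGO implementation.

```python
# Toy sketch of a joint scan over two images' tokens: interleave both token
# sequences so one linear-time scan mixes them, and subsample with a skip
# stride to shorten the sequence. Illustrative only, not JamMa's JEGO code.
import torch

def joint_skip_scan_sequence(feat_a, feat_b, skip: int = 2):
    """feat_a, feat_b: (B, N, C) flattened tokens of the two images.

    Returns an interleaved, subsampled sequence of shape (B, 2*(N//skip), C)
    in which tokens of image A and image B alternate along the scan axis.
    """
    a = feat_a[:, ::skip]                    # skip steps reduce sequence length
    b = feat_b[:, ::skip]
    B, N, C = a.shape
    joint = torch.stack((a, b), dim=2)       # (B, N, 2, C): A/B alternate
    return joint.reshape(B, 2 * N, C)
```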
arXiv Detail & Related papers (2025-03-05T12:12:51Z)
- ContextFormer: Redefining Efficiency in Semantic Segmentation [48.81126061219231]
Convolutional methods, although capturing local dependencies well, struggle with long-range relationships. Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands. We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation.
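A minimal sketch of the general pattern, assuming a convolutional encoder with attention applied only at the small bottleneck map; this is illustrative, not ContextFormer's architecture.

```python
# Minimal sketch of a CNN encoder with attention applied only at the small
# bottleneck map, so global context stays cheap. Illustrative assumption,
# not ContextFormer's architecture.
import torch.nn as nn

class ConvAttnBottleneck(nn.Module):
    def __init__(self, in_ch: int = 3, dim: int = 256, heads: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(          # convs capture local structure
            nn.Conv2d(in_ch, dim // 4, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim // 4, dim // 2, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim // 2, dim, 3, stride=2, padding=1),
        )
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img):
        f = self.encoder(img)                  # (B, C, H/8, W/8)
        B, C, H, W = f.shape
        t = f.flatten(2).transpose(1, 2)       # (B, H*W, C) bottleneck tokens
        a, _ = self.attn(t, t, t)              # global context on a small map
        return (t + a).transpose(1, 2).reshape(B, C, H, W)
```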
arXiv Detail & Related papers (2025-01-31T16:11:04Z)
- Efficient Self-Supervised Video Hashing with Selective State Spaces [63.83300352372051]
Self-supervised video hashing (SSVH) is a practical task in video indexing and retrieval. We introduce S5VH, a Mamba-based video hashing model with an improved self-supervised learning paradigm.
arXiv Detail & Related papers (2024-12-19T04:33:22Z)
- CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting a decoupled learning scheme and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy.
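For context, linear attention replaces softmax attention's pairwise score matrix with kernelized feature maps, so keys and values are aggregated once in O(N). The sketch below shows only the generic computation, not CARE's decoupled dual-interactive variant.

```python
# Generic linear attention: phi(Q) @ (phi(K)^T @ V) with normalization,
# costing O(N * D^2) instead of O(N^2 * D). Shows the complexity idea only;
# CARE's decoupled dual-interactive design is more involved.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """q, k, v: (B, H, N, D) multi-head queries, keys, values."""
    q = F.elu(q) + 1.0                              # positive kernel feature map
    k = F.elu(k) + 1.0
    kv = torch.einsum("bhnd,bhne->bhde", k, v)      # aggregate K,V in one pass
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
```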
arXiv Detail & Related papers (2024-11-25T07:56:13Z)
- MobileMamba: Lightweight Multi-Receptive Visual Mamba Network [51.33486891724516]
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs.
We propose the MobileMamba framework, which balances efficiency and performance.
MobileMamba achieves up to 83.6% Top-1 accuracy, surpassing existing state-of-the-art methods.
arXiv Detail & Related papers (2024-11-24T18:01:05Z)
- Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed [42.861344584752]
The previous detector-free matcher LoFTR has shown remarkable matching capability in handling large viewpoint changes and texture-poor scenarios.
We revisit its design choices and derive multiple improvements for both efficiency and accuracy.
Our method can achieve higher accuracy compared with competitive semi-dense matchers.
arXiv Detail & Related papers (2024-03-07T18:58:40Z)
- Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR).
CFSR inherits the advantages of both convolution-based and transformer-based approaches.
Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z)
- ParaFormer: Parallel Attention Transformer for Efficient Feature Matching [8.552303361149612]
This paper proposes a novel parallel attention model entitled ParaFormer.
It fuses features and keypoint positions through the concept of amplitude and phase, and integrates self- and cross-attention in a parallel manner.
Experiments on various applications, including homography estimation, pose estimation, and image matching, demonstrate that ParaFormer achieves state-of-the-art performance.
The efficient ParaFormer-U variant achieves comparable performance with less than 50% of the FLOPs of existing attention-based models.
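A minimal sketch of the parallel self-/cross-attention pattern, assuming standard PyTorch attention and a simple concatenation fusion; the amplitude-phase position fusion and other ParaFormer specifics are omitted.

```python
# Sketch of running self- and cross-attention in parallel and fusing the
# results, illustrating the "parallel attention" idea; the amplitude-phase
# position fusion and other ParaFormer specifics are omitted.
import torch
import torch.nn as nn

class ParallelAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x, y):
        """x: (B, N, C) tokens of image A; y: (B, M, C) tokens of image B."""
        s, _ = self.self_attn(x, x, x)       # intra-image context
        c, _ = self.cross_attn(x, y, y)      # inter-image context, in parallel
        return x + self.fuse(torch.cat((s, c), dim=-1))
```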
arXiv Detail & Related papers (2023-03-02T03:29:16Z)
- ECO-TR: Efficient Correspondences Finding Via Coarse-to-Fine Refinement [80.94378602238432]
We propose an efficient structure named Correspondence Efficient Transformer (ECO-TR) by finding correspondences in a coarse-to-fine manner.
To achieve this, multiple transformer blocks are connected stage-wise to gradually refine the predicted coordinates.
Experiments on various sparse and dense matching tasks demonstrate the superiority of our method in both efficiency and effectiveness over existing state-of-the-art methods.
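A toy sketch of stage-wise coarse-to-fine refinement, where each stage predicts a coordinate offset; module names and structure are assumptions, not ECO-TR's implementation.

```python
# Toy sketch of stage-wise coarse-to-fine refinement: each stage updates the
# current correspondence estimate with a predicted offset. Module names and
# structure are assumptions, not ECO-TR's implementation.
import torch
import torch.nn as nn

class RefineStage(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.to_offset = nn.Linear(dim, 2)       # (dx, dy) per query point

    def forward(self, tokens, coords):
        tokens = self.encoder(tokens)
        return tokens, coords + self.to_offset(tokens)

class CoarseToFine(nn.Module):
    def __init__(self, dim: int, stages: int = 3):
        super().__init__()
        self.stages = nn.ModuleList(RefineStage(dim) for _ in range(stages))

    def forward(self, tokens, coords):           # coords: (B, N, 2)
        for stage in self.stages:                # gradually refine predictions
            tokens, coords = stage(tokens, coords)
        return coords
```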
arXiv Detail & Related papers (2022-09-25T13:05:33Z)
- Efficient Linear Attention for Fast and Accurate Keypoint Matching [0.9699586426043882]
Recently, Transformers have provided state-of-the-art performance in sparse matching, which is crucial for realizing high-performance 3D vision applications.
Yet, these Transformers lack efficiency due to the quadratic computational complexity of their attention mechanism.
We propose a new attentional aggregation that achieves high accuracy by aggregating both global and local information from sparse keypoints.
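A toy sketch of combining a global summary with a local k-nearest-neighbor aggregation over sparse keypoints; this illustrates only the global-plus-local idea and is not the paper's attention layer.

```python
# Toy sketch of mixing a global summary with a local k-nearest-neighbor
# average over sparse keypoints; illustrates the global-plus-local idea only,
# not the paper's attentional aggregation layer.
import torch

def aggregate_global_local(desc, xy, k: int = 8):
    """desc: (N, C) keypoint descriptors; xy: (N, 2) keypoint positions."""
    global_ctx = desc.mean(dim=0, keepdim=True).expand_as(desc)  # (N, C)
    dist = torch.cdist(xy, xy)                                   # (N, N)
    knn = dist.topk(k + 1, largest=False).indices[:, 1:]         # drop self
    local_ctx = desc[knn].mean(dim=1)                            # (N, C)
    return desc + global_ctx + local_ctx
```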
arXiv Detail & Related papers (2022-04-16T06:17:36Z)
- Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress pre-trained ViTs.
Our SPViT trims 52.0% of the FLOPs of DeiT-B while simultaneously gaining an impressive 0.6% in top-1 accuracy.
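A toy sketch of the underlying idea: a learnable gate softly selects between a self-attention branch and a cheaper convolutional branch during search; the gating scheme shown is an assumption, not SPViT's single-path parameterization.

```python
# Toy sketch of the idea: a learnable gate softly selects between a
# self-attention branch and a cheaper convolutional branch during search.
# The gating scheme is an assumption, not SPViT's single-path parameterization.
import torch
import torch.nn as nn

class AttnOrConv(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.gate = nn.Parameter(torch.zeros(1))   # architecture parameter

    def forward(self, x, h: int, w: int):
        """x: (B, N, C) tokens of an h*w grid, N == h * w."""
        a, _ = self.attn(x, x, x)
        B, N, C = x.shape
        c = self.conv(x.transpose(1, 2).reshape(B, C, h, w))
        c = c.flatten(2).transpose(1, 2)
        g = torch.sigmoid(self.gate)               # soft selection during search
        return x + g * a + (1 - g) * c
```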
arXiv Detail & Related papers (2021-11-23T11:35:54Z)