MatchFormer: Interleaving Attention in Transformers for Feature Matching
- URL: http://arxiv.org/abs/2203.09645v1
- Date: Thu, 17 Mar 2022 22:49:14 GMT
- Title: MatchFormer: Interleaving Attention in Transformers for Feature Matching
- Authors: Qing Wang, Jiaming Zhang, Kailun Yang, Kunyu Peng, Rainer Stiefelhagen
- Abstract summary: We propose a novel hierarchical extract-and-match transformer, termed MatchFormer.
We interleave self-attention for feature extraction and cross-attention for feature matching, enabling a human-intuitive extract-and-match scheme.
Thanks to such a strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision.
- Score: 31.175513306917654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Local feature matching is a computationally intensive task at the subpixel
level. While detector-based methods coupled with feature descriptors struggle
in low-texture scenes, CNN-based methods with a sequential extract-to-match
pipeline fail to make use of the matching capacity of the encoder and tend to
overburden the decoder for matching. In contrast, we propose a novel
hierarchical extract-and-match transformer, termed MatchFormer. Inside each
stage of the hierarchical encoder, we interleave self-attention for feature
extraction and cross-attention for feature matching, enabling a human-intuitive
extract-and-match scheme. Such a match-aware encoder relieves the overloaded
decoder and makes the model highly efficient. Further, combining self- and
cross-attention on multi-scale features in a hierarchical architecture improves
matching robustness, particularly in low-texture indoor scenes or with less
outdoor training data. Thanks to such a strategy, MatchFormer is a multi-win
solution in efficiency, robustness, and precision. Compared to the previous
best method in indoor pose estimation, our lite MatchFormer has only 45%
GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The
large MatchFormer reaches state-of-the-art on four different benchmarks,
including indoor pose estimation (ScanNet), outdoor pose estimation
(MegaDepth), homography estimation and image matching (HPatches), and visual
localization (InLoc). Code will be made publicly available at
https://github.com/jamycheung/MatchFormer.
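The central architectural idea, interleaving self-attention (extract) and cross-attention (match) inside every encoder stage, can be illustrated with a short PyTorch sketch. This is a minimal sketch under assumed shapes and vanilla multi-head attention, not the authors' implementation: MatchFormer's efficient attention variants, positional encodings, and hierarchical down-sampling are omitted, and the module and parameter names are invented for clarity.

```python
import torch
import torch.nn as nn

class InterleavedStage(nn.Module):
    """One encoder stage that alternates self-attention (feature
    extraction) with cross-attention (feature matching) on two views.
    Simplified sketch; the real model uses efficient attention and
    down-samples features between hierarchical stages."""
    def __init__(self, dim: int, num_heads: int = 4, depth: int = 2):
        super().__init__()
        self.self_blocks = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(depth)])
        self.cross_blocks = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(depth)])

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # feat_a, feat_b: (batch, tokens, dim) flattened feature maps
        for self_attn, cross_attn in zip(self.self_blocks, self.cross_blocks):
            # Self-attention: each view refines its own features.
            feat_a = feat_a + self_attn(feat_a, feat_a, feat_a)[0]
            feat_b = feat_b + self_attn(feat_b, feat_b, feat_b)[0]
            # Cross-attention: each view queries the other, making
            # the encoder itself match-aware.
            new_a = feat_a + cross_attn(feat_a, feat_b, feat_b)[0]
            new_b = feat_b + cross_attn(feat_b, feat_a, feat_a)[0]
            feat_a, feat_b = new_a, new_b
        return feat_a, feat_b

# Example: two 32x32 feature maps with 64 channels, flattened to tokens.
stage = InterleavedStage(dim=64)
a = torch.randn(1, 32 * 32, 64)
b = torch.randn(1, 32 * 32, 64)
a_out, b_out = stage(a, b)
print(a_out.shape)  # torch.Size([1, 1024, 64])
```

The design point mirrored here is that matching information already flows between the two views inside the encoder, so the decoder no longer has to carry all of the matching work.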
Related papers
- No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images [100.80376573969045]
NoPoSplat is a feed-forward model capable of reconstructing 3D scenes parameterized by 3D Gaussians from multi-view images.
Our model achieves real-time 3D Gaussian reconstruction during inference.
This work makes significant advances in pose-free generalizable 3D reconstruction and demonstrates its applicability to real-world scenarios.
arXiv Detail & Related papers (2024-10-31T17:58:22Z)
- Grounding Image Matching in 3D with MASt3R [8.14650201701567]
We propose to cast matching as a 3D task with DUSt3R, a powerful 3D reconstruction framework based on Transformers.
We propose to augment the DUSt3R network with a new head that outputs dense local features, trained with an additional matching loss.
Our approach, coined MASt3R, significantly outperforms the state of the art on multiple matching tasks.
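As a rough illustration of attaching a dense local-feature head to a reconstruction backbone, the sketch below maps backbone feature maps to per-pixel descriptors and trains them with an InfoNCE-style matching loss. The head layout, dimensions, loss, and the identity ground-truth correspondences are all assumptions for illustration; this is not the actual MASt3R head or loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFeatureHead(nn.Module):
    """Maps backbone feature maps to L2-normalized per-pixel
    descriptors. Purely illustrative; not the MASt3R head."""
    def __init__(self, in_ch: int = 256, desc_dim: int = 128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch, desc_dim, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), dim=1)  # (B, D, H, W)

def matching_loss(desc_a, desc_b, temperature: float = 0.07):
    """InfoNCE-style loss assuming pixel i in view A corresponds to
    pixel i in view B (stand-in for real ground-truth matches)."""
    B, D, H, W = desc_a.shape
    a = desc_a.flatten(2).transpose(1, 2)  # (B, HW, D)
    b = desc_b.flatten(2).transpose(1, 2)
    logits = a @ b.transpose(1, 2) / temperature  # (B, HW, HW)
    target = torch.arange(H * W, device=a.device).expand(B, -1)
    return F.cross_entropy(logits.flatten(0, 1), target.flatten())

head = DenseFeatureHead()
fa, fb = torch.randn(2, 2, 256, 16, 16)  # fake backbone features, two views
loss = matching_loss(head(fa), head(fb))
loss.backward()
```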
arXiv Detail & Related papers (2024-06-14T06:46:30Z)
- PoseMatcher: One-shot 6D Object Pose Estimation by Deep Feature Matching [51.142988196855484]
We propose PoseMatcher, an accurate, model-free, one-shot object pose estimator.
We create a new training pipeline for object-to-image matching based on a three-view system.
To enable PoseMatcher to attend to distinct input modalities, an image and a point cloud, we introduce IO-Layer.
arXiv Detail & Related papers (2023-04-03T21:14:59Z)
- DeepMatcher: A Deep Transformer-based Network for Robust and Accurate Local Feature Matching [9.662752427139496]
We propose a deep Transformer-based network built upon our investigation of local feature matching in detector-free methods.
DeepMatcher captures more human-intuitive and simpler-to-match features.
We show that DeepMatcher significantly outperforms the state-of-the-art methods on several benchmarks.
arXiv Detail & Related papers (2023-01-08T07:15:09Z)
- NeuMap: Neural Coordinate Mapping by Auto-Transdecoder for Camera Localization [60.73541222862195]
NeuMap is an end-to-end neural mapping method for camera localization.
It encodes a whole scene into a grid of latent codes, with which a Transformer-based auto-decoder regresses 3D coordinates of query pixels.
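A minimal sketch of the latent-code idea, under assumed shapes and a single vanilla cross-attention layer (not NeuMap's actual architecture): query-pixel features attend to a learned set of scene codes, and a small head regresses 3D coordinates.

```python
import torch
import torch.nn as nn

class LatentCodeDecoder(nn.Module):
    """Illustrative auto-decoder: a set of latent codes represents the
    scene; query-pixel features attend to the codes and regress 3D
    coordinates. Not the actual NeuMap architecture."""
    def __init__(self, num_codes: int = 64, dim: int = 128):
        super().__init__()
        # Learned scene representation, optimized per scene (auto-decoding).
        self.codes = nn.Parameter(torch.randn(num_codes, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 3))  # (x, y, z)

    def forward(self, query_feats: torch.Tensor) -> torch.Tensor:
        # query_feats: (batch, num_pixels, dim) features of query pixels
        codes = self.codes.unsqueeze(0).expand(query_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(query_feats, codes, codes)
        return self.head(attended)  # (batch, num_pixels, 3) scene coords

decoder = LatentCodeDecoder()
pixels = torch.randn(1, 500, 128)   # features of 500 query pixels
print(decoder(pixels).shape)        # torch.Size([1, 500, 3])
```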
arXiv Detail & Related papers (2022-11-21T04:46:22Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We employ transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
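The efficiency idea of attending to a sampled subset rather than to all points can be sketched generically. Below, every point attends only to M randomly sampled anchors, reducing attention cost from O(N^2) to O(NM); the random sampling, shapes, and names are placeholders, and the paper's actual sampling-and-grouping scheme may differ.

```python
import torch
import torch.nn as nn

class SampledCrossAttention(nn.Module):
    """All N points attend to M << N sampled anchor points, cutting
    attention cost from O(N^2) to O(N*M). Generic sketch only."""
    def __init__(self, dim: int = 64, num_heads: int = 4,
                 num_anchors: int = 128):
        super().__init__()
        self.num_anchors = num_anchors
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, N, dim). Sample M anchors (random here;
        # farthest-point sampling would be more typical for point clouds).
        idx = torch.randperm(points.size(1))[: self.num_anchors]
        anchors = points[:, idx]
        out, _ = self.attn(points, anchors, anchors)
        return points + out  # residual connection

layer = SampledCrossAttention()
cloud = torch.randn(2, 4096, 64)
print(layer(cloud).shape)  # torch.Size([2, 4096, 64])
```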
arXiv Detail & Related papers (2022-07-31T21:39:15Z) - Adaptive Assignment for Geometry Aware Local Feature Matching [22.818457285745733]
Detector-free feature matching approaches are currently attracting great attention thanks to their excellent performance.
We introduce AdaMatcher, which accomplishes the feature correlation and co-visible area estimation through an elaborate feature interaction module.
AdaMatcher then performs adaptive assignment on patch-level matching while estimating the scales between images, and finally refines the co-visible matches through scale alignment and sub-pixel regression module.
arXiv Detail & Related papers (2022-07-18T08:22:18Z)
- Learning Tracking Representations via Dual-Branch Fully Transformer Networks [82.21771581817937]
We present a Siamese-like dual-branch network based solely on Transformers for tracking.
We extract a feature vector for each patch based on its matching results with others within an attention window.
The method achieves results better than or comparable to those of the best-performing methods.
arXiv Detail & Related papers (2021-12-05T13:44:33Z)
- DFM: A Performance Baseline for Deep Feature Matching [10.014010310188821]
The proposed method uses a pre-trained VGG architecture as a feature extractor and does not require any additional matching-specific training.
Our algorithm achieves 0.57 and 0.80 overall scores in terms of Mean Matching Accuracy (MMA) at 1-pixel and 2-pixel thresholds, respectively, on the HPatches dataset.
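Because the baseline is essentially off-the-shelf VGG features plus matching, it is easy to sketch with torchvision: extract a deep feature map from each image with a frozen pre-trained VGG and keep mutual nearest neighbours as matches. The layer cut-off and the plain mutual-nearest-neighbour rule are simplifications; DFM's hierarchical coarse-to-fine refinement is omitted.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Pre-trained VGG as a frozen feature extractor (no task-specific
# training). Cutting at layer 26 is an assumption; DFM actually uses
# several layers in a coarse-to-fine manner.
extractor = vgg19(weights="IMAGENET1K_V1").features[:27].eval()

@torch.no_grad()
def mutual_nn_matches(img_a: torch.Tensor, img_b: torch.Tensor):
    """Return indices of mutually-nearest-neighbour feature matches."""
    fa = F.normalize(extractor(img_a).flatten(2).squeeze(0), dim=0)  # (C, Na)
    fb = F.normalize(extractor(img_b).flatten(2).squeeze(0), dim=0)  # (C, Nb)
    sim = fa.t() @ fb                 # cosine similarity (Na, Nb)
    nn_ab = sim.argmax(dim=1)         # best cell in B for each cell in A
    nn_ba = sim.argmax(dim=0)         # best cell in A for each cell in B
    a_idx = torch.arange(sim.size(0))
    mutual = nn_ba[nn_ab] == a_idx    # keep only mutual agreements
    return a_idx[mutual], nn_ab[mutual]

ia, ib = torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)
idx_a, idx_b = mutual_nn_matches(ia, ib)
print(idx_a.shape, idx_b.shape)
```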
arXiv Detail & Related papers (2021-06-14T22:55:06Z)
- DeepI2P: Image-to-Point Cloud Registration via Deep Classification [71.3121124994105]
DeepI2P is a novel approach for cross-modality registration between an image and a point cloud.
Our method estimates the relative rigid transformation between the coordinate frames of the camera and the LiDAR.
We circumvent the difficulty by converting the registration problem into a classification and inverse camera projection optimization problem.
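The classification reformulation can be made concrete with a toy helper: under a candidate pose, each 3D point is labelled by whether it projects inside the image, and registration then amounts to finding the pose whose labels best agree with the network's per-point predictions. The helper below is illustrative only (assumed intrinsics and shapes); DeepI2P's networks and its inverse-projection solver are not shown.

```python
import torch

def frustum_labels(points: torch.Tensor, R: torch.Tensor, t: torch.Tensor,
                   K: torch.Tensor, width: int, height: int) -> torch.Tensor:
    """Label each 3D point 1 if it projects inside the image, else 0.
    Toy version of the classification target in the DeepI2P idea."""
    cam = points @ R.t() + t            # world -> camera coordinates
    in_front = cam[:, 2] > 0            # must lie in front of the camera
    uv = cam @ K.t()                    # project with intrinsics
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)
    inside = ((uv[:, 0] >= 0) & (uv[:, 0] < width) &
              (uv[:, 1] >= 0) & (uv[:, 1] < height))
    return (in_front & inside).float()

K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
pts = torch.randn(1000, 3) * 5.0
labels = frustum_labels(pts, torch.eye(3), torch.zeros(3), K, 640, 480)
print(labels.mean())  # fraction of points inside the frustum
```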
arXiv Detail & Related papers (2021-04-08T04:27:32Z)