MatchFormer: Interleaving Attention in Transformers for Feature Matching
- URL: http://arxiv.org/abs/2203.09645v1
- Date: Thu, 17 Mar 2022 22:49:14 GMT
- Title: MatchFormer: Interleaving Attention in Transformers for Feature Matching
- Authors: Qing Wang, Jiaming Zhang, Kailun Yang, Kunyu Peng, Rainer Stiefelhagen
- Abstract summary: We propose a novel hierarchical extract-and-match transformer, termed MatchFormer.
We interleave self-attention for feature extraction and cross-attention for feature matching, enabling a human-intuitive extract-and-match scheme.
Thanks to such a strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision.
- Score: 31.175513306917654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Local feature matching is a computationally intensive task at the subpixel
level. While detector-based methods coupled with feature descriptors struggle
in low-texture scenes, CNN-based methods with a sequential extract-to-match
pipeline fail to make use of the matching capacity of the encoder and tend to
overburden the decoder for matching. In contrast, we propose a novel
hierarchical extract-and-match transformer, termed MatchFormer. Inside each
stage of the hierarchical encoder, we interleave self-attention for feature
extraction and cross-attention for feature matching, enabling a human-intuitive
extract-and-match scheme. Such a match-aware encoder relieves the overloaded
decoder and makes the model highly efficient. Further, combining self- and
cross-attention on multi-scale features in a hierarchical architecture improves
matching robustness, particularly in low-texture indoor scenes or with less
outdoor training data. Thanks to such a strategy, MatchFormer is a multi-win
solution in efficiency, robustness, and precision. Compared to the previous
best method in indoor pose estimation, our lite MatchFormer has only 45%
GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The
large MatchFormer reaches state-of-the-art on four different benchmarks,
including indoor pose estimation (ScanNet), outdoor pose estimation
(MegaDepth), homography estimation and image matching (HPatches), and visual
localization (InLoc). Code will be made publicly available at
https://github.com/jamycheung/MatchFormer.
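The central architectural idea, interleaving self-attention (extract) and cross-attention (match) inside every encoder stage, can be illustrated with a short PyTorch sketch. This is a minimal sketch under assumed shapes and vanilla multi-head attention, not the authors' implementation: MatchFormer's efficient attention variants, positional encodings, and hierarchical down-sampling are omitted, and the module and parameter names are invented for clarity.

```python
import torch
import torch.nn as nn

class InterleavedStage(nn.Module):
    """One encoder stage that alternates self-attention (feature
    extraction) with cross-attention (feature matching) on two views.
    Simplified sketch; the real model uses efficient attention and
    down-samples features between hierarchical stages."""
    def __init__(self, dim: int, num_heads: int = 4, depth: int = 2):
        super().__init__()
        self.self_blocks = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(depth)])
        self.cross_blocks = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(depth)])

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # feat_a, feat_b: (batch, tokens, dim) flattened feature maps
        for self_attn, cross_attn in zip(self.self_blocks, self.cross_blocks):
            # Self-attention: each view refines its own features.
            feat_a = feat_a + self_attn(feat_a, feat_a, feat_a)[0]
            feat_b = feat_b + self_attn(feat_b, feat_b, feat_b)[0]
            # Cross-attention: each view queries the other, making
            # the encoder itself match-aware.
            new_a = feat_a + cross_attn(feat_a, feat_b, feat_b)[0]
            new_b = feat_b + cross_attn(feat_b, feat_a, feat_a)[0]
            feat_a, feat_b = new_a, new_b
        return feat_a, feat_b

# Example: two 32x32 feature maps with 64 channels, flattened to tokens.
stage = InterleavedStage(dim=64)
a = torch.randn(1, 32 * 32, 64)
b = torch.randn(1, 32 * 32, 64)
a_out, b_out = stage(a, b)
print(a_out.shape)  # torch.Size([1, 1024, 64])
```

The design point mirrored here is that matching information already flows between the two views inside the encoder, so the decoder no longer has to carry all of the matching work.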
Related papers
- No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images [100.80376573969045]
NoPoSplat is a feed-forward model capable of reconstructing 3D scenes parameterized by 3D Gaussians from multi-view images.
Our model achieves real-time 3D Gaussian reconstruction during inference.
This work makes significant advances in pose-free generalizable 3D reconstruction and demonstrates its applicability to real-world scenarios.
arXiv Detail & Related papers (2024-10-31T17:58:22Z)
- Grounding Image Matching in 3D with MASt3R [8.14650201701567]
We propose to cast matching as a 3D task with DUSt3R, a powerful 3D reconstruction framework based on Transformers.
We propose to augment the DUSt3R network with a new head that outputs dense local features, trained with an additional matching loss.
Our approach, coined MASt3R, significantly outperforms the state of the art on multiple matching tasks.
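As a rough illustration of attaching a dense local-feature head to a reconstruction backbone, the sketch below maps backbone feature maps to per-pixel descriptors and trains them with an InfoNCE-style matching loss. The head layout, dimensions, loss, and the identity ground-truth correspondences are all assumptions for illustration; this is not the actual MASt3R head or loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFeatureHead(nn.Module):
    """Maps backbone feature maps to L2-normalized per-pixel
    descriptors. Purely illustrative; not the MASt3R head."""
    def __init__(self, in_ch: int = 256, desc_dim: int = 128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch, desc_dim, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), dim=1)  # (B, D, H, W)

def matching_loss(desc_a, desc_b, temperature: float = 0.07):
    """InfoNCE-style loss assuming pixel i in view A corresponds to
    pixel i in view B (stand-in for real ground-truth matches)."""
    B, D, H, W = desc_a.shape
    a = desc_a.flatten(2).transpose(1, 2)  # (B, HW, D)
    b = desc_b.flatten(2).transpose(1, 2)
    logits = a @ b.transpose(1, 2) / temperature  # (B, HW, HW)
    target = torch.arange(H * W, device=a.device).expand(B, -1)
    return F.cross_entropy(logits.flatten(0, 1), target.flatten())

head = DenseFeatureHead()
fa, fb = torch.randn(2, 2, 256, 16, 16)  # fake backbone features, two views
loss = matching_loss(head(fa), head(fb))
loss.backward()
```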
arXiv Detail & Related papers (2024-06-14T06:46:30Z)
- PoseMatcher: One-shot 6D Object Pose Estimation by Deep Feature Matching [51.142988196855484]
We propose PoseMatcher, an accurate, model-free, one-shot object pose estimator.
We create a new training pipeline for object-to-image matching based on a three-view system.
To enable PoseMatcher to attend to distinct input modalities, an image and a point cloud, we introduce IO-Layer.
arXiv Detail & Related papers (2023-04-03T21:14:59Z)
- DeepMatcher: A Deep Transformer-based Network for Robust and Accurate Local Feature Matching [9.662752427139496]
We propose a deep Transformer-based network built upon our investigation of local feature matching in detector-free methods.
DeepMatcher captures more human-intuitive and simpler-to-match features.
We show that DeepMatcher significantly outperforms the state-of-the-art methods on several benchmarks.
arXiv Detail & Related papers (2023-01-08T07:15:09Z)
- NeuMap: Neural Coordinate Mapping by Auto-Transdecoder for Camera Localization [60.73541222862195]
NeuMap is an end-to-end neural mapping method for camera localization.
It encodes a whole scene into a grid of latent codes, with which a Transformer-based auto-decoder regresses 3D coordinates of query pixels.
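A minimal sketch of the latent-code idea, under assumed shapes and a single vanilla cross-attention layer (not NeuMap's actual architecture): query-pixel features attend to a learned set of scene codes, and a small head regresses 3D coordinates.

```python
import torch
import torch.nn as nn

class LatentCodeDecoder(nn.Module):
    """Illustrative auto-decoder: a set of latent codes represents the
    scene; query-pixel features attend to the codes and regress 3D
    coordinates. Not the actual NeuMap architecture."""
    def __init__(self, num_codes: int = 64, dim: int = 128):
        super().__init__()
        # Learned scene representation, optimized per scene (auto-decoding).
        self.codes = nn.Parameter(torch.randn(num_codes, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 3))  # (x, y, z)

    def forward(self, query_feats: torch.Tensor) -> torch.Tensor:
        # query_feats: (batch, num_pixels, dim) features of query pixels
        codes = self.codes.unsqueeze(0).expand(query_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(query_feats, codes, codes)
        return self.head(attended)  # (batch, num_pixels, 3) scene coords

decoder = LatentCodeDecoder()
pixels = torch.randn(1, 500, 128)   # features of 500 query pixels
print(decoder(pixels).shape)        # torch.Size([1, 500, 3])
```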
arXiv Detail & Related papers (2022-11-21T04:46:22Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We employ transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
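The efficiency idea of attending to a sampled subset rather than to all points can be sketched generically. Below, every point attends only to M randomly sampled anchors, reducing attention cost from O(N^2) to O(NM); the random sampling, shapes, and names are placeholders, and the paper's actual sampling-and-grouping scheme may differ.

```python
import torch
import torch.nn as nn

class SampledCrossAttention(nn.Module):
    """All N points attend to M << N sampled anchor points, cutting
    attention cost from O(N^2) to O(N*M). Generic sketch only."""
    def __init__(self, dim: int = 64, num_heads: int = 4,
                 num_anchors: int = 128):
        super().__init__()
        self.num_anchors = num_anchors
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, N, dim). Sample M anchors (random here;
        # farthest-point sampling would be more typical for point clouds).
        idx = torch.randperm(points.size(1))[: self.num_anchors]
        anchors = points[:, idx]
        out, _ = self.attn(points, anchors, anchors)
        return points + out  # residual connection

layer = SampledCrossAttention()
cloud = torch.randn(2, 4096, 64)
print(layer(cloud).shape)  # torch.Size([2, 4096, 64])
```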
arXiv Detail & Related papers (2022-07-31T21:39:15Z) - Adaptive Assignment for Geometry Aware Local Feature Matching [22.818457285745733]
Detector-free feature matching approaches are currently attracting great attention thanks to their excellent performance.
We introduce AdaMatcher, which accomplishes the feature correlation and co-visible area estimation through an elaborate feature interaction module.
AdaMatcher then performs adaptive assignment on patch-level matching while estimating the scales between images, and finally refines the co-visible matches through scale alignment and sub-pixel regression module.
arXiv Detail & Related papers (2022-07-18T08:22:18Z)
- Learning Tracking Representations via Dual-Branch Fully Transformer Networks [82.21771581817937]
We present a Siamese-like dual-branch network based solely on Transformers for tracking.
We extract a feature vector for each patch based on its matching results with others within an attention window.
The method achieves results better than or comparable to those of the best-performing methods.
arXiv Detail & Related papers (2021-12-05T13:44:33Z)
- DFM: A Performance Baseline for Deep Feature Matching [10.014010310188821]
The proposed method uses a pre-trained VGG architecture as a feature extractor and does not require any additional matching-specific training.
Our algorithm achieves 0.57 and 0.80 overall scores in terms of Mean Matching Accuracy (MMA) at 1-pixel and 2-pixel thresholds, respectively, on the HPatches dataset.
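Because the baseline is essentially off-the-shelf VGG features plus matching, it is easy to sketch with torchvision: extract a deep feature map from each image with a frozen pre-trained VGG and keep mutual nearest neighbours as matches. The layer cut-off and the plain mutual-nearest-neighbour rule are simplifications; DFM's hierarchical coarse-to-fine refinement is omitted.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Pre-trained VGG as a frozen feature extractor (no task-specific
# training). Cutting at layer 26 is an assumption; DFM actually uses
# several layers in a coarse-to-fine manner.
extractor = vgg19(weights="IMAGENET1K_V1").features[:27].eval()

@torch.no_grad()
def mutual_nn_matches(img_a: torch.Tensor, img_b: torch.Tensor):
    """Return indices of mutually-nearest-neighbour feature matches."""
    fa = F.normalize(extractor(img_a).flatten(2).squeeze(0), dim=0)  # (C, Na)
    fb = F.normalize(extractor(img_b).flatten(2).squeeze(0), dim=0)  # (C, Nb)
    sim = fa.t() @ fb                 # cosine similarity (Na, Nb)
    nn_ab = sim.argmax(dim=1)         # best cell in B for each cell in A
    nn_ba = sim.argmax(dim=0)         # best cell in A for each cell in B
    a_idx = torch.arange(sim.size(0))
    mutual = nn_ba[nn_ab] == a_idx    # keep only mutual agreements
    return a_idx[mutual], nn_ab[mutual]

ia, ib = torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)
idx_a, idx_b = mutual_nn_matches(ia, ib)
print(idx_a.shape, idx_b.shape)
```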
arXiv Detail & Related papers (2021-06-14T22:55:06Z)
- DeepI2P: Image-to-Point Cloud Registration via Deep Classification [71.3121124994105]
DeepI2P is a novel approach for cross-modality registration between an image and a point cloud.
Our method estimates the relative rigid transformation between the coordinate frames of the camera and the LiDAR.
We circumvent the difficulty by converting the registration problem into a classification and inverse camera projection optimization problem.
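The classification reformulation can be made concrete with a toy helper: under a candidate pose, each 3D point is labelled by whether it projects inside the image, and registration then amounts to finding the pose whose labels best agree with the network's per-point predictions. The helper below is illustrative only (assumed intrinsics and shapes); DeepI2P's networks and its inverse-projection solver are not shown.

```python
import torch

def frustum_labels(points: torch.Tensor, R: torch.Tensor, t: torch.Tensor,
                   K: torch.Tensor, width: int, height: int) -> torch.Tensor:
    """Label each 3D point 1 if it projects inside the image, else 0.
    Toy version of the classification target in the DeepI2P idea."""
    cam = points @ R.t() + t            # world -> camera coordinates
    in_front = cam[:, 2] > 0            # must lie in front of the camera
    uv = cam @ K.t()                    # project with intrinsics
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)
    inside = ((uv[:, 0] >= 0) & (uv[:, 0] < width) &
              (uv[:, 1] >= 0) & (uv[:, 1] < height))
    return (in_front & inside).float()

K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
pts = torch.randn(1000, 3) * 5.0
labels = frustum_labels(pts, torch.eye(3), torch.zeros(3), K, 640, 480)
print(labels.mean())  # fraction of points inside the frustum
```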
arXiv Detail & Related papers (2021-04-08T04:27:32Z)