FMRT: Learning Accurate Feature Matching with Reconciliatory Transformer
- URL: http://arxiv.org/abs/2310.13605v1
- Date: Fri, 20 Oct 2023 15:54:18 GMT
- Title: FMRT: Learning Accurate Feature Matching with Reconciliatory Transformer
- Authors: Xinyu Zhang, Li Wang, Zhiqiang Jiang, Kun Dai, Tao Xie, Lei Yang,
Wenhao Yu, Yang Shen, Jun Li
- Abstract summary: We propose Feature Matching with Reconciliatory Transformer (FMRT), a detector-free method that reconciles different features with multiple receptive fields adaptively.
FMRT yields extraordinary performance on multiple benchmarks, including pose estimation, visual localization, homography estimation, and image matching.
- Score: 29.95553680263075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Local Feature Matching, an essential component of several computer vision
tasks (e.g., structure from motion and visual localization), has been
effectively addressed by Transformer-based methods. However, these methods only
integrate long-range context information among keypoints with a fixed receptive
field, which constrains the network from reconciling the importance of features
with different receptive fields to realize complete image perception, hence
limiting the matching accuracy. In addition, these methods utilize a
conventional handcrafted encoding approach to integrate the positional
information of keypoints into the visual descriptors, which limits the
capability of the network to extract reliable positional encoding information. In
this study, we propose Feature Matching with Reconciliatory Transformer (FMRT),
a novel Transformer-based detector-free method that reconciles different
features with multiple receptive fields adaptively and utilizes parallel
networks to realize reliable positional encoding. Specifically, FMRT introduces a
dedicated Reconciliatory Transformer (RecFormer) that consists of a Global
Perception Attention Layer (GPAL) to extract visual descriptors with different
receptive fields and integrate global context information at various scales, a
Perception Weight Layer (PWL) to adaptively weigh the importance of the various
receptive fields, and a Local Perception Feed-forward Network (LPFFN) to extract
a deep, aggregated multi-scale local feature representation. Extensive experiments
demonstrate that FMRT yields extraordinary performance on multiple benchmarks,
including pose estimation, visual localization, homography estimation, and
image matching.
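
The three sub-layers named in the abstract fit a standard pre-norm Transformer block. Since no implementation is reproduced on this page, the PyTorch sketch below is only one plausible reading: the pooling-based receptive fields in GPAL, the softmax fusion in PWL, the depth-wise convolutions in LPFFN, and all layer sizes are illustrative assumptions rather than FMRT's published design, and the parallel positional-encoding networks are omitted entirely.

```python
# Minimal sketch of a RecFormer-style block; every detail is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GPAL(nn.Module):
    """Global Perception Attention Layer (assumed form): one attention branch
    per receptive field, with keys/values average-pooled by the scale factor."""
    def __init__(self, dim, num_heads=8, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in scales)

    def forward(self, x):                              # x: (B, N, C) descriptors
        branches = []
        for s, attn in zip(self.scales, self.attns):
            ctx = x
            if s > 1:                                  # pooling widens the receptive field
                ctx = F.avg_pool1d(x.transpose(1, 2), s, s,
                                   ceil_mode=True).transpose(1, 2)
            out, _ = attn(x, ctx, ctx)
            branches.append(out)
        return branches                                # one (B, N, C) tensor per scale


class PWL(nn.Module):
    """Perception Weight Layer (assumed form): scores every branch and fuses
    the receptive fields with softmax weights, i.e. the 'reconciliation'."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, branches):
        stacked = torch.stack(branches)                 # (S, B, N, C)
        logits = self.score(stacked.mean(dim=2))        # (S, B, 1), one score per scale
        weights = torch.softmax(logits, dim=0)
        return (weights.unsqueeze(2) * stacked).sum(0)  # weighted fusion, (B, N, C)


class LPFFN(nn.Module):
    """Local Perception Feed-forward Network (assumed form): depth-wise
    convolutions at two kernel sizes aggregate multi-scale local context."""
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.inp = nn.Conv1d(dim, hidden, 1)
        self.dw3 = nn.Conv1d(hidden, hidden, 3, padding=1, groups=hidden)
        self.dw5 = nn.Conv1d(hidden, hidden, 5, padding=2, groups=hidden)
        self.out = nn.Conv1d(hidden, dim, 1)

    def forward(self, x):                              # x: (B, N, C)
        h = self.inp(x.transpose(1, 2))
        h = F.gelu(self.dw3(h) + self.dw5(h))
        return self.out(h).transpose(1, 2)


class RecFormerBlock(nn.Module):
    """Pre-norm Transformer block combining the three assumed sub-layers."""
    def __init__(self, dim=256):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.gpal, self.pwl, self.lpffn = GPAL(dim), PWL(dim), LPFFN(dim)

    def forward(self, x):                              # x: (B, N, C)
        x = x + self.pwl(self.gpal(self.norm1(x)))
        return x + self.lpffn(self.norm2(x))
```

A block can be smoke-tested with `RecFormerBlock(256)(torch.randn(2, 1024, 256))`. The softmax weights in PWL are what lets each layer trade off small against large receptive fields, which is the reconciliation the abstract refers to.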
Related papers
- RADA: Robust and Accurate Feature Learning with Domain Adaptation [7.905594146253435]
We introduce a multi-level feature aggregation network that incorporates two pivotal components to facilitate the learning of robust and accurate features.
Our method, RADA, achieves excellent results in image matching, camera pose estimation, and visual localization tasks.
arXiv Detail & Related papers (2024-07-22T16:49:58Z)
- CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer [8.962657021133925]
Cross-scale transformer (CT) processes feature representations at different stages without additional computation.
We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales.
We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
arXiv Detail & Related papers (2023-12-14T01:33:18Z)
- Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection [77.50110439560152]
Current object detectors typically have a feature pyramid (FP) module for multi-level feature fusion (MFF).
We propose a novel and efficient context modeling mechanism that can help existing FPs deliver better MFF results.
In particular, we introduce a novel insight that comprehensive contexts can be decomposed and condensed into two types of representations for higher efficiency.
arXiv Detail & Related papers (2022-07-14T01:45:03Z)
- Semantic Labeling of High Resolution Images Using EfficientUNets and Transformers [5.177947445379688]
We propose a new segmentation model that combines convolutional neural networks with deep transformers.
Our results demonstrate that the proposed methodology improves segmentation accuracy compared to state-of-the-art techniques.
arXiv Detail & Related papers (2022-06-20T12:03:54Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- TransVPR: Transformer-based place recognition with multi-level attention aggregation [9.087163485833058]
We introduce a novel holistic place recognition model, TransVPR, based on vision Transformers.
TransVPR achieves state-of-the-art performance on several real-world benchmarks.
arXiv Detail & Related papers (2022-01-06T10:20:24Z)
- LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation [52.63874513999119]
Cross-resolution image alignment is a key problem in multiscale giga photography.
Existing deep homography methods neglect the explicit formulation of correspondences between the inputs, which leads to degraded accuracy in cross-resolution challenges.
We propose a local transformer network embedded within a multiscale structure to explicitly learn correspondences between the multimodal inputs.
arXiv Detail & Related papers (2021-06-08T02:51:45Z)
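
The explicit-correspondence idea in the LocalTrans entry above can be made concrete with windowed attention between the two feature maps. The sketch below is only a toy reading, not LocalTrans's multiscale architecture; the window size, dot-product similarity, and the assumption that both feature maps are already at a common resolution are illustrative.

```python
# Toy windowed cross-attention for explicit local correspondences (assumed form).
import torch
import torch.nn.functional as F


def local_correspondence(fa, fb, window=5):
    """fa, fb: (B, C, H, W). Every location of fa attends over the
    window x window neighbourhood of fb around the same location."""
    B, C, H, W = fa.shape
    # Gather the k*k neighbourhood of fb at every spatial position.
    nb = F.unfold(fb, window, padding=window // 2)      # (B, C*k*k, H*W)
    nb = nb.view(B, C, window * window, H * W)
    q = fa.view(B, C, 1, H * W)
    sim = (q * nb).sum(dim=1) / C ** 0.5                # (B, k*k, H*W)
    attn = sim.softmax(dim=1)                           # correspondence weights
    out = (attn.unsqueeze(1) * nb).sum(dim=2)           # attended features
    return attn.view(B, window, window, H, W), out.view(B, C, H, W)
```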
- Rethinking Global Context in Crowd Counting [70.54184500538338]
A pure transformer is used to extract features with global information from overlapping image patches.
Inspired by classification, we add a context token to the input sequence to facilitate information exchange with the tokens corresponding to image patches.
arXiv Detail & Related papers (2021-05-23T12:44:27Z)
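
The context-token mechanism described above is the same trick as a [CLS] token: one learnable vector is prepended to the patch sequence so that self-attention exchanges information between it and every patch. A minimal sketch, with all layer sizes arbitrary:

```python
# Sketch of a learnable context token prepended to patch tokens (sizes assumed).
import torch
import torch.nn as nn


class ContextTokenEncoder(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        self.ctx = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, patches):                 # patches: (B, N, dim)
        ctx = self.ctx.expand(patches.size(0), -1, -1)
        tokens = self.encoder(torch.cat([ctx, patches], dim=1))
        return tokens[:, 0], tokens[:, 1:]      # global summary, patch tokens
```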
- Multimodality Biomedical Image Registration using Free Point Transformer Networks [0.37501702548174964]
We describe a point-set registration algorithm based on a novel free point transformer (FPT) network.
FPT is constructed with a global feature extractor which accepts unordered source and target point-sets of variable size.
In a multimodal registration task using prostate MR and sparsely acquired ultrasound images, FPT yields comparable or improved results.
arXiv Detail & Related papers (2020-08-05T00:13:04Z)
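
The "unordered point-sets of variable size" requirement in the entry above suggests a PointNet-style extractor: a shared MLP followed by max-pooling is invariant to point order and set size. The sketch pairs it with a per-point displacement head as an assumed stand-in; FPT's actual transformation module is not reproduced here.

```python
# Sketch of an order- and size-invariant global point feature (assumed design).
import torch
import torch.nn as nn


class GlobalPointFeature(nn.Module):
    """Shared MLP + max-pool: invariant to point order and to set size."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, dim), nn.ReLU())

    def forward(self, pts):                     # pts: (B, N, 3), N may vary
        return self.mlp(pts).max(dim=1).values  # (B, dim)


class FreePointHead(nn.Module):
    """Assumed stand-in for FPT's transformation module: regress a per-point
    displacement of the source set from both global features."""
    def __init__(self, dim=256):
        super().__init__()
        self.feat = GlobalPointFeature(dim)
        self.reg = nn.Sequential(nn.Linear(3 + 2 * dim, 128), nn.ReLU(),
                                 nn.Linear(128, 3))

    def forward(self, src, tgt):                # (B, Ns, 3), (B, Nt, 3)
        g = torch.cat([self.feat(src), self.feat(tgt)], dim=-1)
        g = g.unsqueeze(1).expand(-1, src.size(1), -1)
        return src + self.reg(torch.cat([src, g], dim=-1))
```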
- Feature Pyramid Transformer [121.50066435635118]
We propose a fully active feature interaction across both space and scales, called the Feature Pyramid Transformer (FPT).
FPT transforms any feature pyramid into another feature pyramid of the same size but with richer contexts.
We conduct extensive experiments in both instance-level (i.e., object detection and instance segmentation) and pixel-level segmentation tasks.
arXiv Detail & Related papers (2020-07-18T15:16:32Z)
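
A minimal way to enrich one pyramid level with context from another at unchanged resolution, as the FPT entry above describes, is cross-attention with the other level supplying keys and values. The sketch shows only that single cross-scale step; FPT's full interaction design is richer than this.

```python
# Sketch of one cross-scale attention step between pyramid levels (assumed form).
import torch
import torch.nn as nn


class CrossScaleAttention(nn.Module):
    """A pyramid level queries another level; the output keeps the query
    level's resolution but carries cross-scale context."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, fine, coarse):            # (B, C, Hf, Wf), (B, C, Hc, Wc)
        B, C, Hf, Wf = fine.shape
        q = fine.flatten(2).transpose(1, 2)     # (B, Hf*Wf, C) queries
        kv = coarse.flatten(2).transpose(1, 2)  # (B, Hc*Wc, C) keys/values
        out, _ = self.attn(q, kv, kv)
        return fine + out.transpose(1, 2).view(B, C, Hf, Wf)
```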
- Volumetric Transformer Networks [88.85542905676712]
We introduce a learnable module, the volumetric transformer network (VTN).
VTN predicts channel-wise warping fields to reconfigure intermediate CNN features spatially and channel-wise.
Our experiments show that VTN consistently boosts the features' representation power and consequently the networks' accuracy on fine-grained image recognition and instance-level image retrieval.
arXiv Detail & Related papers (2020-07-18T14:00:12Z)
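
The channel-wise warping fields quoted in the VTN entry above map naturally onto predicting one sampling grid per channel and resampling with grid_sample, with channels folded into the batch dimension. The offset head below and its zero initialization (identity warp) are assumptions, not VTN's published design.

```python
# Sketch of channel-wise spatial warping in the spirit of VTN (assumed head).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelwiseWarp(nn.Module):
    """Predicts a separate 2-D offset field per channel and resamples the
    feature map with grid_sample, channels folded into the batch."""
    def __init__(self, channels):
        super().__init__()
        self.offsets = nn.Conv2d(channels, 2 * channels, 3, padding=1)
        nn.init.zeros_(self.offsets.weight)     # start from the identity warp
        nn.init.zeros_(self.offsets.bias)

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        off = self.offsets(x).view(B, C, 2, H, W)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H, device=x.device),
                                torch.linspace(-1, 1, W, device=x.device),
                                indexing="ij")
        base = torch.stack((xs, ys), dim=-1)    # identity grid, (H, W, 2)
        grid = base + off.permute(0, 1, 3, 4, 2)
        warped = F.grid_sample(x.reshape(B * C, 1, H, W),
                               grid.reshape(B * C, H, W, 2),
                               align_corners=True)
        return warped.view(B, C, H, W)
```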