VSFormer: Visual-Spatial Fusion Transformer for Correspondence Pruning
- URL: http://arxiv.org/abs/2312.08774v3
- Date: Thu, 4 Jan 2024 06:01:35 GMT
- Title: VSFormer: Visual-Spatial Fusion Transformer for Correspondence Pruning
- Authors: Tangfei Liao, Xiaoqin Zhang, Li Zhao, Tao Wang, Guobao Xiao
- Abstract summary: Correspondence pruning aims to find correct matches (inliers) from an initial set of putative correspondences.
We propose a Visual-Spatial Fusion Transformer (VSFormer) to identify inliers and recover camera poses accurately.
- Score: 22.0082111649259
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Correspondence pruning aims to find correct matches (inliers) from an initial
set of putative correspondences, which is a fundamental task for many
applications. This process is challenging given the varying inlier ratios
across scenes/image pairs, which stem from significant visual differences.
Moreover, the performance of existing methods is usually limited by a lack of
visual cues (e.g., texture, illumination, structure) in the scenes. In this
paper, we propose a Visual-Spatial Fusion Transformer
(VSFormer) to identify inliers and recover camera poses accurately. First, we
obtain highly abstract visual cues of a scene via cross attention between the
local features of the two-view images. Then, we model these visual cues and
correspondences with a joint visual-spatial fusion module, simultaneously
embedding the visual cues into the correspondences for pruning. Additionally,
to mine the consistency of correspondences, we design a novel module that
combines a KNN-based graph with a transformer, effectively capturing both
local and global contexts. Extensive experiments demonstrate that the proposed
VSFormer outperforms state-of-the-art methods on outdoor and indoor benchmarks.
Our code is provided at the following repository:
https://github.com/sugar-fly/VSFormer.
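The abstract describes two components: cross attention between the local features of the two views to obtain scene-level visual cues, and a visual-spatial fusion step that embeds those cues into the correspondence features. The PyTorch sketch below illustrates that idea under stated assumptions; the module names, feature dimensions, and the mean-pooling of the visual cues are hypothetical choices for illustration, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the authors' code): cross attention between
# local features of the two views, then fusion of the pooled visual cue into
# per-correspondence features. Dimensions and pooling are illustrative.
import torch
import torch.nn as nn


class CrossViewVisualCues(nn.Module):
    """Local features of view A attend to local features of view B."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats_a, feats_b):
        # feats_a: (B, Na, C), feats_b: (B, Nb, C) local image features
        attended, _ = self.cross_attn(query=feats_a, key=feats_b, value=feats_b)
        return self.norm(feats_a + attended)            # (B, Na, C) visual cues


class VisualSpatialFusion(nn.Module):
    """Embed a pooled, scene-level visual cue into each correspondence."""

    def __init__(self, dim=128):
        super().__init__()
        self.corr_embed = nn.Linear(4, dim)             # (x1, y1, x2, y2) -> C
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim))

    def forward(self, correspondences, visual_cues):
        # correspondences: (B, N, 4) putative matches as normalized coordinates
        # visual_cues:     (B, M, C) cross-attended local features
        spatial = self.corr_embed(correspondences)      # (B, N, C)
        scene = visual_cues.mean(dim=1, keepdim=True)   # (B, 1, C) pooled cue
        scene = scene.expand(-1, spatial.size(1), -1)   # broadcast to matches
        return self.fuse(torch.cat([spatial, scene], dim=-1))  # (B, N, C)


if __name__ == "__main__":
    fa, fb = torch.randn(2, 1024, 128), torch.randn(2, 1024, 128)
    corr = torch.randn(2, 2000, 4)
    cues = CrossViewVisualCues()(fa, fb)
    print(VisualSpatialFusion()(corr, cues).shape)      # torch.Size([2, 2000, 128])
```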
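The abstract also mentions a module that combines a KNN-based graph with a transformer to capture local and global contexts among correspondences. Below is a rough, assumption-based sketch of such a block: edge-conv-style aggregation over a feature-space KNN graph for the local context, followed by a standard transformer encoder layer for the global context. The neighborhood size and layer sizes are placeholders, and this is not the released VSFormer code.

```python
# Assumption-based sketch of a KNN-graph + transformer block over
# correspondence features (not the paper's implementation).
import torch
import torch.nn as nn


def knn_indices(x, k):
    # x: (B, N, C); indices of the k nearest neighbours in feature space.
    dist = torch.cdist(x, x)                                   # (B, N, N)
    return dist.topk(k + 1, largest=False).indices[:, :, 1:]   # drop self


class KNNGraphBlock(nn.Module):
    """Edge-conv style aggregation over a feature-space KNN graph (local context)."""

    def __init__(self, dim=128, k=9):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, x):
        B, N, C = x.shape
        idx = knn_indices(x, self.k)                            # (B, N, k)
        neighbors = torch.gather(
            x.unsqueeze(1).expand(B, N, N, C), 2,
            idx.unsqueeze(-1).expand(B, N, self.k, C))          # (B, N, k, C)
        center = x.unsqueeze(2).expand(-1, -1, self.k, -1)      # (B, N, k, C)
        edges = torch.cat([center, neighbors - center], dim=-1)
        return x + self.mlp(edges).max(dim=2).values            # (B, N, C)


class LocalGlobalBlock(nn.Module):
    """KNN-graph aggregation followed by self-attention (global context)."""

    def __init__(self, dim=128, heads=4, k=9):
        super().__init__()
        self.local = KNNGraphBlock(dim, k)
        self.global_ctx = nn.TransformerEncoderLayer(
            dim, heads, dim_feedforward=2 * dim, batch_first=True)

    def forward(self, x):
        return self.global_ctx(self.local(x))                   # (B, N, C)


if __name__ == "__main__":
    feats = torch.randn(2, 2000, 128)
    print(LocalGlobalBlock()(feats).shape)                      # torch.Size([2, 2000, 128])
```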
Related papers
- Breaking the Frame: Visual Place Recognition by Overlap Prediction [53.17564423756082]
We propose a novel visual place recognition approach based on overlap prediction, called VOP.
VOP identifies co-visible image sections by obtaining patch-level embeddings with a Vision Transformer backbone.
Our approach uses a voting mechanism to assess overlap scores for potential database images.
arXiv Detail & Related papers (2024-06-23T20:00:20Z) - CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition [73.51329037954866]
We propose a robust global representation method with cross-image correlation awareness for visual place recognition.
Our method uses the attention mechanism to correlate multiple images within a batch.
Our method outperforms state-of-the-art methods by a large margin with significantly less training time.
arXiv Detail & Related papers (2024-02-29T15:05:11Z) - Optimal Transport Aggregation for Visual Place Recognition [9.192660643226372]
We introduce SALAD, which reformulates NetVLAD's soft-assignment of local features to clusters as an optimal transport problem.
In SALAD, we consider both feature-to-cluster and cluster-to-feature relations, and we also introduce a 'dustbin' cluster designed to selectively discard features deemed non-informative (a minimal sketch of this optimal-transport assignment appears after this list).
Our single-stage method not only surpasses single-stage baselines on public VPR datasets, but also outperforms two-stage methods that add re-ranking at a significantly higher cost.
arXiv Detail & Related papers (2023-11-27T15:46:19Z) - Progressive Visual Prompt Learning with Contrastive Feature Re-formation [15.385630262368661]
We propose a new Progressive Visual Prompt (ProVP) structure to strengthen the interactions among prompts of different layers.
Our ProVP can effectively propagate the image embeddings to deep layers and behaves partially like an instance-adaptive prompt method.
To the best of our knowledge, we are the first to demonstrate that visual prompts in V-L models outperform previous prompt-based methods on downstream tasks.
arXiv Detail & Related papers (2023-04-17T15:54:10Z) - Correlational Image Modeling for Self-Supervised Visual Pre-Training [81.82907503764775]
Correlational Image Modeling is a novel and surprisingly effective approach to self-supervised visual pre-training.
Three key designs enable correlational image modeling as a nontrivial and meaningful self-supervisory task.
arXiv Detail & Related papers (2023-03-22T15:48:23Z) - Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z) - Few-shot Visual Relationship Co-localization [1.4130726713527195]
Given a small bag of images, each containing a common but latent predicate, we are interested in localizing visual subject-object pairs connected via the common predicate in each of the images.
We propose an optimization framework to select a common visual relationship in each image of the bag.
We extensively evaluate our proposed framework on variations of bag sizes obtained from two challenging public datasets.
arXiv Detail & Related papers (2021-08-26T07:19:57Z) - Patch2Pix: Epipolar-Guided Pixel-Level Correspondences [38.38520763114715]
We present Patch2Pix, a novel refinement network that refines match proposals by regressing pixel-level matches from the local regions defined by those proposals.
We show that our refinement network significantly improves the performance of correspondence networks on image matching, homography estimation, and localization tasks.
arXiv Detail & Related papers (2020-12-03T13:44:02Z) - Devil's in the Details: Aligning Visual Clues for Conditional Embedding in Person Re-Identification [94.77172127405846]
We propose two key recognition patterns to better utilize the detail information of pedestrian images.
CACE-Net achieves state-of-the-art performance on three public datasets.
arXiv Detail & Related papers (2020-09-11T06:28:56Z) - Improving Few-shot Learning by Spatially-aware Matching and CrossTransformer [116.46533207849619]
We study the impact of scale and location mismatch in the few-shot learning scenario.
We propose a novel Spatially-aware Matching scheme to effectively perform matching across multiple scales and locations.
arXiv Detail & Related papers (2020-01-06T14:10:20Z)
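As noted in the SALAD entry above, soft assignment of local features to clusters can be cast as entropic optimal transport with an extra "dustbin" column that absorbs non-informative features. The sketch below illustrates that formulation with a log-domain Sinkhorn loop; the uniform marginals, fixed dustbin cost, and regularization strength are assumptions for illustration, not SALAD's actual implementation.

```python
# Assumption-based illustration of feature-to-cluster assignment as entropic
# optimal transport with a dustbin column (not SALAD's code).
import math
import torch


def log_sinkhorn(cost, eps=0.1, iters=50):
    # cost: (N, K+1) feature-to-cluster costs; the last column is the dustbin.
    n, m = cost.shape
    log_mu = torch.full((n,), -math.log(n))       # uniform row marginal
    log_nu = torch.full((m,), -math.log(m))       # uniform column marginal
    s = -cost / eps                               # scores for the Gibbs kernel
    u, v = torch.zeros(n), torch.zeros(m)
    for _ in range(iters):
        u = log_mu - torch.logsumexp(s + v.unsqueeze(0), dim=1)
        v = log_nu - torch.logsumexp(s + u.unsqueeze(1), dim=0)
    return torch.exp(s + u.unsqueeze(1) + v.unsqueeze(0))  # transport plan (N, K+1)


if __name__ == "__main__":
    feats = torch.randn(500, 64)                  # local features
    clusters = torch.randn(8, 64)                 # cluster centres (learned in practice)
    cost = -feats @ clusters.t()                  # higher similarity -> lower cost
    dustbin = torch.zeros(500, 1)                 # dustbin cost (learnable in practice)
    plan = log_sinkhorn(torch.cat([cost, dustbin], dim=1))
    print(plan.shape, float(plan.sum()))          # torch.Size([500, 9]), total mass ~1.0
```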
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.