VSFormer: Visual-Spatial Fusion Transformer for Correspondence Pruning
- URL: http://arxiv.org/abs/2312.08774v3
- Date: Thu, 4 Jan 2024 06:01:35 GMT
- Title: VSFormer: Visual-Spatial Fusion Transformer for Correspondence Pruning
- Authors: Tangfei Liao, Xiaoqin Zhang, Li Zhao, Tao Wang, Guobao Xiao
- Abstract summary: Correspondence pruning aims to find correct matches (inliers) from an initial set of putative correspondences.
We propose a Visual-Spatial Fusion Transformer (VSFormer) to identify inliers and recover camera poses accurately.
- Score: 22.0082111649259
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Correspondence pruning aims to find correct matches (inliers) from an initial
set of putative correspondences, which is a fundamental task for many
applications. The task is challenging given the varying inlier ratios across
scenes/image pairs caused by significant visual differences. However, the
performance of existing methods is usually limited by a lack of visual cues
(e.g., texture, illumination, structure) of the scene. In this paper, we
propose a Visual-Spatial Fusion Transformer
(VSFormer) to identify inliers and recover camera poses accurately. Firstly, we
obtain highly abstract visual cues of a scene with the cross attention between
local features of two-view images. Then, we model these visual cues and
correspondences by a joint visual-spatial fusion module, simultaneously
embedding visual cues into correspondences for pruning. Additionally, to mine
the consistency of correspondences, we also design a novel module that combines
the KNN-based graph and the transformer, effectively capturing both local and
global contexts. Extensive experiments have demonstrated that the proposed
VSFormer outperforms state-of-the-art methods on outdoor and indoor benchmarks.
Our code is provided at the following repository:
https://github.com/sugar-fly/VSFormer.
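To make the two main ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' released implementation: one block computes cross attention between the local features of the two views to obtain scene-level visual cues, and another combines a KNN graph over correspondence features with self-attention to capture local and global context. The tensor shapes, layer sizes, feature-space KNN, mean pooling, and residual fusion are all illustrative assumptions; the official repository above contains the actual architecture.
```python
import torch
import torch.nn as nn


class CrossViewVisualCues(nn.Module):
    """Cross attention between the local features of the two views (illustrative)."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn_ab = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_ba = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a: (B, Na, C) local features of view A; feat_b: (B, Nb, C) of view B.
        cues_a, _ = self.attn_ab(feat_a, feat_b, feat_b)  # A queries B
        cues_b, _ = self.attn_ba(feat_b, feat_a, feat_a)  # B queries A
        # Pool each view's attended features into a compact scene descriptor
        # (mean pooling is an assumption made for this sketch).
        return torch.cat([cues_a.mean(dim=1), cues_b.mean(dim=1)], dim=-1)  # (B, 2C)


class KNNGraphTransformerBlock(nn.Module):
    """Local context via a KNN graph over correspondence features,
    global context via self-attention (illustrative)."""

    def __init__(self, dim: int = 128, k: int = 9, heads: int = 4):
        super().__init__()
        self.k = k
        self.local_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) features of N putative correspondences.
        b, n, c = x.shape
        dist = torch.cdist(x, x)                                     # (B, N, N)
        idx = dist.topk(self.k + 1, largest=False).indices[..., 1:]  # drop self-match
        nbrs = torch.gather(
            x.unsqueeze(1).expand(b, n, n, c), 2,
            idx.unsqueeze(-1).expand(b, n, self.k, c),
        )                                                            # (B, N, k, C)
        # DGCNN-style edge features, aggregated by max over neighbours (assumed).
        edge = torch.cat([x.unsqueeze(2).expand_as(nbrs), nbrs - x.unsqueeze(2)], dim=-1)
        local = self.local_mlp(edge).max(dim=2).values               # (B, N, C)
        glob, _ = self.global_attn(local, local, local)              # global context
        return self.norm(x + local + glob)                           # residual fusion


# Example usage with random features (shapes are assumptions):
# cues = CrossViewVisualCues()(torch.randn(2, 500, 128), torch.randn(2, 600, 128))
# corr = KNNGraphTransformerBlock()(torch.randn(2, 1000, 128))
```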
Related papers
- Robust Scene Change Detection Using Visual Foundation Models and Cross-Attention Mechanisms [27.882122236282054]
We present a novel method for scene change detection that leverages the robust feature extraction capabilities of a visual foundation model, DINOv2.
We evaluate our approach on two benchmark datasets, VL-CMU-CD and PSCD, along with their viewpoint-varied versions.
Our experiments demonstrate significant improvements in F1-score, particularly in scenarios involving geometric changes between image pairs.
arXiv Detail & Related papers (2024-09-25T11:55:27Z) - VisMin: Visual Minimal-Change Understanding [7.226130826257802]
We introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin)
VisMin requires models to predict the correct image-caption match given two images and two captions.
We build an automatic framework using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators.
arXiv Detail & Related papers (2024-07-23T18:10:43Z) - Breaking the Frame: Visual Place Recognition by Overlap Prediction [53.17564423756082]
We propose a novel visual place recognition approach based on overlap prediction, called VOP.
VOP identifies co-visible image sections by obtaining patch-level embeddings with a Vision Transformer backbone.
Our approach uses a voting mechanism to assess overlap scores for potential database images.
arXiv Detail & Related papers (2024-06-23T20:00:20Z) - CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition [73.51329037954866]
We propose a robust global representation method with cross-image correlation awareness for visual place recognition.
Our method uses the attention mechanism to correlate multiple images within a batch.
Our method outperforms state-of-the-art methods by a large margin with significantly less training time.
arXiv Detail & Related papers (2024-02-29T15:05:11Z) - Optimal Transport Aggregation for Visual Place Recognition [9.192660643226372]
We introduce SALAD, which reformulates NetVLAD's soft-assignment of local features to clusters as an optimal transport problem.
In SALAD, we consider both feature-to-cluster and cluster-to-feature relations and we also introduce a 'dustbin' cluster, designed to selectively discard features deemed non-informative.
Our single-stage method not only surpasses single-stage baselines on public VPR datasets, but also surpasses two-stage methods that add re-ranking at a significantly higher cost (a toy sketch of this optimal-transport assignment appears after this list).
arXiv Detail & Related papers (2023-11-27T15:46:19Z) - Correlational Image Modeling for Self-Supervised Visual Pre-Training [81.82907503764775]
Correlational Image Modeling is a novel and surprisingly effective approach to self-supervised visual pre-training.
Three key designs enable correlational image modeling as a nontrivial and meaningful self-supervisory task.
arXiv Detail & Related papers (2023-03-22T15:48:23Z) - Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z) - Patch2Pix: Epipolar-Guided Pixel-Level Correspondences [38.38520763114715]
We present Patch2Pix, a novel refinement network that refines match proposals by regressing pixel-level matches from the local regions defined by those proposals.
We show that our refinement network significantly improves the performance of correspondence networks on image matching, homography estimation, and localization tasks.
arXiv Detail & Related papers (2020-12-03T13:44:02Z) - Devil's in the Details: Aligning Visual Clues for Conditional Embedding in Person Re-Identification [94.77172127405846]
We propose two key recognition patterns to better utilize the detail information of pedestrian images.
CACE-Net achieves state-of-the-art performance on three public datasets.
arXiv Detail & Related papers (2020-09-11T06:28:56Z) - Improving Few-shot Learning by Spatially-aware Matching and CrossTransformer [116.46533207849619]
We study the impact of scale and location mismatch in the few-shot learning scenario.
We propose a novel Spatially-aware Matching scheme to effectively perform matching across multiple scales and locations.
arXiv Detail & Related papers (2020-01-06T14:10:20Z)
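Since the SALAD entry above hinges on recasting feature-to-cluster soft assignment as entropic optimal transport with a "dustbin", here is a toy, self-contained sketch of that idea. It is not SALAD's actual code: the uniform marginals, the constant dustbin score, and the iteration count are assumptions made purely for illustration.
```python
import math

import torch


def sinkhorn_assignment(sim: torch.Tensor, dustbin_score: float = 0.0,
                        n_iters: int = 20, eps: float = 0.05) -> torch.Tensor:
    """sim: (N, K) similarities between N local features and K clusters.
    Returns an (N, K + 1) transport plan whose last column is the dustbin."""
    n, k = sim.shape
    # Append a dustbin column that can absorb non-informative features
    # (a constant score here; the real method's dustbin handling may differ).
    scores = torch.cat([sim, torch.full((n, 1), dustbin_score)], dim=1)
    log_p = scores / eps  # entropic-regularisation temperature
    # Uniform marginals are an assumption made to keep the toy example short.
    log_mu = torch.full((n,), -math.log(n))
    log_nu = torch.full((k + 1,), -math.log(k + 1))
    u, v = torch.zeros(n), torch.zeros(k + 1)
    for _ in range(n_iters):  # log-domain Sinkhorn iterations for numerical stability
        u = log_mu - torch.logsumexp(log_p + v.unsqueeze(0), dim=1)
        v = log_nu - torch.logsumexp(log_p + u.unsqueeze(1), dim=0)
    return torch.exp(log_p + u.unsqueeze(1) + v.unsqueeze(0))


# Example: 512 local descriptors softly assigned to 64 clusters plus a dustbin.
# plan = sinkhorn_assignment(torch.randn(512, 64))   # plan.shape == (512, 65)
```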
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.