Leveraging Semantic Cues from Foundation Vision Models for Enhanced Local Feature Correspondence
- URL: http://arxiv.org/abs/2410.09533v1
- Date: Sat, 12 Oct 2024 13:45:26 GMT
- Title: Leveraging Semantic Cues from Foundation Vision Models for Enhanced Local Feature Correspondence
- Authors: Felipe Cadar, Guilherme Potje, Renato Martins, Cédric Demonceaux, Erickson R. Nascimento
- Abstract summary: This paper presents a new method that uses semantic cues from foundation vision model features to enhance local feature matching.
We present adapted versions of six existing descriptors, with an average performance increase of 29% in camera localization.
- Score: 12.602194710071116
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Visual correspondence is a crucial step in key computer vision tasks, including camera localization, image registration, and structure from motion. Currently, the most effective techniques for matching keypoints are learned sparse or dense matchers, which require pairs of images. These networks develop a good general understanding of features from both images, but they often struggle to match points from different semantic areas. This paper presents a new method that uses semantic cues from foundation vision model features (such as DINOv2) to enhance local feature matching by incorporating semantic reasoning into existing descriptors. The learned descriptors therefore do not require image pairs at inference time, allowing feature caching and fast matching via similarity search, unlike learned matchers. We present adapted versions of six existing descriptors, with an average performance increase of 29% in camera localization and accuracy comparable to existing matchers such as LightGlue and LoFTR on two existing benchmarks. Both code and trained models are available at https://www.verlab.dcc.ufmg.br/descriptors/reasoning_accv24
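The core idea lends itself to a short sketch: sample a foundation-model feature (here DINOv2, which the abstract names) at each keypoint, fuse it with the base local descriptor, and match the cached descriptors by mutual nearest-neighbor similarity search. This is a minimal sketch, not the paper's released code: the concatenation-based fusion is a stand-in for the learned adaptation, and the function names and shapes are illustrative assumptions.
```python
# A minimal sketch, assuming DINOv2 via torch.hub; `semantic_descriptors`
# and `mutual_nn_match` are illustrative names, not the paper's API.
import torch
import torch.nn.functional as F

dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

def semantic_descriptors(image, keypoints, base_desc):
    """image: (1, 3, H, W); keypoints: (K, 2) float pixel coords (x, y);
    base_desc: (K, D) descriptors from any existing extractor."""
    with torch.no_grad():
        feats = dino.forward_features(image)["x_norm_patchtokens"]  # (1, N, C)
    H, W = image.shape[-2:]
    h, w = H // 14, W // 14                      # DINOv2 uses 14x14 patches
    fmap = feats.reshape(1, h, w, -1).permute(0, 3, 1, 2)  # (1, C, h, w)
    # Bilinearly sample the semantic feature under each keypoint.
    grid = torch.empty_like(keypoints)
    grid[:, 0] = keypoints[:, 0] / (W - 1) * 2 - 1         # x -> [-1, 1]
    grid[:, 1] = keypoints[:, 1] / (H - 1) * 2 - 1         # y -> [-1, 1]
    sem = F.grid_sample(fmap, grid.view(1, -1, 1, 2), align_corners=True)
    sem = sem.squeeze(-1).squeeze(0).t()                   # (K, C)
    # Crude fusion by concatenation; the paper instead learns an adapted
    # descriptor, which this sketch does not reproduce.
    return F.normalize(torch.cat([F.normalize(base_desc, dim=1),
                                  F.normalize(sem, dim=1)], dim=1), dim=1)

def mutual_nn_match(desc_a, desc_b):
    """Cached descriptors allow plain similarity search, no pairwise network."""
    sim = desc_a @ desc_b.t()
    ab, ba = sim.argmax(1), sim.argmax(0)
    idx = torch.arange(len(desc_a))
    keep = ba[ab] == idx                         # mutual nearest neighbors
    return torch.stack([idx[keep], ab[keep]], dim=1)
```
Because the fused descriptors depend on one image only, they can be computed once and cached; matching then reduces to the similarity search above.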
Related papers
- Improving the matching of deformable objects by learning to detect keypoints [6.4587163310833855]
We propose a novel learned keypoint detection method to increase the number of correct matches for the task of non-rigid image correspondence.
We train an end-to-end convolutional neural network (CNN) to find keypoint locations that are more appropriate to the considered descriptor.
Experiments demonstrate that numerous descriptors achieve higher Mean Matching Accuracy when paired with our detection method.
We also apply our method to the complex real-world task of object retrieval, where our detector performs on par with the best keypoint detectors currently available.
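Mean Matching Accuracy, the metric cited above, is easy to sketch. This is a generic formulation under the usual definition (fraction of matches within a pixel threshold of the ground-truth correspondence, averaged over image pairs), not the paper's exact evaluation code; `gt_warp` is an assumed callable.
```python
# Illustrative sketch of Mean Matching Accuracy (MMA), not the authors' code.
import numpy as np

def matching_accuracy(kpts_a, kpts_b, matches, gt_warp, thresh=3.0):
    """kpts_*: (N, 2) pixel coords; matches: (M, 2) index pairs;
    gt_warp: maps points from image A into image B (ground truth)."""
    warped = gt_warp(kpts_a[matches[:, 0]])          # (M, 2)
    err = np.linalg.norm(warped - kpts_b[matches[:, 1]], axis=1)
    return float((err < thresh).mean()) if len(err) else 0.0

def mean_matching_accuracy(pairs, thresh=3.0):
    """pairs: iterable of (kpts_a, kpts_b, matches, gt_warp) tuples."""
    return float(np.mean([matching_accuracy(*p, thresh=thresh) for p in pairs]))
```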
arXiv Detail & Related papers (2023-09-01T13:02:19Z)
- Learning to Detect Good Keypoints to Match Non-Rigid Objects in RGB Images [7.428474910083337]
We present a novel learned keypoint detection method designed to maximize the number of correct matches for the task of non-rigid image correspondence.
Our training framework uses true correspondences, obtained by matching annotated image pairs with a predefined descriptor extractor, as ground truth to train a convolutional neural network (CNN).
Experiments show that our method outperforms the state-of-the-art keypoint detector on real images of non-rigid objects by 20 percentage points in Mean Matching Accuracy.
arXiv Detail & Related papers (2022-12-13T11:59:09Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features that are visible to the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
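A toy version of this pretext task, under the assumption that it reduces to predicting a patch-grid offset from a reference/query feature pair; the MLP head below is a placeholder, not the paper's transformer.
```python
# Toy sketch of a relative-location pretext task; sizes are placeholders.
import torch
import torch.nn as nn

class RelativeLocationHead(nn.Module):
    def __init__(self, dim=384):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 2))   # predict (dy, dx)

    def forward(self, ref_feat, query_feat):
        return self.mlp(torch.cat([ref_feat, query_feat], dim=-1))

# Training signal: the true patch-grid offset between query and reference;
# masking a subset of reference patches would control task difficulty.
head = RelativeLocationHead()
loss = nn.functional.mse_loss(head(torch.randn(8, 384), torch.randn(8, 384)),
                              torch.randint(-7, 8, (8, 2)).float())
```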
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- ZippyPoint: Fast Interest Point Detection, Description, and Matching through Mixed Precision Discretization [71.91942002659795]
We investigate and adapt network quantization techniques to accelerate inference and enable use on compute-limited platforms.
ZippyPoint, our efficient quantized network with binary descriptors, improves network runtime, descriptor matching speed, and 3D model size.
These improvements come at the cost of minor performance degradation on the tasks of homography estimation, visual localization, and map-free visual relocalization.
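The speed argument for binary descriptors can be illustrated with a minimal sketch: once descriptors are binarized, matching reduces to XOR-plus-popcount Hamming distances. The sign quantization below is a crude stand-in for ZippyPoint's learned mixed-precision discretization.
```python
# Hedged sketch: binary descriptors make matching a Hamming-distance search.
import numpy as np

def binarize(desc):
    """Sign-quantize float descriptors to packed bits (a crude stand-in
    for a learned discretization)."""
    return np.packbits(desc > 0, axis=1)            # (N, D/8) uint8

# Popcount lookup table for all 256 byte values.
POPCOUNT = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1).sum(1)

def hamming_match(bits_a, bits_b):
    """Brute-force Hamming nearest neighbor via XOR + popcount table."""
    dists = POPCOUNT[bits_a[:, None, :] ^ bits_b[None, :, :]].sum(-1)
    return dists.argmin(axis=1)                     # index of NN in b

matches = hamming_match(binarize(np.random.randn(500, 256)),
                        binarize(np.random.randn(600, 256)))
```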
arXiv Detail & Related papers (2022-03-07T18:59:03Z)
- Guide Local Feature Matching by Overlap Estimation [9.387323456222823]
We introduce OETR, a novel Overlap Estimation method conditioned on image pairs with a TRansformer.
OETR performs overlap estimation in a two-step process of feature correlation and then overlap regression.
Experiments show that OETR can boost state-of-the-art local feature matching performance substantially.
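A minimal sketch of the stated two-step process, assuming patch-level features for both images; the correlation-then-regression head below is illustrative and far simpler than OETR's actual transformer architecture.
```python
# Toy two-step sketch: feature correlation, then overlap-box regression.
import torch
import torch.nn as nn

class OverlapRegressor(nn.Module):
    def __init__(self, n_patches=64):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(n_patches, 128), nn.ReLU(),
                                  nn.Linear(128, 4), nn.Sigmoid())

    def forward(self, feat_a, feat_b):
        # Step 1: feature correlation between all patch pairs. (B, Na, Nb)
        corr = torch.einsum("bnc,bmc->bnm", feat_a, feat_b)
        # Step 2: regress overlap box (x1, y1, x2, y2) in [0, 1] for image A.
        return self.head(corr.mean(dim=1))

box_a = OverlapRegressor()(torch.randn(2, 64, 256), torch.randn(2, 64, 256))
```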
arXiv Detail & Related papers (2022-02-18T07:11:36Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
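Image-level contrastive learning of this kind is commonly instantiated as an InfoNCE loss over two augmented views; the sketch below shows that single level only and omits the paper's multi-level scheme.
```python
# Hedged sketch of an image-level InfoNCE contrastive loss: two views of the
# same image are pulled together; other images in the batch act as negatives.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.07):
    """z1, z2: (B, D) features of two views of the same B images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))        # positives on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```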
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
- Learning Dynamic Alignment via Meta-filter for Few-shot Learning [94.41887992982986]
Few-shot learning aims to recognise new classes by adapting the learned knowledge with extremely limited few-shot (support) examples.
We learn a dynamic alignment, which can effectively highlight both query regions and channels according to different local support information.
The resulting framework establishes a new state of the art on major few-shot visual recognition benchmarks.
arXiv Detail & Related papers (2021-03-25T03:29:33Z)
- Learning to Focus: Cascaded Feature Matching Network for Few-shot Image Recognition [38.49419948988415]
Deep networks can learn to accurately recognize objects of a category by training on a large number of images.
A meta-learning challenge, known as the low-shot image recognition task, arises when only a few annotated images are available for learning a recognition model for a category.
Our method, called Cascaded Feature Matching Network (CFMN), is proposed to solve this problem.
Experiments for few-shot learning on two standard datasets, miniImageNet and Omniglot, have confirmed the effectiveness of our method.
arXiv Detail & Related papers (2021-01-13T11:37:28Z)
- Inter-Image Communication for Weakly Supervised Localization [77.2171924626778]
Weakly supervised localization aims at finding target object regions using only image-level supervision.
We propose to leverage pixel-level similarities across different objects for learning more accurate object locations.
Our method achieves a Top-1 localization error rate of 45.17% on the ILSVRC validation set.
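The pixel-level similarity signal can be sketched as a cosine-affinity matrix between feature maps of two images of the same class; this is a simplification of the paper's objective, with shapes chosen purely for illustration.
```python
# Sketch of the core signal: cosine similarity between pixel-level feature
# vectors of two images containing the same class.
import torch
import torch.nn.functional as F

def cross_image_similarity(fmap1, fmap2):
    """fmap*: (C, H, W) feature maps from two images of the same class."""
    f1 = F.normalize(fmap1.flatten(1), dim=0)     # (C, H*W), unit per pixel
    f2 = F.normalize(fmap2.flatten(1), dim=0)
    return f1.t() @ f2                            # (H*W, H*W) pixel affinities

sim = cross_image_similarity(torch.randn(256, 14, 14), torch.randn(256, 14, 14))
```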
arXiv Detail & Related papers (2020-08-12T04:14:11Z)
- Learning to Compose Hypercolumns for Visual Correspondence [57.93635236871264]
We introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match.
The proposed method, dubbed Dynamic Hyperpixel Flow, learns to compose hypercolumn features on the fly by selecting a small number of relevant layers from a deep convolutional neural network.
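A hypercolumn itself is easy to sketch: upsample feature maps from several backbone layers to a common resolution and stack them per pixel. Dynamic Hyperpixel Flow learns which layers to select; the fixed ResNet-50 layer choice below is an assumption for illustration.
```python
# Minimal fixed-layer hypercolumn; the paper selects layers dynamically.
import torch
import torch.nn.functional as F
import torchvision

backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
layers = {"layer1": None, "layer2": None, "layer3": None}

def hook(name):
    def fn(_, __, out): layers[name] = out        # capture intermediate maps
    return fn

for name in layers:
    getattr(backbone, name).register_forward_hook(hook(name))

def hypercolumn(image):                           # image: (1, 3, H, W)
    backbone.eval()
    with torch.no_grad():
        backbone(image)
    size = layers["layer1"].shape[-2:]            # common target resolution
    maps = [F.interpolate(layers[n], size=size, mode="bilinear",
                          align_corners=False) for n in layers]
    return torch.cat(maps, dim=1)                 # (1, C1+C2+C3, h, w)
```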
arXiv Detail & Related papers (2020-07-21T04:03:22Z)
- Augmented Bi-path Network for Few-shot Learning [16.353228724916505]
We propose Augmented Bi-path Network (ABNet) for learning to compare both global and local features at multiple scales.
Specifically, salient patches are extracted and embedded as local features for every image; the model then learns to augment these features for better robustness.
arXiv Detail & Related papers (2020-07-15T11:13:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.