MVC-VPR: Mutual Learning of Viewpoint Classification and Visual Place Recognition
- URL: http://arxiv.org/abs/2412.09199v2
- Date: Fri, 13 Dec 2024 16:44:42 GMT
- Title: MVC-VPR: Mutual Learning of Viewpoint Classification and Visual Place Recognition
- Authors: Qiwen Gu, Xufei Wang, Fenglin Zhang, Junqiao Zhao, Siyue Tao, Chen Ye, Tiantian Feng, Changjun Jiang
- Abstract summary: We introduce the mutual learning of viewpoint self-classification and Visual Place Recognition.
The dataset is partitioned in an unsupervised manner while simultaneously training a descriptor extractor for place recognition.
Our method even surpasses state-of-the-art (SOTA) methods that partition datasets using ground truth labels.
- Score: 13.681827205077727
- License:
- Abstract: Visual Place Recognition (VPR) aims to robustly identify locations by leveraging image retrieval based on descriptors encoded from environmental images. However, drastic appearance changes between images captured from different viewpoints at the same location produce incoherent supervision signals for descriptor learning, which severely hinders VPR performance. Previous work classifies images by manually defined rules or ground-truth viewpoint labels and then trains descriptors on the classification results. However, not all datasets have ground-truth viewpoint labels, and manually defined rules may be suboptimal, leading to degraded descriptor performance. To address these challenges, we introduce the mutual learning of viewpoint self-classification and VPR. Starting from a coarse classification based on geographical coordinates, we progress to a finer classification of viewpoints using simple clustering techniques. The dataset is partitioned in an unsupervised manner while a descriptor extractor for place recognition is trained simultaneously. Experimental results show that this approach almost perfectly partitions the dataset by viewpoint, so the two tasks mutually reinforce each other. Our method even surpasses state-of-the-art (SOTA) methods that partition datasets using ground-truth labels.
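The partition step is simple enough to sketch. Below is a minimal illustration of the coarse-to-fine idea, coarse grouping by quantized coordinates followed by k-means over descriptors within each group; all function names, the cell size, and the cluster count are hypothetical choices, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def partition_viewpoints(descriptors, coords, cell_size=25.0, n_viewpoints=4):
    """Coarse-to-fine partition: group images by geographic cell,
    then cluster descriptors within each cell into viewpoint classes."""
    # Coarse classes: quantize UTM-style coordinates into square cells.
    cells = np.floor(coords / cell_size).astype(int)
    _, coarse_ids = np.unique(cells, axis=0, return_inverse=True)

    labels = np.full(len(descriptors), -1, dtype=int)
    for cid in np.unique(coarse_ids):
        idx = np.where(coarse_ids == cid)[0]
        k = min(n_viewpoints, len(idx))
        # Fine classes: simple k-means on descriptors inside the cell,
        # separating images of the same place seen from different viewpoints.
        fine = KMeans(n_clusters=k, n_init=10).fit_predict(descriptors[idx])
        labels[idx] = cid * n_viewpoints + fine
    return labels

# Toy usage: 100 images, 128-D descriptors, 2-D coordinates.
rng = np.random.default_rng(0)
labels = partition_viewpoints(rng.normal(size=(100, 128)),
                              rng.uniform(0, 100, size=(100, 2)))
```

In the mutual-learning loop, the resulting pseudo-labels would supervise the next round of descriptor training, and the refreshed descriptors would in turn sharpen the clustering.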
Related papers
- Context-Based Visual-Language Place Recognition [4.737519767218666]
A popular approach to vision-based place recognition relies on low-level visual features.
We introduce a novel VPR approach that remains robust to scene changes and does not require additional training.
Our method constructs semantic image descriptors by extracting pixel-level embeddings using a zero-shot, language-driven semantic segmentation model.
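Assuming a segmentation model that yields a per-pixel embedding map and a class map, one way to pool them into a place descriptor looks like the following sketch (shapes and the per-class average pooling are assumptions, not the paper's code):

```python
import numpy as np

def semantic_descriptor(pixel_embed, class_map, n_classes):
    """Pool pixel-level embeddings per semantic class and concatenate.
    pixel_embed: (H, W, D) embeddings; class_map: (H, W) class ids."""
    H, W, D = pixel_embed.shape
    flat_e = pixel_embed.reshape(-1, D)
    flat_c = class_map.reshape(-1)
    parts = []
    for c in range(n_classes):
        mask = flat_c == c
        # Average embedding of the class, zeros if the class is absent.
        parts.append(flat_e[mask].mean(axis=0) if mask.any() else np.zeros(D))
    desc = np.concatenate(parts)
    return desc / (np.linalg.norm(desc) + 1e-8)
```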
arXiv Detail & Related papers (2024-10-25T06:59:11Z)
- Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models [21.17975741743583]
It has recently been discovered that using a pre-trained vision-language model (VLM), e.g., CLIP, to align a whole query image with several finer text descriptions can significantly enhance zero-shot performance.
In this paper, we empirically find that the finer descriptions tend to align more effectively with local areas of the query image rather than the whole image.
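That observation suggests scoring each fine-grained description against local patch embeddings instead of the single global image embedding. A hedged sketch with CLIP-style L2-normalized features (the max-then-mean pooling is an assumption):

```python
import numpy as np

def cross_alignment_score(patch_feats, text_feats):
    """patch_feats: (P, D) local image embeddings; text_feats: (T, D)
    embeddings of finer text descriptions, all L2-normalized."""
    sims = patch_feats @ text_feats.T          # (P, T) cosine similarities
    # Each description aligns with its best-matching local region ...
    best_local = sims.max(axis=0)              # (T,)
    # ... and the class score averages over all descriptions.
    return best_local.mean()
```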
arXiv Detail & Related papers (2024-06-05T04:08:41Z)
- Data-efficient Large Scale Place Recognition with Graded Similarity Supervision [10.117451511942267]
Visual place recognition (VPR) is a fundamental task of computer vision for visual localization.
Existing methods are trained using image pairs labeled as either depicting the same place or not.
We deploy an automatic re-annotation strategy to re-label VPR datasets.
We propose a new Generalized Contrastive Loss (GCL) that uses graded similarity labels for training contrastive networks.
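One plausible form of such a loss weights the usual attract and repel terms of a contrastive loss by a continuous similarity label in [0, 1]; this is a sketch of the idea rather than the paper's exact GCL:

```python
import torch

def graded_contrastive_loss(d, sim, margin=1.0):
    """d: (N,) descriptor distances for image pairs;
    sim: (N,) graded similarity labels in [0, 1] from re-annotation."""
    attract = sim * d.pow(2)                                   # pull similar pairs together
    repel = (1 - sim) * torch.clamp(margin - d, min=0).pow(2)  # push dissimilar ones apart
    return (attract + repel).mean()
```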
arXiv Detail & Related papers (2023-03-21T10:56:57Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features that are visible to the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
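The pretext task itself is easy to emulate. A toy construction: cut the image into a grid, pick one patch as the query, mask a random subset of the remaining reference patches, and use the query's grid position as the prediction target (grid size and masking rate are illustrative assumptions):

```python
import numpy as np

def make_location_task(image, grid=4, mask_ratio=0.5, rng=None):
    """image: (H, W, C) array with H, W divisible by `grid`.
    Returns masked reference patches, the query patch, and its location label."""
    rng = rng or np.random.default_rng()
    H, W, C = image.shape
    ph, pw = H // grid, W // grid
    patches = image.reshape(grid, ph, grid, pw, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(grid * grid, ph, pw, C)
    label = rng.integers(grid * grid)          # query position = prediction target
    query = patches[label]
    refs = patches.copy()
    refs[label] = 0                            # hide the query from the references
    # Mask a random subset of reference patches to control task difficulty.
    drop = rng.random(grid * grid) < mask_ratio
    refs[drop] = 0
    return refs, query, label
```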
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- LEAD: Self-Supervised Landmark Estimation by Aligning Distributions of Feature Similarity [49.84167231111667]
Existing works in self-supervised landmark detection are based on learning dense (pixel-level) feature representations from an image.
We introduce an approach to enhance the learning of dense equivariant representations in a self-supervised fashion.
We show that having such a prior in the feature extractor helps in landmark detection, even with a drastically limited number of annotations.
arXiv Detail & Related papers (2022-04-06T17:48:18Z)
- SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption [72.35532598131176]
We propose SCARF, a technique for contrastive learning, where views are formed by corrupting a random subset of features.
We show that SCARF complements existing strategies and outperforms alternatives like autoencoders.
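The view-construction step can be sketched directly: for each row of a tabular batch, replace a random subset of features with values drawn from those features' empirical marginals (the corruption fraction is an assumption):

```python
import numpy as np

def scarf_corrupt(X, corrupt_frac=0.6, rng=None):
    """X: (N, F) tabular batch. Returns a corrupted view in which each row
    has ~corrupt_frac of its features resampled from the column marginals."""
    rng = rng or np.random.default_rng()
    N, F = X.shape
    mask = rng.random((N, F)) < corrupt_frac
    # For every (row, col) being corrupted, draw a replacement value from
    # the same column at a random row, i.e., from the empirical marginal.
    donor_rows = rng.integers(N, size=(N, F))
    return np.where(mask, X[donor_rows, np.arange(F)], X)

# Two corrupted views of the same batch form the positive pair
# for an InfoNCE-style contrastive objective.
```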
arXiv Detail & Related papers (2021-06-29T08:08:33Z)
- Zero-Shot Recognition through Image-Guided Semantic Classification [9.291055558504588]
We present a new embedding-based framework for zero-shot learning (ZSL).
Motivated by the binary relevance method for multi-label classification, we propose to inversely learn the mapping between an image and a semantic classifier.
The resulting Image-Guided Semantic Classification (IGSC) framework is conceptually simple and can be realized with a slight enhancement of an existing deep architecture for classification.
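The inversion is the key idea: rather than mapping labels into image space, the network predicts a classifier from the image and applies it to class semantic vectors. A hedged PyTorch sketch, with architecture and dimensions as assumptions:

```python
import torch
import torch.nn as nn

class IGSCHead(nn.Module):
    """Predicts a label classifier from an image feature, then scores
    each class by applying that classifier to its semantic embedding."""
    def __init__(self, img_dim=2048, sem_dim=300):
        super().__init__()
        self.to_classifier = nn.Linear(img_dim, sem_dim)  # image -> classifier weights

    def forward(self, img_feat, class_embeds):
        # img_feat: (B, img_dim); class_embeds: (C, sem_dim), e.g. word vectors.
        w = self.to_classifier(img_feat)                  # (B, sem_dim)
        return w @ class_embeds.T                         # (B, C) class scores

scores = IGSCHead()(torch.randn(2, 2048), torch.randn(10, 300))
```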
arXiv Detail & Related papers (2020-07-23T06:22:40Z)
- Distilling Localization for Self-Supervised Representation Learning [82.79808902674282]
Contrastive learning has revolutionized unsupervised representation learning.
Current contrastive models are ineffective at localizing the foreground object.
We propose a data-driven approach for learning invariance to backgrounds.
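One concrete way to instill background invariance in this spirit is to composite the same foreground onto different backgrounds and treat the results as a positive pair; the mask source (e.g., a saliency model) and this compositing scheme are assumptions, not necessarily the paper's method:

```python
import numpy as np

def swap_background(image, mask, background):
    """image, background: (H, W, C) arrays; mask: (H, W) in {0, 1},
    1 where the foreground object is (e.g., from a saliency model)."""
    m = mask[..., None].astype(image.dtype)
    return m * image + (1 - m) * background

# Two composites of one foreground over different random backgrounds
# give a positive pair whose only shared content is the object itself.
```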
arXiv Detail & Related papers (2020-04-14T16:29:42Z)
- Self-Supervised Tuning for Few-Shot Segmentation [82.32143982269892]
Few-shot segmentation aims at assigning a category label to each image pixel with few annotated samples.
Existing meta-learning methods tend to fail to generate category-specific discriminative descriptors when the visual features extracted from support images are marginalized in the embedding space.
This paper presents an adaptive tuning framework in which the distribution of latent features across different episodes is dynamically adjusted based on a self-segmentation scheme.
arXiv Detail & Related papers (2020-04-12T03:53:53Z)
- High-Order Information Matters: Learning Relation and Topology for Occluded Person Re-Identification [84.43394420267794]
We propose a novel framework by learning high-order relation and topology information for discriminative features and robust alignment.
Our framework significantly outperforms the state of the art by 6.5% mAP on the Occluded-Duke dataset.
arXiv Detail & Related papers (2020-03-18T12:18:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.