Pair-VPR: Place-Aware Pre-training and Contrastive Pair Classification for Visual Place Recognition with Vision Transformers
- URL: http://arxiv.org/abs/2410.06614v2
- Date: Sun, 02 Mar 2025 08:59:29 GMT
- Title: Pair-VPR: Place-Aware Pre-training and Contrastive Pair Classification for Visual Place Recognition with Vision Transformers
- Authors: Stephen Hausler, Peyman Moghadam
- Abstract summary: We propose a novel joint training method for Visual Place Recognition (VPR). The pair classifier can predict whether a given pair of images is from the same place or not. By re-using the Masked Image Modelling encoder and decoder weights in the second stage of training, Pair-VPR achieves state-of-the-art VPR performance.
- Score: 6.890658812702241
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work we propose a novel joint training method for Visual Place Recognition (VPR), which simultaneously learns a global descriptor and a pair classifier for re-ranking. The pair classifier predicts whether a given pair of images is from the same place or not. The network comprises only Vision Transformer components for both the encoder and the pair classifier, and both components are trained using their respective class tokens. In existing VPR methods, the network is typically initialized with pre-trained weights from a generic image dataset such as ImageNet. In this work we propose an alternative pre-training strategy that uses Siamese Masked Image Modelling as the pre-training task. We propose a place-aware image sampling procedure over a collection of large VPR datasets for pre-training our model, to learn visual features tuned specifically for VPR. By re-using the Masked Image Modelling encoder and decoder weights in the second stage of training, Pair-VPR achieves state-of-the-art VPR performance across five benchmark datasets with a ViT-B encoder, along with further improvements in localization recall with larger encoders. The Pair-VPR website is: https://csiro-robotics.github.io/Pair-VPR.
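The two-stage design described in the abstract (a ViT encoder whose class token provides the global descriptor, plus a transformer pair classifier trained on its own class token) can be illustrated with a minimal PyTorch sketch. All module names, dimensions, and the patch-token interface below are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class PairVPRSketch(nn.Module):
    """Hypothetical sketch: a ViT-style encoder whose class token serves as a
    global descriptor, plus a small transformer 'pair classifier' that consumes
    the patch tokens of two images and predicts same-place vs. not."""

    def __init__(self, dim=768, depth=2, num_patches=196):
        super().__init__()
        # Stand-in for a ViT-B encoder (in the paper this would be a full ViT
        # initialised from the masked-image-modelling pre-training stage).
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
            num_layers=depth,
        )
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pair_cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Pair classifier: another transformer block over tokens from both images.
        self.pair_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
            num_layers=depth,
        )
        self.pair_head = nn.Linear(dim, 1)  # same-place logit

    def encode(self, patch_tokens):
        # patch_tokens: (B, N, dim) pre-tokenised image patches (assumption).
        B = patch_tokens.shape[0]
        tokens = torch.cat([self.cls_token.expand(B, -1, -1), patch_tokens], dim=1)
        tokens = self.encoder(tokens)
        return tokens[:, 0], tokens[:, 1:]  # global descriptor, patch tokens

    def classify_pair(self, tokens_a, tokens_b):
        B = tokens_a.shape[0]
        pair_tokens = torch.cat(
            [self.pair_cls_token.expand(B, -1, -1), tokens_a, tokens_b], dim=1)
        pair_tokens = self.pair_decoder(pair_tokens)
        return self.pair_head(pair_tokens[:, 0]).squeeze(-1)

# Toy usage: two batches of "patch tokens" standing in for two images.
model = PairVPRSketch()
a, b = torch.randn(2, 196, 768), torch.randn(2, 196, 768)
desc_a, tok_a = model.encode(a)
desc_b, tok_b = model.encode(b)
logits = model.classify_pair(tok_a, tok_b)  # re-ranking score per pair
```

In retrieval, the global descriptors would shortlist candidates and the pair logits would re-rank them, mirroring the two roles described in the abstract.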
Related papers
- VDNA-PR: Using General Dataset Representations for Robust Sequential Visual Place Recognition [17.393105901701098]
This paper adapts a general dataset representation technique to produce robust Visual Place Recognition (VPR) descriptors.
Our experiments show that our representation is more robust to serious domain shifts away from the training data distribution than current solutions.
arXiv Detail & Related papers (2024-03-14T01:30:28Z)
- CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition [73.51329037954866]
We propose a robust global representation method with cross-image correlation awareness for visual place recognition.
Our method uses the attention mechanism to correlate multiple images within a batch.
Our method outperforms state-of-the-art methods by a large margin with significantly less training time.
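A minimal sketch of the cross-image idea described above, assuming the batch of per-image descriptors is treated as a sequence over which standard self-attention correlates images; dimensions and names are illustrative, not the CricaVPR implementation:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: self-attention applied across the descriptors of a batch,
# so each image's representation is refined using the other images in the batch.
cross_image_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

descriptors = torch.randn(1, 16, 512)   # (1, batch_of_16_images, feature_dim), toy data
refined, _ = cross_image_attn(descriptors, descriptors, descriptors)
refined = refined.squeeze(0)            # one correlation-aware descriptor per image
```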
arXiv Detail & Related papers (2024-02-29T15:05:11Z)
- Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment [61.769441954135246]
We introduce a method to train vision-language models for remote-sensing images without using any textual annotations.
Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language.
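The intermediary idea described above can be sketched as a contrastive alignment between an aerial-image encoder and frozen embeddings of co-located ground photos; the generic InfoNCE loss and the toy encoder below are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of the "ground as intermediary" idea: align an aerial-image
# encoder with embeddings of co-located ground photos produced by a frozen
# CLIP-style image encoder, so the aerial encoder inherits its language alignment.
aerial_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 512))  # toy stand-in

def infonce(aerial_emb, ground_emb, temperature=0.07):
    aerial_emb = F.normalize(aerial_emb, dim=-1)
    ground_emb = F.normalize(ground_emb, dim=-1)
    logits = aerial_emb @ ground_emb.t() / temperature
    targets = torch.arange(logits.shape[0])  # matched pairs sit on the diagonal
    return F.cross_entropy(logits, targets)

aerial = torch.randn(8, 3, 64, 64)      # co-located aerial/ground pairs (toy data)
ground_clip_emb = torch.randn(8, 512)   # frozen embeddings of the ground photos
loss = infonce(aerial_encoder(aerial), ground_clip_emb)
```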
arXiv Detail & Related papers (2023-12-12T03:39:07Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (the previous state of the art) on novel classes across 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning [47.96387857237473]
We devise a network which can perform attention over activations obtained while processing other training samples.
Our memory models the distribution of past keys and values through the definition of prototype vectors.
We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points, both when training with cross-entropy only and when fine-tuning with self-critical sequence training.
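A rough sketch of the prototype-memory idea, assuming a bank of learnable prototype keys and values summarising past activations that current-sample queries attend over; the residual combination and sizes are illustrative assumptions, not the paper's network:

```python
import torch
import torch.nn as nn

class PrototypeMemory(nn.Module):
    """Hypothetical sketch: learnable prototype keys/values standing in for the
    distribution of keys/values seen on past training samples; current-sample
    queries attend over these prototypes."""

    def __init__(self, dim=512, num_prototypes=64, num_heads=8):
        super().__init__()
        self.proto_keys = nn.Parameter(torch.randn(num_prototypes, dim) * 0.02)
        self.proto_values = nn.Parameter(torch.randn(num_prototypes, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries):                 # queries: (B, T, dim)
        B = queries.shape[0]
        k = self.proto_keys.unsqueeze(0).expand(B, -1, -1)
        v = self.proto_values.unsqueeze(0).expand(B, -1, -1)
        mem_out, _ = self.attn(queries, k, v)   # attend over the prototype memory
        return queries + mem_out                # residual combination (assumption)

tokens = torch.randn(2, 10, 512)                # toy decoder activations
out = PrototypeMemory()(tokens)
```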
arXiv Detail & Related papers (2023-08-23T18:53:00Z)
- Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features [32.138956674478116]
Given a query composed of a reference image and a relative caption, the Composed Image Retrieval goal is to retrieve images visually similar to the reference one.
We use features from the OpenAI CLIP model to tackle the considered task.
We train a Combiner network that learns to combine the image-text features integrating the bimodal information.
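A minimal sketch of a Combiner-style fusion of CLIP features, assuming a gated mixture of the projected joint features and the raw image/text features; this particular architecture is an illustrative assumption, not the paper's exact network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Combiner(nn.Module):
    """Hypothetical sketch: fuse CLIP image features (reference image) and CLIP
    text features (relative caption) into one query embedding for retrieval."""

    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))

    def forward(self, img_feat, txt_feat):
        fused = torch.cat([img_feat, txt_feat], dim=-1)
        w = self.gate(fused)                                  # per-sample mixing weights
        combined = self.mlp(fused) + w[:, :1] * img_feat + w[:, 1:] * txt_feat
        return F.normalize(combined, dim=-1)                  # query for nearest-neighbour search

img_feat, txt_feat = torch.randn(4, 512), torch.randn(4, 512)  # toy CLIP features
query = Combiner()(img_feat, txt_feat)
```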
arXiv Detail & Related papers (2023-08-22T15:03:16Z)
- Data-efficient Large Scale Place Recognition with Graded Similarity Supervision [10.117451511942267]
Visual place recognition (VPR) is a fundamental task of computer vision for visual localization.
Existing methods are trained on image pairs labelled only in a binary fashion as depicting the same place or not.
We deploy an automatic re-annotation strategy to re-label VPR datasets.
We propose a new Generalized Contrastive Loss (GCL) that uses graded similarity labels for training contrastive networks.
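A minimal sketch of a contrastive loss driven by graded similarity labels rather than binary ones; the exact form of the paper's GCL may differ, so the formulation below is an assumption:

```python
import torch
import torch.nn.functional as F

def graded_contrastive_loss(desc_a, desc_b, similarity, margin=1.0):
    """Hypothetical sketch of a graded contrastive loss.

    desc_a, desc_b: (B, D) descriptors of image pairs.
    similarity:     (B,) graded labels in [0, 1] (1 = same place, 0 = unrelated).
    High-similarity pairs are pulled together, low-similarity pairs pushed
    apart, with the graded label weighting the two terms.
    """
    d = F.pairwise_distance(F.normalize(desc_a, dim=-1), F.normalize(desc_b, dim=-1))
    pull = similarity * d.pow(2)
    push = (1 - similarity) * F.relu(margin - d).pow(2)
    return (pull + push).mean()

a, b = torch.randn(8, 256), torch.randn(8, 256)
sim = torch.rand(8)                      # graded similarity labels (toy)
loss = graded_contrastive_loss(a, b, sim)
```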
arXiv Detail & Related papers (2023-03-21T10:56:57Z)
- Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision [49.905448429974804]
We consider the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes instead of pre-defined, closed-set categories.
We propose a transformer-based model for OVS, termed as OVSegmentor, which exploits web-crawled image-text pairs for pre-training.
Our model achieves superior segmentation results over the state-of-the-art method while using only 3% of the data (4M vs 134M) for pre-training.
arXiv Detail & Related papers (2023-01-22T13:10:05Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
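A rough sketch of the frozen-backbone recipe implied above, assuming per-frame features from a frozen image encoder (a toy stand-in for CLIP) and a small trainable temporal head; sizes and module choices are assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: frozen per-frame features + a small trainable temporal head.
frame_encoder = nn.Linear(3 * 64 * 64, 512)      # toy stand-in for a frozen CLIP image encoder
for p in frame_encoder.parameters():
    p.requires_grad = False

temporal_head = nn.TransformerEncoder(           # the only trainable parts
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=1)
classifier = nn.Linear(512, 400)                 # e.g. a Kinetics-400-sized label space

video = torch.randn(2, 16, 3, 64, 64)            # (batch, frames, C, H, W), toy data
with torch.no_grad():
    feats = frame_encoder(video.flatten(2))      # (2, 16, 512) frozen frame features
logits = classifier(temporal_head(feats).mean(dim=1))  # pooled video-level prediction
```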
arXiv Detail & Related papers (2022-08-06T17:38:25Z)