EffoVPR: Effective Foundation Model Utilization for Visual Place Recognition
- URL: http://arxiv.org/abs/2405.18065v2
- Date: Sun, 02 Feb 2025 22:46:41 GMT
- Title: EffoVPR: Effective Foundation Model Utilization for Visual Place Recognition
- Authors: Issar Tzachor, Boaz Lerner, Matan Levy, Michael Green, Tal Berkovitz Shalev, Gavriel Habib, Dvir Samuel, Noam Korngut Zailer, Or Shimshi, Nir Darshan, Rami Ben-Ari
- Abstract summary: We present an effective approach to harness the potential of a foundation model for Visual Place Recognition.
We show that features extracted from self-attention layers can act as a powerful re-ranker for VPR, even in a zero-shot setting.
Our method also demonstrates exceptional robustness and generalization, setting new state-of-the-art performance.
- Score: 6.996304653818122
- Abstract: The task of Visual Place Recognition (VPR) is to predict the location of a query image from a database of geo-tagged images. Recent studies in VPR have highlighted the significant advantage of employing pre-trained foundation models like DINOv2 for the VPR task. However, these models are often deemed inadequate for VPR without further fine-tuning on VPR-specific data. In this paper, we present an effective approach to harness the potential of a foundation model for VPR. We show that features extracted from self-attention layers can act as a powerful re-ranker for VPR, even in a zero-shot setting. Our method not only outperforms previous zero-shot approaches but also achieves results competitive with several supervised methods. We then show that a single-stage approach utilizing internal ViT layers for pooling can produce global features that achieve state-of-the-art performance, with impressive feature compactness down to 128D. Moreover, integrating our local foundation features for re-ranking further widens this performance gap. Our method also demonstrates exceptional robustness and generalization, setting new state-of-the-art performance, while handling challenging conditions such as occlusion, day-night transitions, and seasonal variations.
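As a rough illustration of the two ideas in the abstract (attention-derived local features for zero-shot re-ranking, and pooled internal ViT features as a global descriptor), the following is a minimal sketch assuming a DINOv2 ViT-S/14 backbone from torch.hub. The choice of the last block, the key slice of the qkv projection, and plain mean pooling are illustrative assumptions, not the paper's exact recipe (which, for instance, compresses global features down to 128D).

```python
# Minimal zero-shot sketch: per-patch keys from a ViT self-attention layer
# as local descriptors for re-ranking, plus a crude pooled global descriptor.
# Layer/slice/pooling choices here are assumptions, not the paper's recipe.
import torch
import torch.nn.functional as F

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

_cache = {}
# Hook the fused qkv projection of the last self-attention block.
model.blocks[-1].attn.qkv.register_forward_hook(
    lambda module, inp, out: _cache.update(qkv=out)
)

@torch.no_grad()
def local_descriptors(img: torch.Tensor) -> torch.Tensor:
    """img: (1, 3, H, W), H and W multiples of 14 -> (N_patches, D) keys."""
    model(img)
    qkv = _cache["qkv"]                # (1, 1 + N, 3 * D): q | k | v slices
    d = qkv.shape[-1] // 3
    keys = qkv[0, 1:, d:2 * d]         # drop the CLS token, keep the k slice
    return F.normalize(keys, dim=-1)

@torch.no_grad()
def global_descriptor(img: torch.Tensor) -> torch.Tensor:
    # Stand-in for the paper's internal-layer pooling: mean over patch keys.
    return F.normalize(local_descriptors(img).mean(dim=0), dim=-1)

@torch.no_grad()
def rerank_score(query: torch.Tensor, candidate: torch.Tensor) -> float:
    """Best-match similarity between the two images' local descriptors."""
    sim = local_descriptors(query) @ local_descriptors(candidate).T
    return sim.max(dim=1).values.mean().item()
```

In a two-stage pipeline, global_descriptor would shortlist database images by cosine similarity, and rerank_score would then reorder the top candidates.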
Related papers
- DenseVLM: A Retrieval and Decoupled Alignment Framework for Open-Vocabulary Dense Prediction [80.67150791183126]
We propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations.
We show that DenseVLM can be seamlessly integrated into open-vocabulary object detection and image segmentation tasks, leading to notable performance improvements.
arXiv Detail & Related papers (2024-12-09T06:34:23Z) - ACTRESS: Active Retraining for Semi-supervised Visual Grounding [52.08834188447851]
A previous study, RefTeacher, makes the first attempt to tackle this task by adopting the teacher-student framework to provide pseudo confidence supervision and attention-based supervision.
This approach is incompatible with current state-of-the-art visual grounding models, which follow the Transformer-based pipeline.
Our paper proposes the ACTive REtraining approach for Semi-Supervised Visual Grounding, abbreviated as ACTRESS.
arXiv Detail & Related papers (2024-07-03T16:33:31Z) - Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition [72.35438297011176]
We propose a novel method to realize seamless adaptation of pre-trained models for visual place recognition (VPR).
Specifically, to obtain both global and local features that focus on salient landmarks for discriminating places, we design a hybrid adaptation method.
Experimental results show that our method outperforms the state-of-the-art methods with less training data and training time.
arXiv Detail & Related papers (2024-02-22T12:55:01Z) - Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach sets new performance records for depth estimation on NYU Depth V2 and KITTI, and for semantic segmentation on Cityscapes.
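The meta-prompt mechanism can be illustrated generically: a handful of learnable embeddings are prepended to a frozen pre-trained backbone's token stream, and only the prompts (plus a task head) are trained. The sketch below uses a plain Transformer encoder as a placeholder backbone; the paper injects prompts into a pre-trained diffusion model, so all names and sizes here are hypothetical.

```python
# Generic "meta prompt" sketch: learnable tokens prepended to a frozen
# backbone's input sequence. The backbone is a placeholder, not the
# paper's diffusion model.
import torch
import torch.nn as nn

class MetaPromptedEncoder(nn.Module):
    def __init__(self, backbone: nn.Module, dim: int, n_prompts: int = 8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)        # keep pre-trained weights frozen
        self.prompts = nn.Parameter(torch.randn(1, n_prompts, dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        prompts = self.prompts.expand(tokens.shape[0], -1, -1)
        return self.backbone(torch.cat([prompts, tokens], dim=1))

dim = 256
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)
encoder = MetaPromptedEncoder(backbone, dim)
features = encoder(torch.randn(4, 196, dim))   # -> (4, 8 + 196, 256)
```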
arXiv Detail & Related papers (2023-12-22T14:40:55Z) - Towards Robust and Accurate Visual Prompting [11.918195429308035]
We study whether a visual prompt derived from a robust model can inherit the model's robustness while suffering a decline in generalization performance.
We introduce a novel technique named Prompt Boundary Loose (PBL) to effectively mitigate the suboptimal standard accuracy of visual prompts.
Our findings are universal and demonstrate the significant benefits of our proposed method.
arXiv Detail & Related papers (2023-11-18T07:00:56Z) - AnyLoc: Towards Universal Visual Place Recognition [12.892386791383025]
Visual Place Recognition (VPR) is vital for robot localization.
Most performant VPR approaches are environment- and task-specific.
We develop a universal solution to VPR -- a technique that works across a broad range of structured and unstructured environments.
arXiv Detail & Related papers (2023-08-01T17:45:13Z) - Universal Domain Adaptation from Foundation Models: A Baseline Study [58.51162198585434]
We conduct empirical studies of state-of-the-art UniDA methods using foundation models.
We introduce CLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models.
Although simple, our method outperforms previous approaches in most benchmark tasks.
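One plausible reading of parameter-free CLIP distillation is to treat CLIP's zero-shot similarities over class-name prompts as soft targets for the task model on unlabeled target images. The sketch below, using OpenAI's clip package, illustrates that reading only; the class names and temperature are placeholders, and this is not the paper's exact procedure.

```python
# Hedged sketch: CLIP zero-shot predictions as soft targets for distillation
# on unlabeled target-domain images. Class names and the temperature tau are
# placeholders; this is one reading of the blurb, not the paper's method.
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["backpack", "bike", "calculator"]      # hypothetical label set
text_tokens = clip.tokenize(
    [f"a photo of a {c}" for c in class_names]).to(device)

@torch.no_grad()
def clip_soft_targets(images: torch.Tensor, tau: float = 0.04) -> torch.Tensor:
    """images: CLIP-preprocessed batch (B, 3, 224, 224) -> (B, C) soft labels."""
    img_f = F.normalize(clip_model.encode_image(images).float(), dim=-1)
    txt_f = F.normalize(clip_model.encode_text(text_tokens).float(), dim=-1)
    return F.softmax(img_f @ txt_f.T / tau, dim=-1)

def distillation_loss(student_logits: torch.Tensor, images: torch.Tensor):
    # KL divergence pulls the student's predictions toward CLIP's soft labels;
    # CLIP itself stays frozen, so no extra trainable parameters are added.
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    clip_soft_targets(images), reduction="batchmean")
```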
arXiv Detail & Related papers (2023-05-18T16:28:29Z) - Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER).
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z) - Deep SIMBAD: Active Landmark-based Self-localization Using Ranking-based Scene Descriptor [5.482532589225552]
We consider a self-localization task performed by an active observer and present a novel reinforcement learning (RL)-based next-best-view (NBV) planner.
Experiments using the public NCLT dataset validated the effectiveness of the proposed approach.
arXiv Detail & Related papers (2021-09-06T23:51:27Z)