Revisit Anything: Visual Place Recognition via Image Segment Retrieval
- URL: http://arxiv.org/abs/2409.18049v1
- Date: Thu, 26 Sep 2024 16:49:58 GMT
- Title: Revisit Anything: Visual Place Recognition via Image Segment Retrieval
- Authors: Kartik Garg, Sai Shubodh Puligilla, Shishir Kolathaya, Madhava
Krishna, Sourav Garg
- Abstract summary: Existing visual place recognition pipelines encode the "whole" image and search for matches.
We address this by encoding and searching for "image segments" instead of the whole images.
We show that retrieving these partial representations leads to significantly higher recognition recall than typical whole-image-based retrieval.
- Score: 8.544326445217369
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurately recognizing a revisited place is crucial for embodied agents to
localize and navigate. This requires visual representations to be distinct,
despite strong variations in camera viewpoint and scene appearance. Existing
visual place recognition pipelines encode the "whole" image and search for
matches. This poses a fundamental challenge in matching two images of the same
place captured from different camera viewpoints: "the similarity of what
overlaps can be dominated by the dissimilarity of what does not overlap". We
address this by encoding and searching for "image segments" instead of the
whole images. We propose to use open-set image segmentation to decompose an
image into "meaningful" entities (i.e., things and stuff). This enables us to
create a novel image representation as a collection of multiple overlapping
subgraphs connecting a segment with its neighboring segments, dubbed
SuperSegment. Furthermore, to efficiently encode these SuperSegments into
compact vector representations, we propose a novel factorized representation of
feature aggregation. We show that retrieving these partial representations
leads to significantly higher recognition recall than typical whole-image-based
retrieval. Our segments-based approach, dubbed SegVLAD, sets a new
state-of-the-art in place recognition on a diverse selection of benchmark
datasets, while being applicable to both generic and task-specialized image
encoders. Finally, we demonstrate the potential of our method to "revisit
anything" by evaluating it on an object instance retrieval task, which
bridges the two disparate areas of research: visual place recognition and
object-goal navigation, through their common aim of recognizing goal objects
specific to a place. Source code: https://github.com/AnyLoc/Revisit-Anything.
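The abstract's two central mechanisms, SuperSegments (a segment joined with its neighboring segments) and factorized feature aggregation, can be made concrete with a short sketch. The NumPy snippet below is a minimal, hypothetical reconstruction based only on the abstract, not the released code: the function names, the `neighbors` adjacency map (assumed to come from, e.g., touching open-set segmentation masks), and the VLAD-style residual aggregation implied by the name SegVLAD are all assumptions.

```python
import numpy as np

def per_segment_residuals(features, seg_ids, centroids):
    """Aggregate VLAD residuals once per image segment.

    features:  (N, D) local features from any image encoder
    seg_ids:   (N,) open-set segment id for each feature location
    centroids: (K, D) VLAD vocabulary
    Returns (num_segments, K, D) per-segment residual sums.
    """
    # Hard-assign each feature to its nearest centroid.
    assign = np.argmin(
        ((features[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    out = np.zeros((seg_ids.max() + 1, *centroids.shape), dtype=np.float32)
    for f, a, s in zip(features, assign, seg_ids):
        out[s, a] += f - centroids[a]  # residual w.r.t. the assigned centroid
    return out

def supersegment_descriptors(seg_res, neighbors):
    """Factorized aggregation: a SuperSegment descriptor is the sum of its
    member segments' precomputed residuals, so per-segment work is done once
    and reused across the many overlapping SuperSegments that share a segment.

    neighbors: dict mapping a segment id to the ids of its adjacent segments.
    """
    descs = []
    for s, nbrs in neighbors.items():
        agg = seg_res[[s, *nbrs]].sum(axis=0)                       # (K, D)
        agg /= np.linalg.norm(agg, axis=-1, keepdims=True) + 1e-12  # intra-norm
        flat = agg.reshape(-1)
        descs.append(flat / (np.linalg.norm(flat) + 1e-12))         # final L2 norm
    return np.stack(descs)
```

Retrieval would then run nearest-neighbor search over these per-SuperSegment vectors across the whole database and aggregate segment-level matches back into an image-level vote; this is what would let a partial overlap between two views score highly even when the non-overlapping content differs.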
Related papers
- A Generative Approach for Wikipedia-Scale Visual Entity Recognition [56.55633052479446]
We address the task of mapping a given query image to one of the 6 million existing entities in Wikipedia.
We introduce a novel Generative Entity Recognition framework, which learns to auto-regressively decode a "semantic and discriminative code" identifying the target entity.
arXiv Detail & Related papers (2024-03-04T13:47:30Z)
- Region-Based Representations Revisited [34.01784145403097]
We show that recent class-agnostic segmenters like SAM can be effectively combined with strong unsupervised representations like DINOv2.
The compactness of the representation also makes it well-suited to video analysis and other problems requiring inference across many images.
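The combination described above can be pictured as simple mask pooling: average a backbone's patch-feature grid under each class-agnostic mask to get one compact descriptor per region. The snippet below is a hypothetical illustration of that pooling step, not the paper's pipeline; `patch_feats` stands in for, e.g., DINOv2 patch tokens reshaped to a grid, and `masks` for SAM masks resized to the same resolution.

```python
import numpy as np

def mask_pooled_descriptors(patch_feats, masks):
    """Pool a feature grid under class-agnostic masks, yielding one
    compact, L2-normalized descriptor per region.

    patch_feats: (H, W, D) feature grid (e.g., reshaped ViT patch tokens)
    masks:       iterable of (H, W) boolean arrays, each non-empty
    """
    descs = []
    for m in masks:
        v = patch_feats[m].mean(axis=0)                # mean over masked cells
        descs.append(v / (np.linalg.norm(v) + 1e-12))  # L2-normalize
    return np.stack(descs)
```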
arXiv Detail & Related papers (2024-02-04T05:33:04Z)
- Self-Correlation and Cross-Correlation Learning for Few-Shot Remote Sensing Image Semantic Segmentation [27.59330408178435]
Few-shot remote sensing semantic segmentation aims at learning to segment target objects from a query image.
We propose a Self-Correlation and Cross-Correlation Learning Network for few-shot remote sensing image semantic segmentation.
Our model enhances the generalization by considering both self-correlation and cross-correlation between support and query images.
arXiv Detail & Related papers (2023-09-11T21:53:34Z)
- Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target.
Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z)
- Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without requiring any dense annotations.
On three benchmark datasets, our method can directly segment objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling.
arXiv Detail & Related papers (2022-07-18T09:20:04Z)
- Investigating the Role of Image Retrieval for Visual Localization -- An exhaustive benchmark [46.166955777187816]
This paper focuses on understanding the role of image retrieval for multiple visual localization paradigms.
We introduce a novel benchmark setup and compare state-of-the-art retrieval representations on multiple datasets.
Using these tools and in-depth analysis, we show that retrieval performance on classical landmark retrieval or place recognition tasks correlates with localization performance for only some, not all, paradigms.
arXiv Detail & Related papers (2022-05-31T12:59:01Z)
- Unsupervised Part Discovery from Contrastive Reconstruction [90.88501867321573]
The goal of self-supervised visual representation learning is to learn strong, transferable image representations.
We propose an unsupervised approach to object part discovery and segmentation.
Our method yields semantic parts consistent across fine-grained but visually distinct categories.
arXiv Detail & Related papers (2021-11-11T17:59:42Z)
- Locate then Segment: A Strong Pipeline for Referring Image Segmentation [73.19139431806853]
Referring image segmentation aims to segment the objects referred by a natural language expression.
Previous methods usually focus on designing an implicit and recurrent interaction mechanism to fuse the visual-linguistic features to directly generate the final segmentation mask.
We present a "Then-Then-Segment" scheme to tackle these problems.
Our framework is simple but surprisingly effective.
arXiv Detail & Related papers (2021-03-30T12:25:27Z)
- Benchmarking Image Retrieval for Visual Localization [41.38065116577011]
Visual localization is a core component of technologies such as autonomous driving and augmented reality.
It is common practice to use state-of-the-art image retrieval algorithms for these tasks.
This paper focuses on understanding the role of image retrieval for multiple visual localization tasks.
arXiv Detail & Related papers (2020-11-24T07:59:52Z)
- Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets a new state of the art in all these settings, demonstrating its efficacy and generalizability.
arXiv Detail & Related papers (2020-07-03T21:53:46Z)