Where am I? Cross-View Geo-localization with Natural Language Descriptions
- URL: http://arxiv.org/abs/2412.17007v1
- Date: Sun, 22 Dec 2024 13:13:10 GMT
- Title: Where am I? Cross-View Geo-localization with Natural Language Descriptions
- Authors: Junyan Ye, Honglin Lin, Leyan Ou, Dairong Chen, Zihao Wang, Conghui He, Weijia Li,
- Abstract summary: Cross-view geo-localization identifies the locations of street-view images by matching them with geo-tagged satellite images or OSM.
We introduce a novel task for cross-view geo-localization with natural language descriptions, which aims to retrieve the corresponding satellite images or OSM database entries based on scene text.
- Score: 16.870286138129902
- License:
- Abstract: Cross-view geo-localization identifies the locations of street-view images by matching them with geo-tagged satellite images or OSM. However, most studies focus on image-to-image retrieval, with fewer addressing text-guided retrieval, a task vital for applications like pedestrian navigation and emergency response. In this work, we introduce a novel task for cross-view geo-localization with natural language descriptions, which aims to retrieve the corresponding satellite images or OSM database entries based on scene text. To support this task, we construct the CVG-Text dataset by collecting cross-view data from multiple cities and employing a scene text generation approach that leverages the annotation capabilities of Large Multimodal Models to produce high-quality scene text descriptions with localization details. Additionally, we propose a novel text-based retrieval localization method, CrossText2Loc, which improves recall by 10% and demonstrates excellent long-text retrieval capabilities. In terms of explainability, it not only provides similarity scores but also offers retrieval reasons. More information can be found at https://yejy53.github.io/CVG-Text/.
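For intuition, the sketch below shows a generic dual-encoder retrieval loop of the kind this task implies: scene-text descriptions and satellite tiles are embedded into a shared space, candidates are ranked by cosine similarity, and Recall@K is measured. The encoders here are random-projection placeholders for illustration only; this is not the authors' CrossText2Loc model or the CVG-Text evaluation code.

```python
# Minimal text-to-satellite retrieval sketch (NOT the authors' CrossText2Loc):
# embed scene-text queries and satellite tiles into a shared space, rank tiles
# by cosine similarity, and report Recall@K for the ground-truth tile.
import numpy as np

rng = np.random.default_rng(0)

def embed_texts(texts, dim=512):
    """Placeholder text encoder: random unit vectors stand in for a learned model."""
    v = rng.normal(size=(len(texts), dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def embed_images(n_images, dim=512):
    """Placeholder satellite-image encoder with the same embedding dimension."""
    v = rng.normal(size=(n_images, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def recall_at_k(text_emb, image_emb, gt_index, k=1):
    """Fraction of queries whose ground-truth tile appears among the top-k candidates."""
    sims = text_emb @ image_emb.T                  # cosine similarity (unit vectors)
    topk = np.argsort(-sims, axis=1)[:, :k]        # highest-similarity tiles first
    hits = [gt_index[i] in topk[i] for i in range(len(gt_index))]
    return float(np.mean(hits))

queries = ["Corner cafe opposite a red-brick church, with a tram stop to the north."]
text_emb = embed_texts(queries)
tile_emb = embed_images(1000)
print("Recall@10:", recall_at_k(text_emb, tile_emb, gt_index=[42], k=10))
```

In a real system the placeholder encoders would be replaced by trained text and satellite-image encoders (for example, models fine-tuned on CVG-Text), and Recall@K would be computed over the full tile database of each city.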
Related papers
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the problem with more semantics.
We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
- ProGEO: Generating Prompts through Image-Text Contrastive Learning for Visual Geo-localization [0.0]
We propose a two-stage training method to enhance visual performance and use contrastive learning to mine challenging samples.
We validate the effectiveness of the proposed strategy on several large-scale visual geo-localization datasets.
arXiv Detail & Related papers (2024-06-04T02:28:51Z)
- Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering [50.52792174648067]
This initiative seeks to bridge the gap between textual and visual comprehension.
We propose a new multi-task Urdu scene text dataset comprising over 1000 natural scene images.
We provide fine-grained annotations for text instances, addressing the limitations of previous datasets.
arXiv Detail & Related papers (2024-05-21T06:48:26Z)
- TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method is free of extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z)
- CLIM: Contrastive Language-Image Mosaic for Region Representation [58.05870131126816]
Contrastive Language-Image Mosaic (CLIM) is a novel approach for aligning region and text representations.
CLIM consistently improves different open-vocabulary object detection methods.
It can effectively enhance the region representation of vision-language models.
arXiv Detail & Related papers (2023-12-18T17:39:47Z)
- Scene Graph Based Fusion Network For Image-Text Retrieval [2.962083552798791]
A critical challenge to image-text retrieval is how to learn accurate correspondences between images and texts.
We propose a Scene Graph based Fusion Network (dubbed SGFN) which enhances the images'/texts' features through intra- and cross-modal fusion.
Our SGFN outperforms a number of SOTA image-text retrieval methods.
arXiv Detail & Related papers (2023-03-20T13:22:56Z)
- HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval [13.061063817876336]
We propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval.
First, to capture the comprehensive multimodal features, we construct the feature graphs for the image and text modality respectively.
Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module.
Finally, the ultimate image and text features are further refined through three-level similarity functions to achieve the hierarchical alignment.
arXiv Detail & Related papers (2022-12-16T05:08:52Z)
- SLAN: Self-Locator Aided Network for Cross-Modal Understanding [89.20623874655352]
We propose Self-Locator Aided Network (SLAN) for cross-modal understanding tasks.
SLAN consists of a region filter and a region adaptor to localize regions of interest conditioned on different texts.
It achieves fairly competitive results on five cross-modal understanding tasks.
arXiv Detail & Related papers (2022-11-28T11:42:23Z)
- Telling the What while Pointing the Where: Fine-grained Mouse Trace and Language Supervision for Improved Image Retrieval [60.24860627782486]
Fine-grained image retrieval often requires the ability to also express where in the image the content of interest is located.
In this paper, we describe an image retrieval setup where the user simultaneously describes an image using both spoken natural language (the "what") and mouse traces over an empty canvas (the "where").
Our model is capable of taking this spatial guidance into account, and provides more accurate retrieval results compared to text-only equivalent systems.
arXiv Detail & Related papers (2021-02-09T17:54:34Z)
- StacMR: Scene-Text Aware Cross-Modal Retrieval [19.54677614738065]
Cross-modal retrieval models have benefited from an increasingly rich understanding of visual scenes.
Current models overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval.
We propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances.
arXiv Detail & Related papers (2020-12-08T10:04:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.