Related papers: GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model

GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model

URL: http://arxiv.org/abs/2406.18572v2
Date: Thu, 17 Oct 2024 03:25:47 GMT
Title: GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model
Authors: Ling Li, Yu Ye, Bingchuan Jiang, Wei Zeng,
Abstract summary: This work tackles the problem of geo-localization with a new paradigm using a large vision-language model (LVLM) Existing street-view datasets often contain numerous low-quality images lacking visual clues, and lack any reasoning inference. To address the data-quality issue, we devise a CLIP-based network to quantify the degree of street-view images being locatable. To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities.
Score: 6.135404769437841
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This work tackles the problem of geo-localization with a new paradigm using a large vision-language model (LVLM) augmented with human inference knowledge. A primary challenge here is the scarcity of data for training the LVLM - existing street-view datasets often contain numerous low-quality images lacking visual clues, and lack any reasoning inference. To address the data-quality issue, we devise a CLIP-based network to quantify the degree of street-view images being locatable, leading to the creation of a new dataset comprising highly locatable street views. To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities. The data are utilized to train GeoReasoner, which undergoes fine-tuning through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations illustrate that GeoReasoner outperforms counterpart LVLMs by more than 25% at country-level and 38% at city-level geo-localization tasks, and surpasses StreetCLIP performance while requiring fewer training resources. The data and code are available at https://github.com/lingli1996/GeoReasoner.

Related papers

From Pixels to Places: A Systematic Benchmark for Evaluating Image Geolocalization Ability in Large Language Models [14.178064117544082]
Image geolocalization is important for applications in crisis response, digital forensics, and location-based intelligence.<n>Recent advances in large language models (LLMs) offer new opportunities for visual reasoning.<n>We introduce a benchmark called IMAGEO-Bench that systematically evaluates accuracy, distance error, geospatial bias, and reasoning process.
arXiv Detail & Related papers (2025-08-03T06:04:33Z)
Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models [27.848962405476108]
New pipeline constructs reasoning-oriented geo-localization dataset, MP16-Reason, using diverse social media images.<n>We introduce GLOBE, Group-relative policy optimization for Locatability assessment and optimized visual-clue reasoning.<n>Results demonstrate that GLOBE outperforms state-of-the-art open-source LVLMs on geo-localization tasks.
arXiv Detail & Related papers (2025-06-17T16:07:58Z)
Evaluating Precise Geolocation Inference Capabilities of Vision Language Models [0.0]
This paper introduces a benchmark dataset collected from Google Street View that represents its global distribution of coverage. Foundation models are evaluated on single-image geolocation inference, with many achieving median distance errors of 300 km. We further evaluate VLM "agents" with access to supplemental tools, observing up to a 30.6% decrease in distance error.
arXiv Detail & Related papers (2025-02-20T09:59:28Z)
Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework [59.42946541163632]
We introduce a comprehensive geolocation framework with three key components. GeoComp, a large-scale dataset; GeoCoT, a novel reasoning method; and GeoEval, an evaluation metric. We demonstrate that GeoCoT significantly boosts geolocation accuracy by up to 25% while enhancing interpretability.
arXiv Detail & Related papers (2025-02-19T14:21:25Z)
Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework [51.26566634946208]
We introduce smileGeo, a novel visual geo-localization framework. By inter-agent communication, smileGeo integrates the inherent knowledge of these agents with additional retrieved information. Results show that our approach significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2024-08-21T03:31:30Z)
Image-Based Geolocation Using Large Vision-Language Models [19.071551941682063]
We introduce tool, an innovative framework that significantly enhances image-based geolocation accuracy. tool employs a systematic chain-of-thought (CoT) approach, mimicking human geoguessing strategies. It achieves an impressive average score of 4550.5 in the GeoGuessr game, with an 85.37% win rate, and delivers highly precise geolocation predictions.
arXiv Detail & Related papers (2024-08-18T13:39:43Z)
CurriculumLoc: Enhancing Cross-Domain Geolocalization through Multi-Stage Refinement [11.108860387261508]
Visual geolocalization is a cost-effective and scalable task that involves matching one or more query images taken at some unknown location, to a set of geo-tagged reference images. We develop CurriculumLoc, a novel keypoint detection and description with global semantic awareness and a local geometric verification. We achieve new high recall@1 scores of 62.6% and 94.5% on ALTO, with two different distances metrics, respectively.
arXiv Detail & Related papers (2023-11-20T08:40:01Z)
GeoLLM: Extracting Geospatial Knowledge from Large Language Models [49.20315582673223]
We present GeoLLM, a novel method that can effectively extract geospatial knowledge from large language models. We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods. Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe.
arXiv Detail & Related papers (2023-10-10T00:03:23Z)
GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization [61.10806364001535]
Worldwide Geo-localization aims to pinpoint the precise location of images taken anywhere on Earth. Existing approaches divide the globe into discrete geographic cells, transforming the problem into a classification task. We propose GeoCLIP, a novel CLIP-inspired Image-to-GPS retrieval approach that enforces alignment between the image and its corresponding GPS locations.
arXiv Detail & Related papers (2023-09-27T20:54:56Z)
GeoNet: Benchmarking Unsupervised Adaptation across Geographies [71.23141626803287]
We study the problem of geographic robustness and make three main contributions. First, we introduce a large-scale dataset GeoNet for geographic adaptation. Second, we hypothesize that the major source of domain shifts arise from significant variations in scene context. Third, we conduct an extensive evaluation of several state-of-the-art unsupervised domain adaptation algorithms and architectures.
arXiv Detail & Related papers (2023-03-27T17:59:34Z)
GSV-Cities: Toward Appropriate Supervised Visual Place Recognition [3.6739949215165164]
We introduce GSV-Cities, a new image dataset providing the widest geographic coverage to date with highly accurate ground truth. We then explore the full potential of advances in deep metric learning to train networks specifically for place recognition. We establish a new state-of-the-art on large-scale benchmarks, such as Pittsburgh, Mapillary-SLS, SPED and Nordland.
arXiv Detail & Related papers (2022-10-19T01:39:29Z)
Learning Large-scale Location Embedding From Human Mobility Trajectories with Graphs [0.0]
This study learns vector representations for locations using the large-scale LBS data. This model embeds context information in human mobility and spatial information. GCN-L2V can be applied in a complementary manner to other place embedding methods and down-streaming Geo-aware applications.
arXiv Detail & Related papers (2021-02-23T09:11:33Z)
PGL: Prior-Guided Local Self-supervised Learning for 3D Medical Image Segmentation [87.50205728818601]
We propose a PriorGuided Local (PGL) self-supervised model that learns the region-wise local consistency in the latent feature space. Our PGL model learns the distinctive representations of local regions, and hence is able to retain structural information.
arXiv Detail & Related papers (2020-11-25T11:03:11Z)
Each Part Matters: Local Patterns Facilitate Cross-view Geo-localization [54.00111565818903]
Cross-view geo-localization is to spot images of the same geographic target from different platforms. Existing methods usually concentrate on mining the fine-grained feature of the geographic target in the image center. We introduce a simple and effective deep neural network, called Local Pattern Network (LPN), to take advantage of contextual information.
arXiv Detail & Related papers (2020-08-26T16:06:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.