GeoChain: Multimodal Chain-of-Thought for Geographic Reasoning
- URL: http://arxiv.org/abs/2506.00785v1
- Date: Sun, 01 Jun 2025 02:24:46 GMT
- Title: GeoChain: Multimodal Chain-of-Thought for Geographic Reasoning
- Authors: Sahiti Yerramilli, Nilay Pande, Rynaa Grover, Jayant Sravan Tamarapalli,
- Abstract summary: GeoChain is a benchmark for evaluating step-by-step geographic reasoning in multimodal large language models (MLLMs)<n>It pairs each image with a 21-step chain-of-thought (CoT) question sequence (over 30 million Q&A pairs)<n>These sequences guide models from coarse attributes to fine-grained localization across four reasoning categories.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper introduces GeoChain, a large-scale benchmark for evaluating step-by-step geographic reasoning in multimodal large language models (MLLMs). Leveraging 1.46 million Mapillary street-level images, GeoChain pairs each image with a 21-step chain-of-thought (CoT) question sequence (over 30 million Q&A pairs). These sequences guide models from coarse attributes to fine-grained localization across four reasoning categories - visual, spatial, cultural, and precise geolocation - annotated by difficulty. Images are also enriched with semantic segmentation (150 classes) and a visual locatability score. Our benchmarking of contemporary MLLMs (GPT-4.1 variants, Claude 3.7, Gemini 2.5 variants) on a diverse 2,088-image subset reveals consistent challenges: models frequently exhibit weaknesses in visual grounding, display erratic reasoning, and struggle to achieve accurate localization, especially as the reasoning complexity escalates. GeoChain offers a robust diagnostic methodology, critical for fostering significant advancements in complex geographic reasoning within MLLMs.
Related papers
- GeoFocus: Blending Efficient Global-to-Local Perception for Multimodal Geometry Problem-Solving [55.14836667214487]
GeoFocus is a novel framework comprising two core modules.<n>GeoFocus achieves a 4.7% accuracy improvement over leading specialized models.<n>It demonstrates superior robustness in MATHVERSE under diverse visual conditions.
arXiv Detail & Related papers (2026-02-09T11:15:01Z) - Geo3DVQA: Evaluating Vision-Language Models for 3D Geospatial Reasoning from Aerial Imagery [18.7420518276348]
Geo3DVQA is a benchmark for evaluating vision-language models (VLMs) in height-aware, 3D geospatial reasoning.<n>Unlike conventional sensor-based frameworks, Geo3DVQA emphasizes realistic scenarios that integrate elevation, sky view factors, and land cover patterns.
arXiv Detail & Related papers (2025-12-08T08:16:14Z) - GraphGeo: Multi-Agent Debate Framework for Visual Geo-localization with Heterogeneous Graph Neural Networks [15.659980269049798]
Visual geo-localization requires extensive geographic knowledge and sophisticated reasoning to determine image locations without GPS metadata.<n>Recent Large Vision-Language Models (LVLMs) enable direct location reasoning from image content, yet individual models struggle with diverse geographic regions and complex scenes.<n>We propose textbfGraphGeo, a multi-agent debate framework using heterogeneous graph neural networks for visual geo-localization.
arXiv Detail & Related papers (2025-11-02T11:58:55Z) - Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales [61.03549470159347]
Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions has not been comprehensively evaluated.<n>We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use.
arXiv Detail & Related papers (2025-10-13T01:12:21Z) - GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings [3.43519422766841]
We formulate geo-localization as aligning the visual representation of the query image with a learned geographic representation.<n>Our main experiments demonstrate improved all-time bests in 22 out of 25 metrics measured across five benchmark datasets.
arXiv Detail & Related papers (2025-10-01T20:39:48Z) - GLEAM: Learning to Match and Explain in Cross-View Geo-Localization [66.11208984986813]
Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location.<n>We present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities-including UAV imagery, street maps, panoramic views, and ground photographs-by aligning them exclusively with satellite imagery.<n>To address the lack of interpretability in traditional CVGL methods, we propose GLEAM-X, which combines cross-view correspondence prediction with explainable reasoning.
arXiv Detail & Related papers (2025-09-09T07:14:31Z) - Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models [27.848962405476108]
New pipeline constructs reasoning-oriented geo-localization dataset, MP16-Reason, using diverse social media images.<n>We introduce GLOBE, Group-relative policy optimization for Locatability assessment and optimized visual-clue reasoning.<n>Results demonstrate that GLOBE outperforms state-of-the-art open-source LVLMs on geo-localization tasks.
arXiv Detail & Related papers (2025-06-17T16:07:58Z) - GeoLocSFT: Efficient Visual Geolocation via Supervised Fine-Tuning of Multimodal Foundation Models [4.956977275061966]
GeoLocSFT is trained with only 2700 carefully selected image-GPS pairs from our geographically diverse MR600k dataset.<n>Despite this limited data, our SFT-centric approach substantially improves over baseline models.<n>Our findings highlight the power of high-quality supervision and efficient SFT for planet-scale image geolocation.
arXiv Detail & Related papers (2025-06-02T03:16:19Z) - GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization [30.983556433953076]
We propose GeoRanker, a distance-aware ranking framework for image geolocalization.<n>We introduce a multi-order distance loss that ranks both absolute and relative distances, enabling the model to reason over structured spatial relationships.<n>GeoRanker achieves state-of-the-art results on two well-established benchmarks.
arXiv Detail & Related papers (2025-05-19T21:04:46Z) - Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework [59.42946541163632]
We introduce a comprehensive geolocation framework with three key components.<n>GeoComp, a large-scale dataset; GeoCoT, a novel reasoning method; and GeoEval, an evaluation metric.<n>We demonstrate that GeoCoT significantly boosts geolocation accuracy by up to 25% while enhancing interpretability.
arXiv Detail & Related papers (2025-02-19T14:21:25Z) - PEACE: Empowering Geologic Map Holistic Understanding with MLLMs [64.58959634712215]
Geologic map, as a fundamental diagram in geology science, provides critical insights into the structure and composition of Earth's subsurface and surface.<n>Despite their significance, current Multimodal Large Language Models (MLLMs) often fall short in geologic map understanding.<n>To quantify this gap, we construct GeoMap-Bench, the first-ever benchmark for evaluating MLLMs in geologic map understanding.
arXiv Detail & Related papers (2025-01-10T18:59:42Z) - MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models [7.422346909538787]
We introduce MapEval, a benchmark designed to assess diverse and complex map-based user queries with geo-spatial reasoning.<n>MapEval consists of 700 unique multiple-choice questions about locations across 180 cities and 54 countries.<n>Our detailed analyses provide insights into the strengths and weaknesses of current models, though all models still fall short of human performance by more than 20% on average.<n>This gap highlights MapEval's critical role in advancing general-purpose foundation models with stronger geo-spatial understanding.
arXiv Detail & Related papers (2024-12-31T07:20:32Z) - GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks [84.86699025256705]
We present GEOBench-VLM, a benchmark specifically designed to evaluate Vision-Language Models (VLMs) on geospatial tasks.<n>Our benchmark features over 10,000 manually verified instructions and spanning diverse visual conditions, object types, and scales.<n>We evaluate several state-of-the-art VLMs to assess performance on geospatial-specific challenges.
arXiv Detail & Related papers (2024-11-28T18:59:56Z) - G3: An Effective and Adaptive Framework for Worldwide Geolocalization Using Large Multi-Modality Models [40.69217368870192]
We propose a novel framework for worldwide geolocalization based on Retrieval-Augmented Generation (RAG)
G3 consists of three steps, i.e., Geo-alignment, Geo-diversification, and Geo-verification.
Experiments on two well-established datasets verify the superiority of G3 compared to other state-of-the-art methods.
arXiv Detail & Related papers (2024-05-23T15:37:06Z) - CurriculumLoc: Enhancing Cross-Domain Geolocalization through
Multi-Stage Refinement [11.108860387261508]
Visual geolocalization is a cost-effective and scalable task that involves matching one or more query images taken at some unknown location, to a set of geo-tagged reference images.
We develop CurriculumLoc, a novel keypoint detection and description with global semantic awareness and a local geometric verification.
We achieve new high recall@1 scores of 62.6% and 94.5% on ALTO, with two different distances metrics, respectively.
arXiv Detail & Related papers (2023-11-20T08:40:01Z) - GeoCLIP: Clip-Inspired Alignment between Locations and Images for
Effective Worldwide Geo-localization [61.10806364001535]
Worldwide Geo-localization aims to pinpoint the precise location of images taken anywhere on Earth.
Existing approaches divide the globe into discrete geographic cells, transforming the problem into a classification task.
We propose GeoCLIP, a novel CLIP-inspired Image-to-GPS retrieval approach that enforces alignment between the image and its corresponding GPS locations.
arXiv Detail & Related papers (2023-09-27T20:54:56Z) - GeoGLUE: A GeoGraphic Language Understanding Evaluation Benchmark [56.08664336835741]
We propose a GeoGraphic Language Understanding Evaluation benchmark, named GeoGLUE.
We collect data from open-released geographic resources and introduce six natural language understanding tasks.
We pro vide evaluation experiments and analysis of general baselines, indicating the effectiveness and significance of the GeoGLUE benchmark.
arXiv Detail & Related papers (2023-05-11T03:21:56Z) - GeoNet: Benchmarking Unsupervised Adaptation across Geographies [71.23141626803287]
We study the problem of geographic robustness and make three main contributions.
First, we introduce a large-scale dataset GeoNet for geographic adaptation.
Second, we hypothesize that the major source of domain shifts arise from significant variations in scene context.
Third, we conduct an extensive evaluation of several state-of-the-art unsupervised domain adaptation algorithms and architectures.
arXiv Detail & Related papers (2023-03-27T17:59:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.