Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales
- URL: http://arxiv.org/abs/2510.10880v1
- Date: Mon, 13 Oct 2025 01:12:21 GMT
- Title: Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales
- Authors: Zhaofang Qian, Hardy Chen, Zeyu Wang, Li Zhang, Zijun Wang, Xiaoke Huang, Hui Liu, Xianfeng Tang, Zeyu Zheng, Haoqin Tu, Cihang Xie, Yuyin Zhou,
- Abstract summary: Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions has not been comprehensively evaluated.<n>We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use.
- Score: 61.03549470159347
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions, a task that is challenging and of demand in real life, has not been comprehensively evaluated. We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use. EarthWhere comprises 810 globally distributed images across two complementary geolocation scales: WhereCountry (i.e., 500 multiple-choice question-answering, with country-level answer and panoramas) and WhereStreet (i.e., 310 fine-grained street-level identification tasks requiring multi-step reasoning with optional web search). For evaluation, we adopt the final-prediction metrics: location accuracies within k km (Acc@k) for coordinates and hierarchical path scores for textual localization. Beyond this, we propose to explicitly score intermediate reasoning chains using human-verified key visual clues and a Shapley-reweighted thinking score that attributes credit to each clue's marginal contribution. We benchmark 13 state-of-the-art VLMs with web searching tools on our EarthWhere and report different types of final answer accuracies as well as the calibrated model thinking scores. Overall, Gemini-2.5-Pro achieves the best average accuracy at 56.32%, while the strongest open-weight model, GLM-4.5V, reaches 34.71%. We reveal that web search and reasoning do not guarantee improved performance when visual clues are limited, and models exhibit regional biases, achieving up to 42.7% higher scores in certain areas than others. These findings highlight not only the promise but also the persistent challenges of models to mitigate bias and achieve robust, fine-grained localization. We open-source our benchmark at https://github.com/UCSC-VLAA/EarthWhere.
Related papers
- Geo3DVQA: Evaluating Vision-Language Models for 3D Geospatial Reasoning from Aerial Imagery [18.7420518276348]
Geo3DVQA is a benchmark for evaluating vision-language models (VLMs) in height-aware, 3D geospatial reasoning.<n>Unlike conventional sensor-based frameworks, Geo3DVQA emphasizes realistic scenarios that integrate elevation, sky view factors, and land cover patterns.
arXiv Detail & Related papers (2025-12-08T08:16:14Z) - GeoArena: An Open Platform for Benchmarking Large Vision-language Models on WorldWide Image Geolocalization [32.342417136518286]
Image geolocalization aims to predict the geographic location of images captured anywhere on Earth.<n>Current evaluation methodologies suffer from two major limitations.<n>We propose GeoArena, a first open platform for evaluating LVLMs on worldwide image geolocalization tasks.
arXiv Detail & Related papers (2025-09-04T15:52:04Z) - From Pixels to Places: A Systematic Benchmark for Evaluating Image Geolocalization Ability in Large Language Models [14.178064117544082]
Image geolocalization is important for applications in crisis response, digital forensics, and location-based intelligence.<n>Recent advances in large language models (LLMs) offer new opportunities for visual reasoning.<n>We introduce a benchmark called IMAGEO-Bench that systematically evaluates accuracy, distance error, geospatial bias, and reasoning process.
arXiv Detail & Related papers (2025-08-03T06:04:33Z) - VLM-Guided Visual Place Recognition for Planet-Scale Geo-Localization [24.433604332415204]
We propose a novel hybrid geo-localization framework that combines the strengths of vision-language models and visual place recognition.<n>We evaluate our approach on multiple geo-localization benchmarks and show that it consistently outperforms prior state-of-the-art methods.
arXiv Detail & Related papers (2025-07-23T12:23:03Z) - Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models [47.98900725310249]
New pipeline constructs a reasoning-oriented geo-localization dataset, MP16-Reason, using diverse social media images.<n>GLOBE incorporates task-specific rewards that jointly enhance localizability assessment, visual-cue reasoning, and geolocation accuracy.<n>Results demonstrate that GLOBE outperforms state-of-the-art open-source LVLMs on geo-localization tasks.
arXiv Detail & Related papers (2025-06-17T16:07:58Z) - Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework [59.42946541163632]
We introduce a comprehensive geolocation framework with three key components.<n>GeoComp, a large-scale dataset; GeoCoT, a novel reasoning method; and GeoEval, an evaluation metric.<n>We demonstrate that GeoCoT significantly boosts geolocation accuracy by up to 25% while enhancing interpretability.
arXiv Detail & Related papers (2025-02-19T14:21:25Z) - MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models [7.422346909538787]
MapEval is a benchmark designed to assess foundation models across three distinct tasks.<n>It covers spatial relationships, navigation, travel planning, and real-world map interactions.<n>It requires models to handle long-context reasoning, API interactions, and visual map analysis.
arXiv Detail & Related papers (2024-12-31T07:20:32Z) - GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks [84.86699025256705]
We present GEOBench-VLM, a benchmark specifically designed to evaluate Vision-Language Models (VLMs) on geospatial tasks.<n>Our benchmark features over 10,000 manually verified instructions and spanning diverse visual conditions, object types, and scales.<n>We evaluate several state-of-the-art VLMs to assess performance on geospatial-specific challenges.
arXiv Detail & Related papers (2024-11-28T18:59:56Z) - GeoLLM: Extracting Geospatial Knowledge from Large Language Models [49.20315582673223]
We present GeoLLM, a novel method that can effectively extract geospatial knowledge from large language models.
We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods.
Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe.
arXiv Detail & Related papers (2023-10-10T00:03:23Z) - GeoCLIP: Clip-Inspired Alignment between Locations and Images for
Effective Worldwide Geo-localization [61.10806364001535]
Worldwide Geo-localization aims to pinpoint the precise location of images taken anywhere on Earth.
Existing approaches divide the globe into discrete geographic cells, transforming the problem into a classification task.
We propose GeoCLIP, a novel CLIP-inspired Image-to-GPS retrieval approach that enforces alignment between the image and its corresponding GPS locations.
arXiv Detail & Related papers (2023-09-27T20:54:56Z) - PIGEON: Predicting Image Geolocations [44.99833362998488]
We present a new geolocalization system that combines semantic geocell creation, multi-task contrastive pretraining, and a novel loss function.
PIGEOTTO is the first image geolocalization model that effectively generalizes to unseen places.
arXiv Detail & Related papers (2023-07-11T23:36:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.