Around the World in 24 Hours: Probing LLM Knowledge of Time and Place
- URL: http://arxiv.org/abs/2506.03984v1
- Date: Wed, 04 Jun 2025 14:14:28 GMT
- Title: Around the World in 24 Hours: Probing LLM Knowledge of Time and Place
- Authors: Carolin Holtermann, Paul Röttger, Anne Lauscher
- Abstract summary: We present the first evaluation of the ability of language models to jointly reason over time and space. We evaluate eight open chat models of three different model families for different combinations of temporal and geographic knowledge. We do not find clear correlations of performance with specific geographic regions.
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Reasoning over time and space is essential for understanding our world. However, the abilities of language models in this area are largely unexplored as previous work has tested their abilities for logical reasoning in terms of time and space in isolation or only in simple or artificial environments. In this paper, we present the first evaluation of the ability of language models to jointly reason over time and space. To enable our analysis, we create GeoTemp, a dataset of 320k prompts covering 289 cities in 217 countries and 37 time zones. Using GeoTemp, we evaluate eight open chat models of three different model families for different combinations of temporal and geographic knowledge. We find that most models perform well on reasoning tasks involving only temporal knowledge and that overall performance improves with scale. However, performance remains constrained in tasks that require connecting temporal and geographical information. We do not find clear correlations of performance with specific geographic regions. Instead, we find a significant performance increase for location names with low model perplexity, suggesting their repeated occurrence during model training. We further demonstrate that their performance is heavily influenced by prompt formulation - a direct injection of geographical knowledge leads to performance gains, whereas, surprisingly, techniques like chain-of-thought prompting decrease performance on simpler tasks.
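The joint reasoning that GeoTemp targets can be illustrated with a minimal sketch: given two cities and their IANA time zones, computing the ground-truth answer to a question like "if it is 14:00 in Berlin, what time is it in Tokyo?" is mechanical with Python's standard-library `zoneinfo`, whereas a language model must itself recall the city-to-zone mapping (geographic knowledge) and then apply the offset (temporal reasoning). The function and city pairing below are illustrative, not drawn from the actual dataset.

```python
# Illustrative sketch (not the GeoTemp format): computing the
# ground-truth answer to a joint time-and-place question using the
# standard-library zoneinfo time zone database.
from datetime import datetime
from zoneinfo import ZoneInfo

def cross_city_time(city_a: str, tz_a: str, city_b: str, tz_b: str,
                    hour: int, minute: int = 0) -> str:
    """If it is hour:minute in city_a, what time is it in city_b?"""
    # Anchor on a fixed date so daylight-saving rules are well defined.
    local = datetime(2025, 6, 4, hour, minute, tzinfo=ZoneInfo(tz_a))
    remote = local.astimezone(ZoneInfo(tz_b))
    return remote.strftime("%H:%M")

# Berlin is UTC+2 (CEST) in June, Tokyo UTC+9, so 14:00 -> 21:00.
print(cross_city_time("Berlin", "Europe/Berlin", "Tokyo", "Asia/Tokyo", 14))
```

A benchmark generator only needs a city-to-zone table to produce such question-answer pairs at scale; the hard part for a model is the table itself.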
Related papers
- GTPred: Benchmarking MLLMs for Interpretable Geo-localization and Time-of-capture Prediction
We introduce GTPred, a novel benchmark for geo-temporal prediction. We evaluate MLLM predictions by jointly considering year and hierarchical location sequence matching. Results also demonstrate that incorporating temporal information significantly enhances location inference performance.
arXiv Detail & Related papers (2026-01-19T16:34:25Z) - Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales
Vision-language models (VLMs) have advanced rapidly, yet their capacity for image-grounded geolocation in open-world conditions has not been comprehensively evaluated. We present EarthWhere, a comprehensive benchmark for VLM image geolocation that evaluates visual recognition, step-by-step reasoning, and evidence use.
arXiv Detail & Related papers (2025-10-13T01:12:21Z) - GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains
Geo Reason Enhancement (GRE) Suite is a novel framework that augments Visual Language Models with structured reasoning chains for interpretable location inference. First, we introduce GRE30K, a high-quality geo-localization reasoning dataset designed to facilitate fine-grained visual and contextual analysis. Next, we present the GRE model, which employs a multi-stage reasoning strategy to progressively infer scene attributes, local details, and semantic features, thereby narrowing down potential geographic regions with enhanced precision.
arXiv Detail & Related papers (2025-05-24T13:48:57Z) - Geospatial Mechanistic Interpretability of Large Language Models
Large Language Models (LLMs) have demonstrated unprecedented capabilities across various natural language processing tasks. Our aim is to advance our understanding of the internal representations that these complex models generate while processing geographical information.
arXiv Detail & Related papers (2025-05-06T09:40:06Z) - TiEBe: Tracking Language Model Recall of Notable Worldwide Events Through Time
We present TiEBe, a dataset of over 23,000 question-answer pairs centered on notable global and regional events. These events are then used to construct a benchmark to evaluate LLMs' understanding of global and regional developments. Our results reveal significant geographic disparities in factual recall, emphasizing the need for more balanced global representation.
arXiv Detail & Related papers (2025-01-13T16:58:32Z) - GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks
We present GEOBench-VLM, a benchmark specifically designed to evaluate Vision-Language Models (VLMs) on geospatial tasks. Our benchmark features over 10,000 manually verified instructions spanning diverse visual conditions, object types, and scales. We evaluate several state-of-the-art VLMs to assess performance on geospatial-specific challenges.
arXiv Detail & Related papers (2024-11-28T18:59:56Z) - Causal Representation Learning in Temporal Data via Single-Parent Decoding
Scientific research often seeks to understand the causal structure underlying high-level variables in a system.
Scientists typically collect low-level measurements, such as geographically distributed temperature readings.
We propose a differentiable method, Causal Discovery with Single-parent Decoding, that simultaneously learns the underlying latents and a causal graph over them.
arXiv Detail & Related papers (2024-10-09T15:57:50Z) - Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time
In real-world scenarios, the correctness of answers is frequently tied to temporal context. We present a novel framework and dataset spanning over 8,000 events from 2018 to 2024. Our work provides a significant step toward advancing time-aware language models.
arXiv Detail & Related papers (2024-09-20T08:57:20Z) - Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework
We introduce smileGeo, a novel visual geo-localization framework.
By inter-agent communication, smileGeo integrates the inherent knowledge of these agents with additional retrieved information.
Results show that our approach significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2024-08-21T03:31:30Z) - SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models
We introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities.
During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances.
Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts.
arXiv Detail & Related papers (2024-06-03T17:59:06Z) - Distortions in Judged Spatial Relations in Large Language Models
GPT-4 exhibited superior performance with 55 percent accuracy, followed by GPT-3.5 at 47 percent, and Llama-2 at 45 percent.
The models identified the nearest cardinal direction in most cases, reflecting their associative learning mechanism.
arXiv Detail & Related papers (2024-01-08T20:08:04Z) - GeoLLM: Extracting Geospatial Knowledge from Large Language Models
We present GeoLLM, a novel method that can effectively extract geospatial knowledge from large language models.
We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods.
Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe.
arXiv Detail & Related papers (2023-10-10T00:03:23Z) - GeoGLUE: A GeoGraphic Language Understanding Evaluation Benchmark
We propose a GeoGraphic Language Understanding Evaluation benchmark, named GeoGLUE.
We collect data from open-released geographic resources and introduce six natural language understanding tasks.
We provide evaluation experiments and analysis of general baselines, indicating the effectiveness and significance of the GeoGLUE benchmark.
arXiv Detail & Related papers (2023-05-11T03:21:56Z) - Geographic Adaptation of Pretrained Language Models
We introduce geoadaptation, an intermediate training step that couples language modeling with geolocation prediction in a multi-task learning setup.
We show that the effectiveness of geoadaptation stems from its ability to geographically retrofit the representation space of the pretrained language models.
arXiv Detail & Related papers (2022-03-16T11:55:00Z) - Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning
We construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models' ability to understand cultural and geo-location-specific commonsense.
We find that the performance of both models for non-Western regions, including East Asia, South Asia, and Africa, is significantly lower than that for Western regions.
arXiv Detail & Related papers (2021-09-14T17:52:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.