USTBench: Benchmarking and Dissecting Spatiotemporal Reasoning of LLMs as Urban Agents
- URL: http://arxiv.org/abs/2505.17572v1
- Date: Fri, 23 May 2025 07:30:57 GMT
- Title: USTBench: Benchmarking and Dissecting Spatiotemporal Reasoning of LLMs as Urban Agents
- Authors: Siqi Lai, Yansong Ning, Zirui Yuan, Zhixi Chen, Hao Liu
- Abstract summary: Large language models (LLMs) have shown emerging potential in spatiotemporal reasoning, making them promising candidates for building urban agents that support diverse urban downstream applications. Existing studies evaluate urban agents only at the outcome level, offering limited insight into their underlying reasoning processes. As a result, the strengths and limitations of urban agents in spatiotemporal reasoning remain poorly understood. USTBench is the first benchmark to evaluate LLMs' spatiotemporal reasoning abilities as urban agents across four dimensions: spatiotemporal understanding, forecasting, planning, and reflection with feedback.
- Score: 6.054990893127997
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have shown emerging potential in spatiotemporal reasoning, making them promising candidates for building urban agents that support diverse urban downstream applications. Despite these benefits, existing studies primarily focus on evaluating urban LLM agents on outcome-level metrics (e.g., prediction accuracy, traffic efficiency), offering limited insight into their underlying reasoning processes. As a result, the strengths and limitations of urban LLM agents in spatiotemporal reasoning remain poorly understood. To this end, we introduce USTBench, the first benchmark to evaluate LLMs' spatiotemporal reasoning abilities as urban agents across four decomposed dimensions: spatiotemporal understanding, forecasting, planning, and reflection with feedback. Specifically, USTBench supports five diverse urban decision-making and four spatiotemporal prediction tasks, all running within our constructed interactive city environment UAgentEnv. The benchmark includes 62,466 structured QA pairs for process-level evaluation and standardized end-to-end task assessments, enabling fine-grained diagnostics and broad task-level comparison across diverse urban scenarios. Through extensive evaluation of thirteen leading LLMs, we reveal that although LLMs show promising potential across various urban downstream tasks, they still struggle in long-horizon planning and reflective adaptation in dynamic urban contexts. Notably, recent advanced reasoning models (e.g., DeepSeek-R1) trained on general logic or mathematical problems do not consistently outperform non-reasoning LLMs. This discrepancy highlights the need for domain-specialized adaptation methods to enhance urban spatiotemporal reasoning. Overall, USTBench provides a foundation to build more adaptive and effective LLM-based urban agents and broad smart city applications.
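The abstract describes process-level evaluation over 62,466 structured QA pairs spanning four reasoning dimensions. The sketch below is only an illustration of what such a per-dimension evaluation loop could look like; it is not the authors' released code. The QAPair schema, the evaluate helper, and the exact-match scoring rule are all assumptions, and USTBench's actual data format and metrics may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The four process-level dimensions named in the abstract.
DIMENSIONS = ["understanding", "forecasting", "planning", "reflection"]

@dataclass
class QAPair:
    """One structured QA item for process-level evaluation (hypothetical schema)."""
    dimension: str  # one of DIMENSIONS
    task: str       # e.g., a traffic-signal-control or flow-prediction scenario
    question: str
    answer: str     # reference answer used for scoring

def evaluate(model: Callable[[str], str], qa_pairs: List[QAPair]) -> Dict[str, float]:
    """Return per-dimension accuracy under a simple exact-match criterion (an assumption)."""
    correct: Dict[str, int] = {d: 0 for d in DIMENSIONS}
    total: Dict[str, int] = {d: 0 for d in DIMENSIONS}
    for qa in qa_pairs:
        total[qa.dimension] += 1
        if model(qa.question).strip().lower() == qa.answer.strip().lower():
            correct[qa.dimension] += 1
    return {d: correct[d] / total[d] for d in DIMENSIONS if total[d] > 0}

if __name__ == "__main__":
    # Toy usage with a stub "model" that always answers "increase".
    sample = [
        QAPair("forecasting", "traffic-flow", "Will flow on segment A rise next hour?", "increase"),
        QAPair("planning", "signal-control", "Which phase should be extended to cut queue length?", "phase 2"),
    ]
    print(evaluate(lambda q: "increase", sample))
```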
Related papers
- Large Language Model Powered Intelligent Urban Agents: Concepts, Capabilities, and Applications [11.994794218481122]
Large Language Models (LLMs) have opened new ways toward realizing the vision of intelligent cities. In this article, we focus on Urban LLM Agents, which are semi-embodied within the hybrid cyber-physical-social space of cities.
arXiv Detail & Related papers (2025-07-01T16:18:29Z) - Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation [75.26829371493189]
Large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability. We propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework.
arXiv Detail & Related papers (2025-06-03T09:01:08Z) - UrbanMind: Urban Dynamics Prediction with Multifaceted Spatial-Temporal Large Language Models [18.051209616917042]
UrbanMind is a novel spatial-temporal LLM framework for multifaceted urban dynamics prediction. At its core, UrbanMind introduces Muffin-MAE, a multifaceted fusion masked autoencoder with specialized masking strategies. Experiments on real-world urban datasets across multiple cities demonstrate that UrbanMind consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-05-16T19:38:06Z) - UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models [26.94010977379045]
We introduce a benchmark, UrbanPlanBench, to evaluate the efficacy of Large Language Models (LLMs) in urban planning. We reveal a significant imbalance in the acquisition of planning knowledge among LLMs, with even the most proficient models falling short of meeting professional standards. We present the largest-ever supervised fine-tuning dataset, UrbanPlanText, comprising over 30,000 instruction pairs sourced from urban planning exams and textbooks.
arXiv Detail & Related papers (2025-04-23T13:53:59Z) - Large Reasoning Models in Agent Scenarios: Exploring the Necessity of Reasoning Capabilities [74.35956310688164]
We propose the LaRMA framework, encompassing nine tasks across Tool Usage, Plan Design, and Problem Solving. Our findings address four research questions: LRMs surpass LLMs in reasoning-intensive tasks like Plan Design, leveraging iterative reflection for superior outcomes. LRMs' enhanced reasoning incurs higher computational costs, prolonged processing, and behavioral challenges, including overthinking and fact-ignoring tendencies.
arXiv Detail & Related papers (2025-03-14T04:34:31Z) - Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs. LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data. Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z) - UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios [60.492736455572015]
We present UrBench, a benchmark designed for evaluating LMMs in complex multi-view urban scenarios. UrBench contains 11.6K meticulously curated questions at both region-level and role-level. Our evaluations on 21 LMMs show that current LMMs struggle in urban environments in several aspects.
arXiv Detail & Related papers (2024-08-30T13:13:35Z) - CityGPT: Empowering Urban Spatial Cognition of Large Language Models [7.40606412920065]
Large language models (LLMs) with powerful language generation and reasoning capabilities have already achieved success in many domains.
However, because they lack a physical-world corpus and knowledge during training, they usually fail to solve many real-life tasks in urban space.
We propose CityGPT, a systematic framework for enhancing the capability of LLMs on understanding urban space and solving the related urban tasks.
arXiv Detail & Related papers (2024-06-20T02:32:16Z) - CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks [10.22654338686634]
Large language models (LLMs) with extensive general knowledge and powerful reasoning abilities have seen rapid development and widespread application. In this paper, we design CityBench, an interactive, simulator-based evaluation platform. CityBench comprises 8 representative urban tasks in 2 categories: perception-understanding and decision-making.
arXiv Detail & Related papers (2024-06-20T02:25:07Z) - Exploring and Benchmarking the Planning Capabilities of Large Language Models [57.23454975238014]
This work lays the foundations for improving the planning capabilities of large language models (LLMs).
We construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios.
We investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance.
arXiv Detail & Related papers (2024-06-18T22:57:06Z) - Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)