CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks
- URL: http://arxiv.org/abs/2406.13945v2
- Date: Mon, 23 Dec 2024 14:10:09 GMT
- Title: CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks
- Authors: Jie Feng, Jun Zhang, Tianhui Liu, Xin Zhang, Tianjian Ouyang, Junbo Yan, Yuwei Du, Siqi Guo, Yong Li
- Abstract summary: Large language models (LLMs) with extensive general knowledge and powerful reasoning abilities have seen rapid development and widespread application. In this paper, we design CityBench, an interactive simulator-based evaluation platform. We design 8 representative urban tasks in 2 categories, perception-understanding and decision-making, as CityBench.
- Score: 10.22654338686634
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, large language models (LLMs) with extensive general knowledge and powerful reasoning abilities have seen rapid development and widespread application. A systematic and reliable evaluation of LLMs and vision-language models (VLMs) is a crucial step in applying and developing them for various fields. There have been some early explorations of the usability of LLMs for limited urban tasks, but a systematic and scalable evaluation benchmark is still lacking. The challenge in constructing a systematic evaluation benchmark for urban research lies in the diversity of urban data, the complexity of application scenarios, and the highly dynamic nature of the urban environment. In this paper, we design CityBench, an interactive simulator-based evaluation platform, as the first systematic benchmark for evaluating the capabilities of LLMs on diverse tasks in urban research. First, we build CityData to integrate the diverse urban data and CitySimu to simulate fine-grained urban dynamics. Based on CityData and CitySimu, we design 8 representative urban tasks in 2 categories, perception-understanding and decision-making, as CityBench. With extensive results from 30 well-known LLMs and VLMs in 13 cities around the world, we find that advanced LLMs and VLMs achieve competitive performance on diverse urban tasks requiring commonsense and semantic understanding abilities, e.g., understanding human dynamics and semantic inference over urban images. Meanwhile, they fail to solve challenging urban tasks requiring professional knowledge and high-level reasoning abilities, e.g., geospatial prediction and traffic control. These observations provide valuable perspectives for utilizing and developing LLMs in the future. The code is openly accessible at https://github.com/tsinghua-fib-lab/CityBench.
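
Both task categories reduce to the same evaluation pattern: drive a model through a simulator episode and score the outcome, repeated over models, tasks, and cities. The following is a minimal sketch of that loop; every name in it is hypothetical rather than the repository's actual API.

```python
# Minimal sketch of a simulator-based benchmark loop in the spirit of
# CityBench. Every name below (TaskResult, evaluate, the task/model
# callables) is hypothetical; consult the repository for the real API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskResult:
    task: str
    city: str
    model: str
    score: float

def evaluate(models: dict[str, Callable[[str], str]],
             tasks: dict[str, Callable[[Callable[[str], str], str], float]],
             cities: list[str]) -> list[TaskResult]:
    """Run every (model, task, city) combination and collect scores.

    Each task callable drives a model through an interactive episode
    (e.g., traffic-signal control in a simulator) and returns a scalar
    score, so perception-understanding and decision-making tasks share
    one interface.
    """
    results = []
    for model_name, model in models.items():
        for task_name, task in tasks.items():
            for city in cities:
                results.append(TaskResult(task_name, city, model_name,
                                          task(model, city)))
    return results
```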
Related papers
- Urban Computing in the Era of Large Language Models [41.50492781046065]
This survey explores the intersection of Large Language Models (LLMs) and urban computing.
We provide a concise overview of the evolution and core technologies of LLMs.
We survey their applications across key urban domains, such as transportation, public safety, and environmental monitoring.
arXiv Detail & Related papers (2025-04-02T05:12:13Z)
- Exploring the Roles of Large Language Models in Reshaping Transportation Systems: A Survey, Framework, and Roadmap [51.198001060683296]
Large Language Models (LLMs) offer transformative potential to address transportation challenges.
This survey first presents LLM4TR, a novel conceptual framework that systematically categorizes the roles of LLMs in transportation.
For each role, our review spans diverse applications, from traffic prediction and autonomous driving to safety analytics and urban mobility optimization.
arXiv Detail & Related papers (2025-03-27T11:56:27Z)
- Collaborative Imputation of Urban Time Series through Cross-city Meta-learning [54.438991949772145]
We propose a novel collaborative imputation paradigm leveraging meta-learned implicit neural representations (INRs).
We then introduce a cross-city collaborative learning scheme through model-agnostic meta learning.
Experiments on a diverse urban dataset from 20 global cities demonstrate our model's superior imputation performance and generalizability.
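
As a rough illustration of the scheme summarized above, the sketch below runs a first-order MAML loop over toy per-city implicit neural representations; the Fourier-feature INR, the data, and all shapes are simplifying assumptions, not the paper's implementation.

```python
# Rough sketch of cross-city meta-learning for implicit neural
# representations (INRs), in the spirit of first-order MAML. The linear
# Fourier-feature INR and the random data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def features(t):
    return np.sin(np.outer(t, np.arange(1, 9)))        # (n, 8) Fourier features

def grad(theta, t, y):
    # Gradient of the mean-squared error of the toy INR t -> features(t) @ theta.
    F = features(t)
    return F.T @ (F @ theta - y) / len(t)

def maml_step(theta, cities, inner_lr=0.1, outer_lr=0.05, k=3):
    """One meta-update: adapt to each city, then average the gradients
    at the adapted parameters (first-order MAML approximation)."""
    meta_grad = np.zeros_like(theta)
    for t, y in cities:                                 # city = (time coords, readings)
        phi = theta.copy()
        for _ in range(k):                              # inner-loop adaptation
            phi -= inner_lr * grad(phi, t, y)
        meta_grad += grad(phi, t, y)
    return theta - outer_lr * meta_grad / len(cities)

theta = 0.1 * rng.normal(size=8)                        # shared initialization
cities = [(rng.uniform(0, 1, 32), rng.normal(size=32)) for _ in range(5)]
for _ in range(200):                                    # meta-training loop
    theta = maml_step(theta, cities)
```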
arXiv Detail & Related papers (2025-01-20T07:12:40Z)
- VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks [100.3234156027118]
We present VLABench, an open-source benchmark for evaluating universal LCM task learning.
VLABench provides 100 carefully designed task categories, with strong randomization within each category and a total of 2000+ objects.
The benchmark assesses multiple competencies, including understanding of mesh and texture, spatial relationships, semantic instructions, physical laws, knowledge transfer, and reasoning.
arXiv Detail & Related papers (2024-12-24T06:03:42Z)
- What can LLM tell us about cities? [6.405546719612814]
This study explores the capabilities of large language models (LLMs) in providing knowledge about cities and regions on a global scale.
Experiments reveal that LLMs embed a broad but varying degree of knowledge across global cities, with ML models trained on LLM-derived features consistently leading to improved predictive accuracy.
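
The pipeline this finding suggests is simple: elicit numeric city attributes from an LLM, then fit an off-the-shelf regressor on them. The sketch below is a hypothetical illustration of that pattern; the prompt, the attribute names, and the `llm` callable are assumptions, not the study's setup.

```python
# Hypothetical pipeline behind "ML models trained on LLM-derived
# features": ask the LLM for numeric city attributes, then fit a
# standard regressor on the parsed replies.
from typing import Callable

def city_features(llm: Callable[[str], str], city: str) -> list[float]:
    """Ask the LLM for numeric estimates and parse them into a vector."""
    prompt = (f"For {city}, estimate: population in millions, GDP per "
              "capita in kUSD, mean annual temperature in Celsius. "
              "Reply with three comma-separated numbers only.")
    return [float(x) for x in llm(prompt).split(",")]

def fit_predictor(feature_rows: list[list[float]], targets: list[float]):
    from sklearn.linear_model import Ridge  # any standard regressor works
    return Ridge().fit(feature_rows, targets)
```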
arXiv Detail & Related papers (2024-11-25T09:07:56Z)
- OpenCity: A Scalable Platform to Simulate Urban Activities with Massive LLM Agents [10.919679349212426]
Large Language Models (LLMs) have led to the development of LLM agents capable of simulating urban activities with unprecedented realism.
We propose OpenCity, a scalable simulation platform optimized for both system and prompt efficiencies.
OpenCity achieves a 600-fold acceleration in simulation time per agent, a 70% reduction in LLM requests, and a 50% reduction in token usage.
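
One way to read the request reduction is as prompt batching: many agents share a single LLM call. The sketch below illustrates that general idea under stated assumptions; the prompt and response formats are invented for illustration, and this is not OpenCity's actual mechanism.

```python
# Illustrative prompt batching: one LLM request decides actions for a
# whole group of agents instead of one request per agent.
from typing import Callable

def batched_decisions(llm: Callable[[str], str],
                      observations: list[str],
                      batch_size: int = 10) -> list[str]:
    decisions = []
    for i in range(0, len(observations), batch_size):
        batch = observations[i:i + batch_size]
        prompt = ("Decide the next action for each agent, "
                  "one action per line:\n" +
                  "\n".join(f"{j + 1}. {obs}" for j, obs in enumerate(batch)))
        decisions.extend(llm(prompt).splitlines()[:len(batch)])
    return decisions
```

Batching k agents per call cuts the request count by roughly a factor of k, which is the flavor of saving the summary reports.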
arXiv Detail & Related papers (2024-10-11T13:52:35Z)
- UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios [60.492736455572015]
We present UrBench, a benchmark designed for evaluating LMMs in complex multi-view urban scenarios.
UrBench contains 11.6K meticulously curated questions at both region-level and role-level.
Our evaluations of 21 LMMs show that current LMMs struggle in urban environments in several respects.
arXiv Detail & Related papers (2024-08-30T13:13:35Z)
- CIBench: Evaluating Your LLMs with a Code Interpreter Plugin [68.95137938214862]
We propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks.
The evaluation dataset is constructed using an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive and interactive IPython sessions.
We conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in code interpreter utilization.
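
The interactive-session setup can be pictured as a loop in which the model emits a code cell, the cell runs in a persistent namespace, and the output or traceback is fed back for the next turn. The sketch below uses plain exec as a stand-in for an IPython kernel; all names are hypothetical, not CIBench's API.

```python
# Minimal stand-in for a consecutive, interactive code-interpreter
# evaluation: the model writes a cell, we execute it in a persistent
# namespace, and feed stdout/errors back as the next observation.
import io
import traceback
from contextlib import redirect_stdout
from typing import Callable

def run_session(model: Callable[[str], str], task: str, turns: int = 5) -> str:
    namespace: dict = {}
    transcript = f"Task: {task}\n"
    for _ in range(turns):
        code = model(transcript)               # model proposes the next cell
        buf = io.StringIO()
        try:
            with redirect_stdout(buf):
                exec(code, namespace)          # state persists across turns
            feedback = buf.getvalue() or "(no output)"
        except Exception:
            feedback = traceback.format_exc()  # errors go back to the model
        transcript += f"\n>>> {code}\n{feedback}"
    return transcript
```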
arXiv Detail & Related papers (2024-07-15T07:43:55Z)
- CityGPT: Empowering Urban Spatial Cognition of Large Language Models [7.40606412920065]
Large language models (LLMs) with powerful language generation and reasoning capabilities have already achieved success in many domains.
However, due to the lack of physical-world corpora and knowledge during training, they usually fail to solve many real-life tasks in urban space.
We propose CityGPT, a systematic framework for enhancing the capability of LLMs to understand urban space and solve related urban tasks.
arXiv Detail & Related papers (2024-06-20T02:32:16Z)
- UrbanLLM: Autonomous Urban Activity Planning and Management with Large Language Models [20.069378890478763]
UrbanLLM is a problem-solver that decomposes urban-related queries into manageable sub-tasks.
It identifies suitable AI models for each sub-task, and generates comprehensive responses to the given queries.
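
The decompose-then-route behavior reads as a small planner pattern: split the query into named sub-tasks, dispatch each to a registered specialist model, and stitch the partial answers together. Everything below (registry keys, prompt format) is a hypothetical illustration, not UrbanLLM's implementation.

```python
# Hypothetical decompose-and-route pattern: map each sub-task to a
# registered specialist model, then compose the partial answers.
from typing import Callable

REGISTRY: dict[str, Callable[[str], str]] = {
    "traffic_forecast": lambda q: f"[traffic model answers: {q}]",
    "poi_recommendation": lambda q: f"[POI model answers: {q}]",
}

def answer(llm: Callable[[str], str], query: str) -> str:
    # Ask the planner LLM for sub-tasks as "name: sub-query" lines.
    plan = llm(f"Decompose into sub-tasks ({', '.join(REGISTRY)}), "
               f"one 'name: sub-query' per line: {query}")
    parts = []
    for line in plan.splitlines():
        name, _, sub = line.partition(":")
        model = REGISTRY.get(name.strip())
        if model:
            parts.append(model(sub.strip()))
    return " ".join(parts)
```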
arXiv Detail & Related papers (2024-06-18T07:41:42Z)
- Urban Generative Intelligence (UGI): A Foundational Platform for Agents in Embodied City Environment [32.53845672285722]
Urban environments, characterized by their complex, multi-layered networks, face significant challenges under rapid urbanization.
Recent developments in big data, artificial intelligence, urban computing, and digital twins have laid the groundwork for sophisticated city modeling and simulation.
This paper proposes Urban Generative Intelligence (UGI), a novel foundational platform integrating Large Language Models (LLMs) into urban systems.
arXiv Detail & Related papers (2023-12-19T03:12:13Z)
- EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios.
We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning.
We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
arXiv Detail & Related papers (2023-12-11T03:35:58Z)
- The Urban Toolkit: A Grammar-based Framework for Urban Visual Analytics [5.674216760436341]
The complex nature of urban issues and the overwhelming amount of available data have posed significant challenges in translating these efforts into actionable insights.
When analyzing a feature of interest, an urban expert must transform, integrate, and visualize different thematic (e.g., sunlight access, demographic) and physical (e.g., buildings, street networks) data layers.
This makes visual data exploration and system implementation difficult for programmers and sets a high entry barrier for urban experts outside computer science.
arXiv Detail & Related papers (2023-08-15T13:43:04Z)
- A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
- KoLA: Carefully Benchmarking World Knowledge of Large Language Models [87.96683299084788]
We construct a Knowledge-oriented LLM Assessment benchmark (KoLA).
We mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks.
We use both Wikipedia, a corpus on which LLMs are prevalently pre-trained, and continuously collected emerging corpora to evaluate the capacity to handle unseen data and evolving knowledge.
arXiv Detail & Related papers (2023-06-15T17:20:46Z)
- LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models [55.304181390027274]
This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation hub (LVLM-eHub).
Our LVLM-eHub consists of 8 representative LVLMs, such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated through a quantitative capability evaluation and an online arena platform.
The study reveals several innovative findings. First, instruction-tuned LVLMs with massive in-domain data, such as InstructBLIP, heavily overfit many existing tasks and generalize poorly in open-world scenarios.
arXiv Detail & Related papers (2023-06-15T16:39:24Z)
- On the Planning Abilities of Large Language Models (A Critical Investigation with a Proposed Benchmark) [30.223130782579336]
We develop a benchmark suite based on the kinds of domains employed in the International Planning Competition.
We evaluate LLMs in three modes: autonomous, heuristic, and human-in-the-loop.
Our results show that LLMs' ability to autonomously generate executable plans is quite meager, averaging only about a 3% success rate.
arXiv Detail & Related papers (2023-02-13T21:37:41Z)
- Methodological Foundation of a Numerical Taxonomy of Urban Form [62.997667081978825]
We present a method for numerical taxonomy of urban form derived from biological systematics.
We derive homogeneous urban tissue types and, by determining overall morphological similarity between them, generate a hierarchical classification of urban form.
After framing and presenting the method, we test it on two cities, Prague and Amsterdam.
arXiv Detail & Related papers (2021-04-30T12:47:52Z)
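
The pipeline, measuring morphometric characters per spatial unit and deriving tissue types from overall similarity, maps naturally onto agglomerative clustering. The following is a generic sketch with made-up features, not the paper's exact method.

```python
# Generic illustration of a numerical taxonomy of urban form:
# morphometric characters per spatial unit -> standardize -> Ward
# linkage -> cut the dendrogram into tissue types. Features are made up.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Rows: spatial units; columns: morphometric characters
# (e.g., building height, coverage ratio, street width) -- illustrative.
X = rng.normal(size=(200, 3))
Z = (X - X.mean(axis=0)) / X.std(axis=0)                  # standardize characters
tree = linkage(Z, method="ward")                          # morphological similarity
tissue_types = fcluster(tree, t=8, criterion="maxclust")  # cut into 8 types
print(np.bincount(tissue_types)[1:])                      # units per tissue type
```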