UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models
- URL: http://arxiv.org/abs/2504.21027v1
- Date: Wed, 23 Apr 2025 13:53:59 GMT
- Title: UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models
- Authors: Yu Zheng, Longyi Liu, Yuming Lin, Jie Feng, Guozhen Zhang, Depeng Jin, Yong Li
- Abstract summary: We introduce a benchmark, UrbanPlanBench, to evaluate the efficacy of Large Language Models (LLMs) in urban planning. We reveal a significant imbalance in the acquisition of planning knowledge among LLMs, with even the most proficient models falling short of meeting professional standards. We also present the largest-ever supervised fine-tuning dataset, UrbanPlanText, comprising over 30,000 instruction pairs sourced from urban planning exams and textbooks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advent of Large Language Models (LLMs) holds promise for revolutionizing various fields traditionally dominated by human expertise. Urban planning, a professional discipline that fundamentally shapes our daily surroundings, is one such field, relying heavily on the multifaceted domain knowledge and experience of human experts. The extent to which LLMs can assist human practitioners in urban planning remains largely unexplored. In this paper, we introduce a comprehensive benchmark, UrbanPlanBench, tailored to evaluate the efficacy of LLMs in urban planning, which encompasses fundamental principles, professional knowledge, and management and regulations, aligning closely with the qualifications expected of human planners. Through extensive evaluation, we reveal a significant imbalance in the acquisition of planning knowledge among LLMs, with even the most proficient models falling short of meeting professional standards. For instance, we observe that 70% of LLMs achieve subpar performance in understanding planning regulations compared to other aspects. Besides the benchmark, we present the largest-ever supervised fine-tuning (SFT) dataset, UrbanPlanText, comprising over 30,000 instruction pairs sourced from urban planning exams and textbooks. Our findings demonstrate that fine-tuned models exhibit enhanced performance in memorization tests and comprehension of urban planning knowledge, though significant room for improvement remains, particularly in tasks requiring domain-specific terminology and reasoning. By making our benchmark, dataset, and associated evaluation and fine-tuning toolsets publicly available at https://github.com/tsinghua-fib-lab/PlanBench, we aim to catalyze the integration of LLMs into practical urban planning, fostering a symbiotic collaboration between human expertise and machine intelligence.
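Since the evaluation and fine-tuning toolsets are released at the repository above, the data shapes the abstract describes can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration only: the record fields (`instruction`/`input`/`output` for an UrbanPlanText-style SFT pair; `subject`, `question`, `options`, `answer` for an exam-style benchmark item) and the per-subject scoring function are assumptions for illustration, not the actual PlanBench schema or evaluator.

```python
import json
from collections import defaultdict

# Hypothetical record layouts for illustration only; the actual
# UrbanPlanBench / UrbanPlanText schemas in the PlanBench repository may differ.

# An SFT record in the instruction-pair style the abstract describes
# (a question sourced from an exam or textbook, paired with its answer).
SFT_EXAMPLE = {
    "instruction": "Answer the following urban planning exam question.",
    "input": "Which indicator most directly limits residential development intensity?",
    "output": "Floor area ratio (FAR).",
}

# Exam-style multiple-choice items, grouped by the three subjects
# named in the abstract.
BENCH_ITEMS = [
    {
        "subject": "management_and_regulations",
        "question": "Which document typically has statutory force over land use?",
        "options": {"A": "A detailed regulatory plan", "B": "A design sketchbook",
                    "C": "A marketing brochure", "D": "A traffic survey memo"},
        "answer": "A",
    },
]


def score_by_subject(items, predictions):
    """Per-subject accuracy, given predictions mapping question -> option letter."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        subj = item["subject"]
        total[subj] += 1
        if predictions.get(item["question"]) == item["answer"]:
            correct[subj] += 1
    return {subj: correct[subj] / total[subj] for subj in total}


if __name__ == "__main__":
    preds = {BENCH_ITEMS[0]["question"]: "A"}  # stand-in for a model's choices
    print(json.dumps(SFT_EXAMPLE, indent=2))
    print(json.dumps(score_by_subject(BENCH_ITEMS, preds), indent=2))
```

Reporting accuracy per subject rather than as a single aggregate mirrors the abstract's observation that regulatory knowledge lags behind the other two areas.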
Related papers
- Plan Your Travel and Travel with Your Plan: Wide-Horizon Planning and Evaluation via LLM [58.50687282180444]
Travel planning is a complex task requiring the integration of diverse real-world information and user preferences. We formulate this as an $L^3$ planning problem, emphasizing long context, long instruction, and long output. We introduce Multiple Aspects of Planning (MAoP), enabling LLMs to conduct wide-horizon thinking to solve complex planning problems.
arXiv Detail & Related papers (2025-06-14T09:37:59Z) - EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios [53.26658545922884]
We introduce EgoPlan-Bench2, a benchmark designed to assess the planning capabilities of MLLMs across a wide range of real-world scenarios. We evaluate 21 competitive MLLMs and provide an in-depth analysis of their limitations, revealing that they face significant challenges in real-world planning. Our approach enhances the performance of GPT-4V by 10.24 on EgoPlan-Bench2 without additional training.
arXiv Detail & Related papers (2024-12-05T18:57:23Z) - Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge Graphs [59.76268575344119]
We introduce a novel framework for enhancing large language models' (LLMs) planning capabilities by using planning data derived from knowledge graphs (KGs).
LLMs fine-tuned with KG data have improved planning capabilities, better equipping them to handle complex QA tasks that involve retrieval.
arXiv Detail & Related papers (2024-06-20T13:07:38Z) - CityGPT: Empowering Urban Spatial Cognition of Large Language Models [7.40606412920065]
Large language models (LLMs) with powerful language generation and reasoning capabilities have already achieved success in many domains.
However, because their training corpora lack physical-world knowledge, they usually fail at many real-life tasks in urban space.
We propose CityGPT, a systematic framework for enhancing the capability of LLMs to understand urban space and solve related urban tasks.
arXiv Detail & Related papers (2024-06-20T02:32:16Z) - CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks [10.22654338686634]
Large language models (LLMs) with extensive general knowledge and powerful reasoning abilities have seen rapid development and widespread application. In this paper, we design CityBench, an interactive simulator-based evaluation platform. We design 8 representative urban tasks across 2 categories, perception-understanding and decision-making, to form CityBench.
arXiv Detail & Related papers (2024-06-20T02:25:07Z) - Exploring and Benchmarking the Planning Capabilities of Large Language Models [57.23454975238014]
This work lays the foundations for improving the planning capabilities of large language models (LLMs).
We construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios.
We investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance.
arXiv Detail & Related papers (2024-06-18T22:57:06Z) - PlanGPT: Enhancing Urban Planning with Tailored Language Model and Efficient Retrieval [8.345858904808873]
General-purpose large language models often struggle to meet the specific needs of planners.
PlanGPT is the first specialized Large Language Model tailored for urban and spatial planning.
arXiv Detail & Related papers (2024-02-29T15:41:20Z) - Large language model empowered participatory urban planning [5.402147437950729]
This research introduces an innovative urban planning approach integrating Large Language Models (LLMs) within the participatory process.
The framework, based on the crafted LLM agent, consists of role-play, collaborative generation, and feedback, solving a community-level land-use task catering to 1000 distinct interests.
arXiv Detail & Related papers (2024-01-24T10:50:01Z) - LLM-Assist: Enhancing Closed-Loop Planning with Language-Based Reasoning [65.86754998249224]
We develop a novel hybrid planner that leverages a conventional rule-based planner in conjunction with an LLM-based planner.
Our approach navigates complex scenarios that existing planners struggle with, producing well-reasoned outputs while remaining grounded by working alongside the rule-based planner.
arXiv Detail & Related papers (2023-12-30T02:53:45Z) - EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios.
We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning.
We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
arXiv Detail & Related papers (2023-12-11T03:35:58Z) - On the Planning Abilities of Large Language Models (A Critical Investigation with a Proposed Benchmark) [30.223130782579336]
We develop a benchmark suite based on the kinds of domains employed in the International Planning Competition.
We evaluate LLMs in three modes: autonomous, heuristic, and human-in-the-loop.
Our results show that LLMs' ability to autonomously generate executable plans is quite meager, averaging only about a 3% success rate.
arXiv Detail & Related papers (2023-02-13T21:37:41Z)