PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models
- URL: http://arxiv.org/abs/2505.14481v2
- Date: Wed, 21 May 2025 05:04:58 GMT
- Title: PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models
- Authors: He Zhu, Junyou Su, Minxin Chen, Wen Wang, Yijie Deng, Guanhua Chen, Wenjia Zhang
- Abstract summary: We introduce PlanGPT-VL, the first domain-specific Vision-Language Model tailored specifically for urban planning maps. PlanGPT-VL employs three innovative approaches: (1) PlanAnno-V framework for high-quality VQA data synthesis, (2) Critical Point Thinking to reduce hallucinations through structured verification, and (3) comprehensive training methodology combining Supervised Fine-Tuning with frozen vision encoder parameters.
- Score: 10.56421857293621
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the field of urban planning, existing Vision-Language Models (VLMs) frequently fail to effectively analyze and evaluate planning maps, despite the critical importance of these visual elements for urban planners and related educational contexts. Planning maps, which visualize land use, infrastructure layouts, and functional zoning, require specialized understanding of spatial configurations, regulatory requirements, and multi-scale analysis. To address this challenge, we introduce PlanGPT-VL, the first domain-specific Vision-Language Model tailored specifically for urban planning maps. PlanGPT-VL employs three innovative approaches: (1) PlanAnno-V framework for high-quality VQA data synthesis, (2) Critical Point Thinking to reduce hallucinations through structured verification, and (3) comprehensive training methodology combining Supervised Fine-Tuning with frozen vision encoder parameters. Through systematic evaluation on our proposed PlanBench-V benchmark, we demonstrate that PlanGPT-VL significantly outperforms general-purpose state-of-the-art VLMs in specialized planning map interpretation tasks, offering urban planning professionals a reliable tool for map analysis, assessment, and educational applications while maintaining high factual accuracy. Our lightweight 7B parameter model achieves comparable performance to models exceeding 72B parameters, demonstrating efficient domain specialization without sacrificing performance.
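The training recipe named in the abstract, Supervised Fine-Tuning with frozen vision encoder parameters, can be pictured with a minimal sketch. The base checkpoint, the "vision_tower" module-name check, and the loss handling below are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of SFT with a frozen vision encoder (illustrative only).
# The backbone and the vision-module name are placeholder assumptions.
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

base = "llava-hf/llava-1.5-7b-hf"            # placeholder 7B VLM backbone
model = AutoModelForVision2Seq.from_pretrained(base)
processor = AutoProcessor.from_pretrained(base)

# Freeze every vision-encoder parameter; only the language side is updated.
for name, param in model.named_parameters():
    if "vision_tower" in name:               # module naming varies by model
        param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)

def sft_step(batch):
    """One supervised step on a planning-map VQA batch (images + QA text)."""
    inputs = processor(text=batch["texts"], images=batch["images"],
                       return_tensors="pt", padding=True)
    out = model(**inputs, labels=inputs["input_ids"])  # causal-LM loss on the answer text
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```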
Related papers
- Reinforced Reasoning for Embodied Planning [18.40186665383579]
Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and natural language goals. We introduce a reinforcement fine-tuning framework that brings R1-style reasoning enhancement into embodied planning.
arXiv Detail & Related papers (2025-05-28T07:21:37Z)
- Evaluating Vision-Language Models as Evaluators in Path Planning [13.391755396500155]
Large language models (LLMs) have been shown to have limited effectiveness in end-to-end planning. We introduce PathEval, a novel benchmark evaluating VLMs as plan evaluators in complex path-planning scenarios. Our analysis reveals that these models face significant challenges on the benchmark.
arXiv Detail & Related papers (2024-11-27T19:32:03Z)
- Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos [48.15438373870542]
VidAssist is an integrated framework designed for zero/few-shot goal-oriented planning in instructional videos.
It employs a breadth-first search algorithm for optimal plan generation.
Experiments demonstrate that VidAssist offers a unified framework for different goal-oriented planning setups.
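As a rough illustration of the breadth-first plan search mentioned above, the sketch below expands partial plans level by level. The `propose_steps` and `score_plan` callables are hypothetical stand-ins for the LLM propose/assess calls, not VidAssist's actual interface, and the beam pruning is an added simplification.

```python
# Illustrative sketch only: breadth-first search over candidate plans, in the
# spirit of a propose-assess-search loop. Not the paper's implementation.
from collections import deque

def bfs_plan(goal, propose_steps, score_plan, max_depth=4, beam=3):
    """Expand partial plans level by level and return the best-scoring one."""
    best_plan, best_score = [], float("-inf")
    frontier = deque([[]])                      # start from the empty plan
    for _ in range(max_depth):
        scored = []
        while frontier:
            plan = frontier.popleft()
            for step in propose_steps(goal, plan):    # LLM proposes next steps
                candidate = plan + [step]
                score = score_plan(goal, candidate)   # LLM assesses the plan
                scored.append((score, candidate))
                if score > best_score:
                    best_score, best_plan = score, candidate
        # keep only the top-`beam` partial plans for the next level (simplification)
        scored.sort(key=lambda x: x[0], reverse=True)
        frontier = deque(c for _, c in scored[:beam])
    return best_plan
```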
arXiv Detail & Related papers (2024-09-30T17:57:28Z)
- On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability [59.72892401927283]
We evaluate the planning capabilities of OpenAI's o1 models across a variety of benchmark tasks.
Our results reveal that o1-preview outperforms GPT-4 in adhering to task constraints.
arXiv Detail & Related papers (2024-09-30T03:58:43Z)
- VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs).
One understudied capability in VLMs is visual spatial planning.
Our study introduces a benchmark that evaluates the spatial planning capability of these models in general.
arXiv Detail & Related papers (2024-07-02T00:24:01Z)
- Exploring and Benchmarking the Planning Capabilities of Large Language Models [57.23454975238014]
This work lays the foundations for improving the planning capabilities of large language models (LLMs).
We construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios.
We investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance.
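A minimal sketch of how many-shot in-context learning for planning can be set up is shown below; the example format and the `llm_complete` call are assumptions for illustration, not the benchmark's actual interface.

```python
# Illustrative sketch only: assembling a many-shot in-context planning prompt
# and sweeping the number of shots to relate context length to plan quality.
def build_many_shot_prompt(examples, new_task, n_shots=50):
    """Concatenate n_shots solved planning examples before the new task."""
    shots = [f"Task:\n{ex['task']}\nPlan:\n{ex['plan']}" for ex in examples[:n_shots]]
    return "\n\n".join(shots + [f"Task:\n{new_task}\nPlan:"])

# Hypothetical usage:
# for n in (1, 5, 25, 100):
#     prompt = build_many_shot_prompt(solved_examples, unseen_task, n_shots=n)
#     plan = llm_complete(prompt)   # hypothetical LLM call
#     evaluate(plan)                # hypothetical plan checker
```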
arXiv Detail & Related papers (2024-06-18T22:57:06Z)
- Socratic Planner: Self-QA-Based Zero-Shot Planning for Embodied Instruction Following [17.608330952846075]
Embodied Instruction Following (EIF) is the task of executing natural language instructions by navigating and interacting with objects in interactive environments. A key challenge in EIF is compositional task planning, typically addressed through supervised learning or few-shot in-context learning with labeled data. We introduce the Socratic Planner, a self-QA-based zero-shot planning method that infers an appropriate plan without any further training.
arXiv Detail & Related papers (2024-04-21T08:10:20Z)
- PlanGPT: Enhancing Urban Planning with Tailored Language Model and Efficient Retrieval [8.345858904808873]
General-purpose large language models often struggle to meet the specific needs of planners.
PlanGPT is the first specialized Large Language Model tailored for urban and spatial planning.
arXiv Detail & Related papers (2024-02-29T15:41:20Z)
- PAS-SLAM: A Visual SLAM System for Planar Ambiguous Scenes [41.47703182059505]
We propose a visual SLAM system based on planar features designed for planar ambiguous scenes.
We present an integrated data association strategy that combines plane parameters, semantic information, projection IoU, and non-parametric tests.
Finally, we design a set of multi-constraint factor graphs for camera pose optimization.
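To make the data association strategy above concrete, here is a minimal sketch of scoring a candidate match between an observed plane and a mapped plane; the weights, the field names, and the omission of the non-parametric test are assumptions for illustration, not the PAS-SLAM implementation.

```python
# Illustrative sketch only: combine plane-parameter similarity, semantic
# agreement, and projection IoU into one association score.
import numpy as np

def association_score(obs, landmark, w_geo=0.4, w_sem=0.3, w_iou=0.3):
    """Higher score = more likely the observation matches the map landmark."""
    # Plane parameters: compare unit normals and plane-to-origin distances.
    normal_sim = float(np.dot(obs["normal"], landmark["normal"]))
    dist_sim = 1.0 - min(abs(obs["d"] - landmark["d"]), 1.0)
    geo = 0.5 * (max(normal_sim, 0.0) + dist_sim)
    # Semantic information: 1 if the predicted labels agree, else 0.
    sem = 1.0 if obs["label"] == landmark["label"] else 0.0
    # Projection IoU between the observed and reprojected plane masks.
    inter = np.logical_and(obs["mask"], landmark["mask"]).sum()
    union = np.logical_or(obs["mask"], landmark["mask"]).sum()
    iou = inter / union if union > 0 else 0.0
    return w_geo * geo + w_sem * sem + w_iou * iou
```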
arXiv Detail & Related papers (2024-02-09T01:34:26Z)
- Probabilistic contingent planning based on HTN for high-quality plans [8.23558342809427]
We propose a contingent Hierarchical Task Network (HTN) planner, named High-Quality Contingent Planner (HQCP).
HQCP generates high-quality plans in partially observable environments.
The HTN planning formalisms are extended to partial observability and evaluated with respect to cost.
arXiv Detail & Related papers (2023-08-14T03:55:14Z)
- EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [95.37585041654535]
Embodied AI is capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments.
In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI.
Experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering.
arXiv Detail & Related papers (2023-05-24T11:04:30Z)
- Human-instructed Deep Hierarchical Generative Learning for Automated Urban Planning [57.91323079939641]
We develop a novel human-instructed deep hierarchical generative model to generate optimal urban plans.
The first stage is to label the grids of a target area with latent functionalities to discover functional zones.
The second stage is to perceive the planning requirements to form urban functionality projections.
The third stage is to leverage multi-attentions to model the zone-zone peer dependencies of the functionality projections to generate grid-level land-use configurations.
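A minimal sketch of the third stage's idea, modeling zone-zone peer dependencies with multi-head attention over zone embeddings, is given below; the dimensions, module layout, and output head are placeholder assumptions rather than the paper's architecture.

```python
# Illustrative sketch only: zone-zone attention over functionality projections
# followed by a per-zone land-use prediction head.
import torch
import torch.nn as nn

class ZoneDependencyBlock(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_land_use=10):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.to_land_use = nn.Linear(d_model, n_land_use)

    def forward(self, zone_emb):
        # zone_emb: (batch, n_zones, d_model) functionality projections
        ctx, _ = self.attn(zone_emb, zone_emb, zone_emb)  # zone-zone attention
        return self.to_land_use(ctx)                      # per-zone land-use logits
```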
arXiv Detail & Related papers (2022-12-01T23:06:41Z)