Can a Large Language Model Assess Urban Design Quality? Evaluating Walkability Metrics Across Expertise Levels
- URL: http://arxiv.org/abs/2504.21040v1
- Date: Mon, 28 Apr 2025 09:41:17 GMT
- Title: Can a Large Language Model Assess Urban Design Quality? Evaluating Walkability Metrics Across Expertise Levels
- Authors: Chenyi Cai, Kosuke Kuriyama, Youlong Gu, Filip Biljecki, Pieter Herthogs
- Abstract summary: Urban environments are vital to supporting human activity in public spaces. The emergence of big data, such as street view images (SVIs), combined with multimodal large language models (MLLMs), is transforming how researchers and practitioners investigate, measure, and evaluate urban environments. This study explores the extent to which the integration of expert knowledge can influence the performance of MLLMs in evaluating the quality of urban design.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Urban street environments are vital to supporting human activity in public spaces. The emergence of big data, such as street view images (SVIs) combined with multimodal large language models (MLLMs), is transforming how researchers and practitioners investigate, measure, and evaluate semantic and visual elements of urban environments. Considering the low threshold for creating automated evaluative workflows using MLLMs, it is crucial to explore both the risks and opportunities associated with these probabilistic models. In particular, the extent to which the integration of expert knowledge can influence the performance of MLLMs in evaluating the quality of urban design has not been fully explored. This study sets out an initial exploration of how integrating more formal and structured representations of expert urban design knowledge into the input prompts of an MLLM (ChatGPT-4) can enhance the model's capability and reliability in evaluating the walkability of built environments using SVIs. We collect walkability metrics from the existing literature and categorize them using relevant ontologies. We then select a subset of these metrics, focusing on the subthemes of pedestrian safety and attractiveness, and develop prompts for the MLLM accordingly. We analyze the MLLM's ability to evaluate SVI walkability subthemes through prompts with varying levels of clarity and specificity regarding evaluation criteria. Our experiments demonstrate that MLLMs are capable of providing assessments and interpretations based on general knowledge and can support the automation of multimodal image-text evaluations. However, they generally provide more optimistic scores and can make mistakes when interpreting the provided metrics, resulting in incorrect evaluations. By integrating expert knowledge, the MLLM's evaluative performance exhibits higher consistency and concentration.
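The workflow described in the abstract (prompting a multimodal LLM with a street view image plus walkability criteria at varying levels of specificity) can be sketched as below. This is a minimal illustrative sketch, not the authors' code: the model name (gpt-4o), the prompt wording, and the example safety metrics are assumptions added here for illustration; the study reports using ChatGPT-4 with prompts derived from walkability metrics collected from the literature.

```python
# Minimal sketch (assumptions noted above): scoring the walkability of a street
# view image (SVI) with a multimodal LLM via the OpenAI chat completions API.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical expert-derived criteria for the "pedestrian safety" subtheme.
SAFETY_METRICS = [
    "presence and continuity of sidewalks",
    "physical separation from vehicle traffic (buffers, bollards, parked cars)",
    "street lighting visible in the scene",
    "marked pedestrian crossings",
]

def score_walkability(image_path: str, use_expert_metrics: bool = True) -> str:
    """Ask the MLLM to rate pedestrian safety of an SVI on a 1-5 scale."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    if use_expert_metrics:
        # More specific prompt: structured expert criteria are listed explicitly.
        criteria = "Evaluate against these criteria:\n- " + "\n- ".join(SAFETY_METRICS)
    else:
        # Vaguer prompt: rely on the model's general knowledge only.
        criteria = "Use your general knowledge of walkable streets."

    prompt = (
        "You are assessing urban design quality from a street view image. "
        "Rate the pedestrian safety of this street on a scale of 1-5 and "
        f"justify the score. {criteria}"
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice; the study used ChatGPT-4
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Running both prompt conditions on the same image mirrors the paper's
# comparison of prompts with varying clarity and specificity:
# print(score_walkability("svi_001.jpg", use_expert_metrics=False))
# print(score_walkability("svi_001.jpg", use_expert_metrics=True))
```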
Related papers
- EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents [57.4686961979566]
EmbodiedEval is a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks. It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity. We evaluated state-of-the-art MLLMs on EmbodiedEval and found that they fall significantly short of human-level performance on embodied tasks.
arXiv Detail & Related papers (2025-01-21T03:22:10Z) - MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z) - Evaluating Cultural and Social Awareness of LLM Web Agents [113.49968423990616]
We introduce CASA, a benchmark designed to assess large language models' sensitivity to cultural and social norms.
Our approach evaluates LLM agents' ability to detect and appropriately respond to norm-violating user queries and observations.
Experiments show that current LLMs perform significantly better in non-agent environments.
arXiv Detail & Related papers (2024-10-30T17:35:44Z) - Surveying the MLLM Landscape: A Meta-Review of Current Surveys [17.372501468675303]
Multimodal Large Language Models (MLLMs) have become a transformative force in the field of artificial intelligence.
This survey aims to provide a systematic review of benchmark tests and evaluation methods for MLLMs.
arXiv Detail & Related papers (2024-09-17T14:35:38Z) - A Survey on Evaluation of Multimodal Large Language Models [11.572066870077888]
Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning system by integrating powerful Large Language Models (LLMs).
This framework endows MLLMs with human-like capabilities and suggests a potential pathway towards achieving artificial general intelligence (AGI).
With the emergence of all-round MLLMs like GPT-4V and Gemini, a multitude of evaluation methods have been developed to assess their capabilities across different dimensions.
arXiv Detail & Related papers (2024-08-28T13:05:55Z) - Visualization Literacy of Multimodal Large Language Models: A Comparative Study [12.367399155606162]
Multimodal large language models (MLLMs) combine the inherent power of large language models (LLMs) with the capability to reason about multimodal context.
Many recent works in visualization have demonstrated MLLMs' capability to understand and interpret visualization results and explain the content of the visualization to users in natural language.
In this work, we aim to fill the gap by utilizing the concept of visualization literacy to evaluate MLLMs.
arXiv Detail & Related papers (2024-06-24T17:52:16Z) - AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception [64.25808552299905]
AesBench is an expert benchmark aiming to comprehensively evaluate the aesthetic perception capacities of MLLMs.
We construct an Expert-labeled Aesthetics Perception Database (EAPD), which features diversified image contents and high-quality annotations provided by professional aesthetic experts.
We propose a set of integrative criteria to measure the aesthetic perception abilities of MLLMs from four perspectives, including Perception (AesP), Empathy (AesE), Assessment (AesA), and Interpretation (AesI).
arXiv Detail & Related papers (2024-01-16T10:58:07Z) - TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs [35.717370285231176]
Large language models (LLMs) have shown impressive capabilities across various natural language tasks.
We propose a comprehensive human evaluation framework to assess LLMs' proficiency in following instructions on diverse real-world tasks.
arXiv Detail & Related papers (2023-11-09T13:58:59Z) - Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision [85.6008224440157]
Multi-modality Large Language Models (MLLMs) have catalyzed a shift in computer vision from specialized models to general-purpose foundation models.
We present Q-Bench, a holistic benchmark crafted to evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment.
arXiv Detail & Related papers (2023-09-25T14:43:43Z) - A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)