WalkCLIP: Multimodal Learning for Urban Walkability Prediction
- URL: http://arxiv.org/abs/2511.21947v1
- Date: Wed, 26 Nov 2025 22:15:33 GMT
- Title: WalkCLIP: Multimodal Learning for Urban Walkability Prediction
- Authors: Shilong Xiang, JangHyeon Lee, Min Namgung, Yao-Yi Chiang
- Abstract summary: Urban walkability is a cornerstone of public health, sustainability, and quality of life. Recent studies have used satellite imagery, street view imagery, or population indicators to estimate walkability. We introduce WalkCLIP, a framework that integrates these complementary viewpoints to predict urban walkability.
- Score: 4.403122905236942
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Urban walkability is a cornerstone of public health, sustainability, and quality of life. Traditional walkability assessments rely on surveys and field audits, which are costly and difficult to scale. Recent studies have used satellite imagery, street view imagery, or population indicators to estimate walkability, but these single-source approaches capture only one dimension of the walking environment. Satellite data describe the built environment from above, but overlook the pedestrian perspective. Street view imagery captures conditions at the ground level, but lacks broader spatial context. Population dynamics reveal patterns of human activity but not the visual form of the environment. We introduce WalkCLIP, a multimodal framework that integrates these complementary viewpoints to predict urban walkability. WalkCLIP learns walkability-aware vision-language representations from GPT-4o generated image captions, refines these representations with a spatial aggregation module that incorporates neighborhood context, and fuses the resulting features with representations from a population dynamics foundation model. Evaluated at 4,660 locations throughout Minneapolis-Saint Paul, WalkCLIP outperforms unimodal and multimodal baselines in both predictive accuracy and spatial alignment. These results show that the integration of visual and behavioral signals yields reliable predictions of the walking environment.
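The pipeline described in the abstract lends itself to a compact sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' released code: it assumes precomputed CLIP-style image embeddings per location, averages each embedding with its spatial neighbors as a stand-in for the spatial aggregation module, and fuses the result with a population-dynamics embedding to regress a walkability score. All module names, dimensions, and the neighbor-averaging scheme are assumptions.

```python
# Hypothetical sketch of a WalkCLIP-style fusion head (not the authors' code).
# Assumes precomputed CLIP-style image embeddings per location plus a
# population-dynamics embedding; names and dimensions are illustrative.
import torch
import torch.nn as nn


class SpatialAggregation(nn.Module):
    """Mixes each location's embedding with the mean of its k spatial neighbors."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, feats: torch.Tensor, neighbor_idx: torch.Tensor) -> torch.Tensor:
        # feats: (N, dim); neighbor_idx: (N, k) indices of nearby locations
        neighbor_feats = feats[neighbor_idx].mean(dim=1)          # (N, dim)
        return self.proj(torch.cat([feats, neighbor_feats], dim=-1))


class WalkabilityHead(nn.Module):
    """Fuses visual and population-dynamics features into a walkability score."""

    def __init__(self, vis_dim: int = 512, pop_dim: int = 256):
        super().__init__()
        self.spatial = SpatialAggregation(vis_dim)
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim + pop_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, vis_feats, pop_feats, neighbor_idx):
        vis = self.spatial(vis_feats, neighbor_idx)               # neighborhood context
        return self.mlp(torch.cat([vis, pop_feats], dim=-1))      # (N, 1) scores


# Toy usage: 8 locations, 4 neighbors each.
head = WalkabilityHead()
vis = torch.randn(8, 512)
pop = torch.randn(8, 256)
nbrs = torch.randint(0, 8, (8, 4))
print(head(vis, pop, nbrs).shape)  # torch.Size([8, 1])
```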
Related papers
- Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes [0.9208007322096533]
This paper introduces SAGAI: Streetscape Analysis with Generative Artificial Intelligence.
It is a modular workflow for scoring street-level urban scenes using open-access data and vision-language models.
It operates without task-specific training or proprietary software dependencies.
arXiv Detail & Related papers (2025-04-23T09:08:06Z)
- StreetviewLLM: Extracting Geographic Information Using a Chain-of-Thought Multimodal Large Language Model [12.789465279993864]
Geospatial predictions are crucial for diverse fields such as disaster management, urban planning, and public health.
We propose StreetViewLLM, a novel framework that integrates a large language model with chain-of-thought reasoning and multimodal data sources.
The model has been applied to seven global cities, including Hong Kong, Tokyo, Singapore, Los Angeles, New York, London, and Paris.
arXiv Detail & Related papers (2024-11-19T05:15:19Z)
- Multimodal Contrastive Learning of Urban Space Representations from POI Data [2.695321027513952]
CaLLiPer (Contrastive Language-Location Pre-training) is a representation learning model that embeds continuous urban spaces into vector representations.
We validate CaLLiPer's effectiveness by applying it to learning urban space representations in London, UK.
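As a rough illustration of the contrastive language-location idea, the sketch below is hypothetical and not the CaLLiPer code: it aligns a small coordinate encoder with precomputed text embeddings of POI descriptions using a symmetric InfoNCE loss. The encoder architecture, embedding size, and temperature are assumptions.

```python
# Minimal sketch of contrastive location-text pre-training (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed coordinate encoder: normalized (lon, lat) -> 256-d embedding.
coord_encoder = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 256))

def clip_style_loss(coords, text_emb, temperature=0.07):
    # coords: (B, 2) normalized coordinates; text_emb: (B, 256) POI description embeddings
    loc = F.normalize(coord_encoder(coords), dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = loc @ txt.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(coords.size(0))           # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = clip_style_loss(torch.rand(16, 2), torch.randn(16, 256))
loss.backward()  # pulls the coordinate encoder toward the text embedding space
```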
arXiv Detail & Related papers (2024-11-09T16:24:07Z)
- Towards Zero-Shot Annotation of the Built Environment with Vision-Language Models (Vision Paper) [8.071443524030302]
Equitable urban transportation applications require high-fidelity digital representations of the built environment.
We consider vision language models as a mechanism for annotating diverse urban features from satellite images.
We demonstrate a proof-of-concept combining a state-of-the-art vision language model and variants of a prompting strategy.
arXiv Detail & Related papers (2024-08-01T21:50:23Z)
- CityX: Controllable Procedural Content Generation for Unbounded 3D Cities [50.10101235281943]
Current generative methods fall short in either diversity, controllability, or fidelity.
In this work, we resort to the procedural content generation (PCG) technique for high-fidelity generation.
We develop a multi-agent framework to transform multi-modal instructions, including OSM, semantic maps, and satellite images, into executable programs.
Our method, named CityX, demonstrates its superiority in creating diverse, controllable, and realistic 3D urban scenes.
arXiv Detail & Related papers (2024-07-24T18:05:13Z)
- MetaUrban: An Embodied AI Simulation Platform for Urban Micromobility [52.0930915607703]
Recent advances in Robotics and Embodied AI make public urban spaces no longer exclusive to humans.
Micromobility enabled by AI for short-distance travel in public urban spaces plays a crucial role in the future transportation system.
We present MetaUrban, a compositional simulation platform for AI-driven urban micromobility research.
arXiv Detail & Related papers (2024-07-11T17:56:49Z)
- Social-Transmotion: Promptable Human Trajectory Prediction [65.80068316170613]
Social-Transmotion is a generic Transformer-based model that exploits diverse and numerous visual cues to predict human behavior.
Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.
arXiv Detail & Related papers (2023-12-26T18:56:49Z)
- VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z)
- Conditioned Human Trajectory Prediction using Iterative Attention Blocks [70.36888514074022]
We present a simple yet effective pedestrian trajectory prediction model aimed at predicting pedestrian positions in urban-like environments.
Our model is a neural architecture that runs several layers of attention blocks and transformers in an iterative, sequential fashion.
We show that without explicitly introducing social masks, dynamical models, social pooling layers, or complicated graph-like structures, it is possible to produce results on par with SoTA models.
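A minimal sketch of this style of model (illustrative only, not the paper's implementation) might re-apply a single self-attention block over embedded past positions and decode future offsets from the final step. The dimensions, iteration count, and prediction horizon below are assumptions.

```python
# Illustrative iterative-attention trajectory predictor (not the paper's code).
import torch
import torch.nn as nn


class IterativeAttentionPredictor(nn.Module):
    def __init__(self, dim: int = 64, iters: int = 3, horizon: int = 12):
        super().__init__()
        self.embed = nn.Linear(2, dim)                        # (x, y) -> feature
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, horizon * 2)               # future (x, y) offsets
        self.iters = iters
        self.horizon = horizon

    def forward(self, past: torch.Tensor) -> torch.Tensor:
        # past: (B, T, 2) observed positions
        h = self.embed(past)
        for _ in range(self.iters):                           # iterative refinement
            attn_out, _ = self.attn(h, h, h)
            h = self.norm(h + attn_out)
        pred = self.head(h[:, -1])                            # decode from last time step
        return pred.view(-1, self.horizon, 2)


model = IterativeAttentionPredictor()
print(model(torch.randn(4, 8, 2)).shape)  # torch.Size([4, 12, 2])
```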
arXiv Detail & Related papers (2022-06-29T07:49:48Z)
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art methods specifically designed for the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
- Urban2Vec: Incorporating Street View Imagery and POIs for Multi-Modal Urban Neighborhood Embedding [8.396746290518102]
Urban2Vec is an unsupervised multi-modal framework which incorporates both street view imagery and point-of-interest data.
We show that Urban2Vec can achieve performance better than baseline models and comparable to fully-supervised methods in downstream prediction tasks.
arXiv Detail & Related papers (2020-01-29T21:30:53Z)
- Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling [65.99956848461915]
Vision-and-Language Navigation (VLN) is a task where agents must decide how to move through a 3D environment to reach a goal.
One of the problems of the VLN task is data scarcity since it is difficult to collect enough navigation paths with human-annotated instructions for interactive environments.
We propose an adversarial-driven counterfactual reasoning model that can consider effective conditions instead of low-quality augmented data.
arXiv Detail & Related papers (2019-11-17T18:02:51Z)