Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics
- URL: http://arxiv.org/abs/2506.05087v1
- Date: Thu, 05 Jun 2025 14:34:04 GMT
- Title: Interpretable Multimodal Framework for Human-Centered Street Assessment: Integrating Visual-Language Models for Perceptual Urban Diagnostics
- Authors: HaoTian Lan,
- Abstract summary: This study introduces a novel Multimodal Street Evaluation Framework (MSEF)<n>We fine-tune the framework using LoRA and P-Tuning v2 for parameter-efficient adaptation.<n>The model achieves an F1 score of 0.84 on objective features and 89.3 percent agreement with aggregated resident perceptions.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While objective street metrics derived from imagery or GIS have become standard in urban analytics, they remain insufficient to capture subjective perceptions essential to inclusive urban design. This study introduces a novel Multimodal Street Evaluation Framework (MSEF) that fuses a vision transformer (VisualGLM-6B) with a large language model (GPT-4), enabling interpretable dual-output assessment of streetscapes. Leveraging over 15,000 annotated street-view images from Harbin, China, we fine-tune the framework using LoRA and P-Tuning v2 for parameter-efficient adaptation. The model achieves an F1 score of 0.84 on objective features and 89.3 percent agreement with aggregated resident perceptions, validated across stratified socioeconomic geographies. Beyond classification accuracy, MSEF captures context-dependent contradictions: for instance, informal commerce boosts perceived vibrancy while simultaneously reducing pedestrian comfort. It also identifies nonlinear and semantically contingent patterns -- such as the divergent perceptual effects of architectural transparency across residential and commercial zones -- revealing the limits of universal spatial heuristics. By generating natural-language rationales grounded in attention mechanisms, the framework bridges sensory data with socio-affective inference, enabling transparent diagnostics aligned with SDG 11. This work offers both methodological innovation in urban perception modeling and practical utility for planning systems seeking to reconcile infrastructural precision with lived experience.
Related papers
- Demographics-Informed Neural Network for Multi-Modal Spatiotemporal forecasting of Urban Growth and Travel Patterns Using Satellite Imagery [3.378738346115004]
This study presents a novel demographics informed deep learning framework designed to forecast urban spatial transformations.<n>The proposed model employs an encoder-decoder architecture with temporal gated residual connections, integrating satellite imagery and demographic data.<n>The study also contributes a comprehensive multimodal dataset pairing satellite imagery sequences with corresponding demographic and travel behavior attributes.
arXiv Detail & Related papers (2025-06-14T11:33:40Z) - Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts [79.18608192761512]
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to enable their visual recognition processes more interpretable.<n>We propose a Few-Shot Prototypical Concept Classification framework that mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment.<n>Our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot classification.
arXiv Detail & Related papers (2025-06-05T06:39:43Z) - Image-based Visibility Analysis Replacing Line-of-Sight Simulation: An Urban Landmark Perspective [2.3315115235829342]
The study challenges the traditional LoS-based approaches by introducing a new, image-based visibility analysis method.<n>In the first case study, the method proves its reliability in detecting the visibility of six tall landmark constructions in global cities, with an overall accuracy of 87%.<n>In the second case, the proposed visibility graph uncovers the form and strength of connections for multiple landmarks along the River Thames in London.
arXiv Detail & Related papers (2025-05-17T03:41:45Z) - Towards Intelligent Transportation with Pedestrians and Vehicles In-the-Loop: A Surveillance Video-Assisted Federated Digital Twin Framework [62.47416496137193]
We propose a surveillance video assisted federated digital twin (SV-FDT) framework to empower ITSs with pedestrians and vehicles in-the-loop.<n>The architecture consists of three layers: (i) the end layer, which collects traffic surveillance videos from multiple sources; (ii) the edge layer, responsible for semantic segmentation-based visual understanding, twin agent-based interaction modeling, and local digital twin system (LDTS) creation in local regions; and (iii) the cloud layer, which integrates LDTSs across different regions to construct a global DT model in realtime.
arXiv Detail & Related papers (2025-03-06T07:36:06Z) - Collaborative Imputation of Urban Time Series through Cross-city Meta-learning [54.438991949772145]
We propose a novel collaborative imputation paradigm leveraging meta-learned implicit neural representations (INRs)<n>We then introduce a cross-city collaborative learning scheme through model-agnostic meta learning.<n>Experiments on a diverse urban dataset from 20 global cities demonstrate our model's superior imputation performance and generalizability.
arXiv Detail & Related papers (2025-01-20T07:12:40Z) - Potential Field as Scene Affordance for Behavior Change-Based Visual Risk Object Identification [4.896236083290351]
We study behavior change-based visual risk object identification (Visual-ROI)<n>Existing methods often show significant limitations in spatial accuracy and temporal consistency.<n>We propose a new framework with a Bird's Eye View representation to overcome these challenges.
arXiv Detail & Related papers (2024-09-24T08:17:50Z) - RING#: PR-by-PE Global Localization with Roto-translation Equivariant Gram Learning [20.688641105430467]
Global localization is crucial in autonomous driving and robotics applications when GPS signals are unreliable.
Most approaches achieve global localization by sequential place recognition (PR) and pose estimation (PE)
We introduce a new paradigm, PR-by-PE localization, which bypasses the need for separate place recognition by directly deriving it from pose estimation.
We propose RING#, an end-to-end PR-by-PE localization network that operates in the bird's-eye-view (BEV) space, compatible with both vision and LiDAR sensors.
arXiv Detail & Related papers (2024-08-30T18:42:53Z) - Heterogeneous Graph Neural Networks with Post-hoc Explanations for Multi-modal and Explainable Land Use Inference [11.753345219488745]
We introduce an explainable framework for inferring land use that synergises heterogeneous graph neural networks (HGNs) with Explainable AI techniques.
Experiments demonstrate that the proposed HGNs significantly outperform baseline graph neural networks for all six land-use indicators.
These analyses demonstrate that the proposed HGNs can suitably support urban stakeholders in their urban planning and policy-making.
arXiv Detail & Related papers (2024-06-19T17:39:10Z) - Contrastive Region Guidance: Improving Grounding in Vision-Language
Models without Training [79.27663870280038]
We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source vision-language models to respond to visual prompts.
When region annotations are provided, CRG increases absolute accuracy by up to 11.1% on ViP-Bench.
We also show CRG's applicability to spatial reasoning, with 10% improvement on What'sUp.
arXiv Detail & Related papers (2024-03-04T18:55:30Z) - RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering
Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z) - Reinforcement Learning with Human Feedback for Realistic Traffic
Simulation [53.85002640149283]
Key element of effective simulation is the incorporation of realistic traffic models that align with human knowledge.
This study identifies two main challenges: capturing the nuances of human preferences on realism and the unification of diverse traffic simulation models.
arXiv Detail & Related papers (2023-09-01T19:29:53Z) - Mitigating Urban-Rural Disparities in Contrastive Representation Learning with Satellite Imagery [19.93324644519412]
We consider the risk of urban-rural disparities in identification of land-cover features.
We propose fair dense representation with contrastive learning (FairDCL) as a method for de-biasing the multi-level latent space of convolution neural network models.
The obtained image representation mitigates downstream urban-rural prediction disparities and outperforms state-of-the-art baselines on real-world satellite images.
arXiv Detail & Related papers (2022-11-16T04:59:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.