From Static to Dynamic: Evaluating the Perceptual Impact of Dynamic Elements in Urban Scenes via MLLM-Guided Generative Inpainting
- URL: http://arxiv.org/abs/2512.24513v2
- Date: Thu, 01 Jan 2026 21:40:37 GMT
- Title: From Static to Dynamic: Evaluating the Perceptual Impact of Dynamic Elements in Urban Scenes via MLLM-Guided Generative Inpainting
- Authors: Zhiwei Wei, Mengzi Zhang, Boyan Lu, Zhitao Deng, Nai Yang, Hua Liao
- Abstract summary: Most existing studies treat urban scenes as static and largely ignore the role of dynamic elements such as pedestrians and vehicles. We propose a framework that isolates the perceptual effects of dynamic elements using semantic segmentation and MLLM-guided generative inpainting. We trained 11 machine learning models using multimodal visual features and identified that lighting conditions, human presence, and depth variation were key factors driving perceptual change.
- Score: 1.3947005195255644
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding urban perception from street view imagery has become a central topic in urban analytics and human centered urban design. However, most existing studies treat urban scenes as static and largely ignore the role of dynamic elements such as pedestrians and vehicles, raising concerns about potential bias in perception based urban analysis. To address this issue, we propose a controlled framework that isolates the perceptual effects of dynamic elements by constructing paired street view images with and without pedestrians and vehicles using semantic segmentation and MLLM guided generative inpainting. Based on 720 paired images from Dongguan, China, a perception experiment was conducted in which participants evaluated original and edited scenes across six perceptual dimensions. The results indicate that removing dynamic elements leads to a consistent 30.97% decrease in perceived vibrancy, whereas changes in other dimensions are more moderate and heterogeneous. To further explore the underlying mechanisms, we trained 11 machine learning models using multimodal visual features and identified that lighting conditions, human presence, and depth variation were key factors driving perceptual change. At the individual level, 65% of participants exhibited significant vibrancy changes, compared with 35-50% for other dimensions; gender further showed a marginal moderating effect on safety perception. Beyond controlled experiments, the trained model was extended to a city-scale dataset to predict vibrancy changes after the removal of dynamic elements. The city level results reveal that such perceptual changes are widespread and spatially structured, affecting 73.7% of locations and 32.1% of images, suggesting that urban perception assessments based solely on static imagery may substantially underestimate urban liveliness.
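The abstract reports a 30.97% decrease in perceived vibrancy but does not say how that figure was computed. As a minimal sketch, assuming it is the relative change in mean rating between paired original and edited images, the calculation could look like the following. The function name `percent_change` and the ratings are hypothetical and not taken from the paper.

```python
from statistics import mean


def percent_change(original, edited):
    """Relative change (in %) of the mean edited score versus the mean original score.

    Assumption: scores are paired, so original[i] and edited[i] rate the same scene.
    """
    if len(original) != len(edited):
        raise ValueError("paired score lists must have equal length")
    base = mean(original)
    if base == 0:
        raise ValueError("mean original score is zero; relative change is undefined")
    return 100.0 * (mean(edited) - base) / base


# Hypothetical vibrancy ratings for three image pairs (made-up numbers).
original_scores = [4.0, 5.0, 3.0]
edited_scores = [3.0, 3.0, 3.0]

change = percent_change(original_scores, edited_scores)
print(f"{change:.2f}%")  # prints "-25.00%"
```

A negative value means the edited scenes, with pedestrians and vehicles removed, were rated lower on average than the originals.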
Related papers
- LightCity: An Urban Dataset for Outdoor Inverse Rendering and Reconstruction under Multi-illumination Conditions [80.70675855203154]
Inverse rendering in urban scenes is pivotal for applications like autonomous driving and digital twins. Yet, it faces significant challenges due to complex illumination conditions, including multi-illumination and indirect light and shadow effects. We present LightCity, a novel high-quality synthetic urban dataset featuring diverse illumination conditions with realistic indirect light and shadow effects.
arXiv Detail & Related papers (2026-02-01T09:37:00Z)
- Dynamic Avatar-Scene Rendering from Human-centric Context [75.95641456716373]
We propose the Separate-then-Map (StM) strategy to bridge separately defined and optimized models. StM significantly outperforms existing state-of-the-art methods in both visual quality and rendering accuracy.
arXiv Detail & Related papers (2025-11-13T17:39:06Z)
- Do Street View Imagery and Public Participation GIS align: Comparative Analysis of Urban Attractiveness [0.0]
Street View Imagery (SVI) and Public Participation GIS (PPGIS) represent two prominent approaches for capturing place-based perceptions. This study investigates the alignment between SVI-based perceived attractiveness and residents' reported experiences gathered via a city-wide PPGIS survey in Helsinki, Finland.
arXiv Detail & Related papers (2025-11-04T12:40:12Z)
- RoboView-Bias: Benchmarking Visual Bias in Embodied Agents for Robotic Manipulation [67.38036090822982]
We propose RoboView-Bias, the first benchmark specifically designed to quantify visual bias in robotic manipulation. We create 2,127 task instances that enable robust measurement of biases induced by individual visual factors and their interactions. Our results highlight that systematic analysis of visual bias is a prerequisite for developing safe and reliable general-purpose embodied agents.
arXiv Detail & Related papers (2025-09-26T13:53:25Z)
- UrbanFeel: A Comprehensive Benchmark for Temporal and Perceptual Understanding of City Scenes through Human Perspective [26.682345246235766]
UrbanFeel comprises 14.3K carefully constructed visual questions spanning three cognitively progressive dimensions. Gemini-2.5 Pro achieves the best overall performance, with its accuracy approaching human expert levels.
arXiv Detail & Related papers (2025-09-26T11:38:57Z)
- PIGUIQA: A Physical Imaging Guided Perceptual Framework for Underwater Image Quality Assessment [59.9103803198087]
We propose a Physical Imaging Guided perceptual framework for Underwater Image Quality Assessment (UIQA). By leveraging underwater radiative transfer theory, we integrate physics-based imaging estimations to establish quantitative metrics for these distortions. The proposed model accurately predicts image quality scores and achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-12-20T03:31:45Z)
- When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- CityPulse: Fine-Grained Assessment of Urban Change with Street View Time Series [12.621355888239359]
Urban transformations have profound societal impact on both individuals and communities at large.
We propose an end-to-end change detection model to effectively capture physical alterations in the built environment at scale.
Our approach has the potential to supplement existing datasets and serve as a fine-grained and accurate assessment of urban change.
arXiv Detail & Related papers (2024-01-02T08:57:09Z)
- City-Wide Perceptions of Neighbourhood Quality using Street View Images [5.340189314359048]
This paper describes our methodology, based in London, including collection of images and ratings, web development, model training and mapping.
Perceived neighbourhood quality is a core component of urban vitality, influencing social cohesion, sense of community, safety, activity and mental health of residents.
arXiv Detail & Related papers (2022-11-22T10:16:35Z)
- Drivable Volumetric Avatars using Texel-Aligned Features [52.89305658071045]
Photo telepresence requires both high-fidelity body modeling and faithful driving to enable dynamically synthesized appearance.
We propose an end-to-end framework that addresses two core challenges in modeling and driving full-body avatars of real people.
arXiv Detail & Related papers (2022-07-20T09:28:16Z)
- Learning Motion-Dependent Appearance for High-Fidelity Rendering of Dynamic Humans from a Single Camera [49.357174195542854]
A key challenge of learning the dynamics of the appearance lies in the requirement of a prohibitively large amount of observations.
We show that our method can generate a temporally coherent video of dynamic humans for unseen body poses and novel views given a single view video.
arXiv Detail & Related papers (2022-03-24T00:22:03Z)
- ACID: Action-Conditional Implicit Visual Dynamics for Deformable Object Manipulation [135.10594078615952]
We introduce ACID, an action-conditional visual dynamics model for volumetric deformable objects.
A benchmark contains over 17,000 action trajectories with six types of plush toys and 78 variants.
Our model achieves the best performance in geometry, correspondence, and dynamics predictions.
arXiv Detail & Related papers (2022-03-14T04:56:55Z)
- Quantifying urban streetscapes with deep learning: focus on aesthetic evaluation [4.129225533930966]
This paper reports the performance of our deep learning model on a unique data set prepared in Tokyo to recognize the areas covered by facades and billboards in streetscapes.
The model achieved 63.17% accuracy, measured by Intersection-over-Union (IoU).
arXiv Detail & Related papers (2021-06-29T12:51:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.