Scene-Aware Urban Design: A Human-AI Recommendation Framework Using Co-Occurrence Embeddings and Vision-Language Models
- URL: http://arxiv.org/abs/2511.06201v1
- Date: Sun, 09 Nov 2025 03:24:10 GMT
- Title: Scene-Aware Urban Design: A Human-AI Recommendation Framework Using Co-Occurrence Embeddings and Vision-Language Models
- Authors: Rodrigo Gallardo, Oz Fishman, Alexander Htet Kyaw
- Abstract summary: This paper introduces a human-in-the-loop computer vision framework that uses generative AI to propose micro-scale design interventions in public space.
- Score: 41.99844472131922
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a human-in-the-loop computer vision framework that uses generative AI to propose micro-scale design interventions in public space and support more continuous, local participation. Using Grounding DINO and a curated subset of the ADE20K dataset as a proxy for the urban built environment, the system detects urban objects and builds co-occurrence embeddings that reveal common spatial configurations. From this analysis, the user receives five statistically likely complements to a chosen anchor object. A vision language model then reasons over the scene image and the selected pair to suggest a third object that completes a more complex urban tactic. The workflow keeps people in control of selection and refinement and aims to move beyond top-down master planning by grounding choices in everyday patterns and lived experience.
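The co-occurrence step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the embeddings are built, so this version assumes each scene is already reduced to a list of detected object labels (for example, from a detector such as Grounding DINO) and ranks complements by the fraction of anchor scenes in which they also appear. The function names `build_cooccurrence` and `top_complements` and the toy scenes are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(scenes):
    """Count scenes per label and scenes per unordered label pair."""
    pair_counts = Counter()
    label_counts = Counter()
    for labels in scenes:
        unique = sorted(set(labels))  # ignore repeated detections in one scene
        label_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return pair_counts, label_counts


def top_complements(anchor, pair_counts, label_counts, k=5):
    """Rank other labels by the share of anchor scenes they also appear in."""
    n_anchor = label_counts.get(anchor, 0)
    if n_anchor == 0:
        return []
    scores = {}
    for (a, b), n in pair_counts.items():
        if a == anchor:
            scores[b] = n / n_anchor
        elif b == anchor:
            scores[a] = n / n_anchor
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


# Toy, made-up detections; real input would come from an object detector.
scenes = [
    ["bench", "tree", "streetlight"],
    ["bench", "tree", "trash can"],
    ["bench", "trash can"],
    ["tree", "streetlight"],
    ["bench", "tree", "planter"],
]

pair_counts, label_counts = build_cooccurrence(scenes)
for label, score in top_complements("bench", pair_counts, label_counts):
    print(f"{label}: {score:.2f}")
```

In this sketch the user would pick one of the returned labels to form the anchor-complement pair that a vision-language model then reasons over. A conditional frequency is only one possible ranking; pointwise mutual information or a learned embedding could be substituted.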
Related papers
- GOPLA: Generalizable Object Placement Learning via Synthetic Augmentation of Human Arrangement [16.549660613125877]
GOPLA is a hierarchical framework that learns generalizable object placement from augmented human demonstrations.
To overcome data scarcity, we introduce a scalable pipeline that expands human placement demonstrations into diverse synthetic training data.
Our approach improves placement success rates by 30.04 percentage points over the runner-up, evaluated on positioning accuracy and physical plausibility.
arXiv Detail & Related papers (2025-10-16T12:38:14Z) - MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes [49.89767522399176]
Group-level social interactions in public spaces are crucial for urban planning.
We introduce a social group region detection task, which requires inferring and spatially grounding visual regions defined by interpersonal relations.
We propose MINGLE, a modular three-stage pipeline that integrates human detection and depth estimation, VLM-based reasoning to classify pairwise social affiliation, and a lightweight spatial aggregation algorithm to localize socially connected groups.
We present a new dataset of 100K urban street-view images annotated with bounding boxes and labels for both individuals and socially interacting groups.
arXiv Detail & Related papers (2025-09-16T19:31:40Z) - Hestia: Hierarchical Next-Best-View Exploration for Systematic Intelligent Autonomous Data Collection [23.427212631082025]
This study introduces Hierarchical Next-Best-View Exploration for Systematic Intelligent Autonomous Data Collection (Hestia).
Hestia systematically defines the next-best-view task by proposing core components such as dataset choice, observation design, action space, reward calculation, and learning schemes.
Experimental results show that Hestia performs robustly across three datasets and translated object settings in the NVIDIA IsaacLab environment.
arXiv Detail & Related papers (2025-08-01T18:27:23Z) - Interest Networks (iNETs) for Cities: Cross-Platform Insights and Urban Behavior Explanations [0.0]
Location-Based Social Networks (LBSNs) provide a rich foundation for modeling urban behavior through iNETs (Interest Networks).
This study compares iNETs across platforms (Google Places and Foursquare) and spatial granularities, showing that coarser levels reveal more consistent cross-platform patterns.
We develop a multi-level, explainable recommendation system that predicts high-interest urban regions for different user types.
arXiv Detail & Related papers (2025-07-07T13:34:15Z) - Generative AI for Urban Planning: Synthesizing Satellite Imagery via Diffusion Models [9.385767746826286]
We adapt a state-of-the-art Stable Diffusion model, extended with ControlNet, to generate high-fidelity satellite imagery conditioned on land use descriptions, infrastructure, and natural environments.
Using data from three major U.S. cities, we demonstrate that the proposed diffusion model generates realistic and diverse urban landscapes by varying land-use configurations, road networks, and water bodies.
Our model achieves high FID and KID scores and demonstrates robustness across diverse urban contexts.
arXiv Detail & Related papers (2025-05-13T04:55:38Z) - MetaUrban: An Embodied AI Simulation Platform for Urban Micromobility [52.0930915607703]
Recent advances in Robotics and Embodied AI make public urban spaces no longer exclusive to humans.
Micromobility enabled by AI for short-distance travel in public urban spaces is a crucial component of the future transportation system.
We present MetaUrban, a compositional simulation platform for AI-driven urban micromobility research.
arXiv Detail & Related papers (2024-07-11T17:56:49Z) - OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments [77.0399450848749]
We propose an OccNeRF method for training occupancy networks without 3D supervision.
We parameterize the reconstructed occupancy fields and reorganize the sampling strategy to align with the cameras' infinite perceptive range.
For semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model.
arXiv Detail & Related papers (2023-12-14T18:58:52Z) - Human-instructed Deep Hierarchical Generative Learning for Automated Urban Planning [57.91323079939641]
We develop a novel human-instructed deep hierarchical generative model to generate optimal urban plans.
The first stage is to label the grids of a target area with latent functionalities to discover functional zones.
The second stage is to perceive the planning requirements to form urban functionality projections.
The third stage is to leverage multi-attentions to model the zone-zone peer dependencies of the functionality projections to generate grid-level land-use configurations.
arXiv Detail & Related papers (2022-12-01T23:06:41Z) - TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work, including state-of-the-art methods designed specifically for either the trajectory or the pose forecasting task.
arXiv Detail & Related papers (2021-04-08T20:01:00Z) - Reimagining City Configuration: Automated Urban Planning via Adversarial Learning [28.930624100994514]
Urban planning refers to the efforts of designing land-use configurations.
Recent advances in deep learning motivate us to ask: can machines learn, at a human level of capability, to calculate land-use configurations automatically and quickly?
arXiv Detail & Related papers (2020-08-22T21:15:39Z) - Future Urban Scenes Generation Through Vehicles Synthesis [90.1731992199415]
We propose a deep learning pipeline to predict the visual future appearance of an urban scene.
We follow a two-stage approach, where interpretable information is included in the loop and each actor is modelled independently.
We show the superiority of this approach over traditional end-to-end scene-generation methods on CityFlow.
arXiv Detail & Related papers (2020-07-01T08:40:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.