OpenCity3D: What do Vision-Language Models know about Urban Environments?
- URL: http://arxiv.org/abs/2503.16776v1
- Date: Fri, 21 Mar 2025 01:11:21 GMT
- Title: OpenCity3D: What do Vision-Language Models know about Urban Environments?
- Authors: Valentin Bieri, Marco Zamboni, Nicolas S. Blumer, Qingxuan Chen, Francis Engelmann,
- Abstract summary: Vision-language models (VLMs) show great promise for 3D scene understanding but are mainly applied to indoor spaces or autonomous driving. This work expands their use to urban-scale environments by leveraging 3D reconstructions from multi-view aerial imagery. We propose OpenCity3D, an approach that addresses high-level tasks such as population density estimation, building age classification, property price prediction, crime rate assessment, and noise pollution evaluation.
- Score: 4.468922393753096
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) show great promise for 3D scene understanding but are mainly applied to indoor spaces or autonomous driving, focusing on low-level tasks like segmentation. This work expands their use to urban-scale environments by leveraging 3D reconstructions from multi-view aerial imagery. We propose OpenCity3D, an approach that addresses high-level tasks, such as population density estimation, building age classification, property price prediction, crime rate assessment, and noise pollution evaluation. Our findings highlight OpenCity3D's impressive zero-shot and few-shot capabilities, showcasing adaptability to new contexts. This research establishes a new paradigm for language-driven urban analytics, enabling applications in planning, policy, and environmental monitoring. See our project page: opencity3d.github.io
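One plausible way to realize the zero-shot capability described in the abstract is a CLIP-style formulation: render views of a reconstructed building, embed them with a VLM, and compare the embeddings against natural-language prompts describing the attribute of interest. The sketch below is a minimal illustration under that assumption, not the OpenCity3D implementation; the model checkpoint, prompts, and helper function are hypothetical.

```python
# Minimal sketch of CLIP-style zero-shot scoring of rendered building views
# against urban-attribute text prompts. NOT the OpenCity3D implementation;
# model choice, prompts, and helper names are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical prompt pair for one high-level attribute (building age).
prompts = ["a photo of an old, historic building",
           "a photo of a new, modern building"]

def score_building_view(view_path: str) -> float:
    """Return the probability that a rendered view depicts an old building."""
    image = Image.open(view_path).convert("RGB")
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_prompts)
    return logits.softmax(dim=-1)[0, 0].item()

# Example: aggregate over several rendered views of the same building,
# e.g. age_score = mean(score_building_view(v) for v in view_paths).
```

Averaging the per-view scores over all views of a building is one simple way to turn image-level predictions into a building-level attribute estimate.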
Related papers
- EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding [63.99937807085461]
3D occupancy prediction provides a comprehensive description of the surrounding scenes. Most existing methods focus on offline perception from one or a few views. We formulate an embodied 3D occupancy prediction task to target this practical scenario and propose a Gaussian-based EmbodiedOcc framework to accomplish it.
arXiv Detail & Related papers (2024-12-05T17:57:09Z)
- Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy [68.50785963043161]
GemBench is a novel benchmark to assess generalization capabilities of vision-language robotic manipulation policies.
We present 3D-LOTUS++, a framework that integrates 3D-LOTUS's motion planning capabilities with the task planning capabilities of LLMs.
3D-LOTUS++ achieves state-of-the-art performance on novel tasks of GemBench, setting a new standard for generalization in robotic manipulation.
arXiv Detail & Related papers (2024-10-02T09:02:34Z)
- CityX: Controllable Procedural Content Generation for Unbounded 3D Cities [50.10101235281943]
Current generative methods fall short in either diversity, controllability, or fidelity. In this work, we resort to the procedural content generation (PCG) technique for high-fidelity generation. We develop a multi-agent framework to transform multi-modal instructions, including OSM, semantic maps, and satellite images, into executable programs. Our method, named CityX, demonstrates its superiority in creating diverse, controllable, and realistic 3D urban scenes.
arXiv Detail & Related papers (2024-07-24T18:05:13Z)
- 3D Question Answering for City Scene Understanding [12.433903847890322]
3D multimodal question answering (MQA) plays a crucial role in scene understanding by enabling intelligent agents to comprehend their surroundings in 3D environments.
We introduce a novel 3D MQA dataset named City-3DQA for city-level scene understanding.
A new benchmark is reported, and our proposed Sg-CityU achieves accuracies of 63.94% and 63.76% in different settings of City-3DQA.
arXiv Detail & Related papers (2024-07-24T16:22:27Z)
- CityCraft: A Real Crafter for 3D City Generation [25.7885801163556]
CityCraft is an innovative framework designed to enhance both the diversity and quality of urban scene generation.
Our approach integrates three key stages: initially, a diffusion transformer (DiT) model is deployed to generate diverse and controllable 2D city layouts.
Based on the generated layout and city plan, we utilize the asset retrieval module and Blender for precise asset placement and scene construction.
arXiv Detail & Related papers (2024-06-07T14:49:00Z)
- Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
Per-task instruction-following templates are employed to ensure naturalness and diversity in translating 3D vision tasks into language formats.
arXiv Detail & Related papers (2024-05-16T18:03:41Z)
- HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting [53.6394928681237]
Holistic understanding of urban scenes based on RGB images is a challenging yet important problem.
Our main idea involves the joint optimization of geometry, appearance, semantics, and motion using a combination of static and dynamic 3D Gaussians.
Our approach offers the ability to render new viewpoints in real-time, yielding 2D and 3D semantic information with high accuracy.
arXiv Detail & Related papers (2024-03-19T13:39:05Z)
- CityRefer: Geography-aware 3D Visual Grounding Dataset on City-scale Point Cloud Data [15.526523262690965]
We introduce the CityRefer dataset for city-level visual grounding.
The dataset consists of 35k natural language descriptions of 3D objects appearing in SensatUrban city scenes and 5k landmark labels synchronized with OpenStreetMap.
arXiv Detail & Related papers (2023-10-28T18:05:32Z)
- ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning [125.90002884194838]
ConceptGraphs is an open-vocabulary graph-structured representation for 3D scenes.
It is built by leveraging 2D foundation models and fusing their outputs into 3D via multi-view association.
We demonstrate the utility of this representation through a number of downstream planning tasks.
arXiv Detail & Related papers (2023-09-28T17:53:38Z)
- CityDreamer: Compositional Generative Model of Unbounded 3D Cities [44.203932215464214]
CityDreamer is a compositional generative model designed specifically for unbounded 3D cities.
We adopt the bird's eye view scene representation and employ a volumetric renderer for both instance-oriented and stuff-oriented neural fields.
CityDreamer achieves state-of-the-art performance not only in generating realistic 3D cities but also in localized editing within the generated cities.
arXiv Detail & Related papers (2023-09-01T17:57:02Z)
- Building3D: An Urban-Scale Dataset and Benchmarks for Learning Roof Structures from Point Clouds [4.38301148531795]
Existing datasets for 3D modeling mainly focus on common objects such as furniture or cars.
We present an urban-scale dataset consisting of more than 160 thousand buildings, along with corresponding point cloud, mesh, and wireframe models, covering 16 cities in Estonia over about 998 km².
Experimental results indicate that Building3D poses challenges of high intra-class variance, data imbalance, and large-scale noise.
arXiv Detail & Related papers (2023-07-21T21:38:57Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.