CymbaDiff: Structured Spatial Diffusion for Sketch-based 3D Semantic Urban Scene Generation
- URL: http://arxiv.org/abs/2510.13245v2
- Date: Thu, 16 Oct 2025 03:29:06 GMT
- Title: CymbaDiff: Structured Spatial Diffusion for Sketch-based 3D Semantic Urban Scene Generation
- Authors: Li Liang, Bo Miao, Xinyu Wang, Naveed Akhtar, Jordan Vice, Ajmal Mian
- Abstract summary: We introduce SketchSem3D, the first large-scale benchmark for generating 3D outdoor semantic scenes from freehand sketches and satellite images. We also propose Cylinder Mamba Diffusion (CymbaDiff), which significantly enhances spatial coherence in scene generation. CymbaDiff achieves superior semantic consistency, spatial realism, and cross-dataset generalization.
- Score: 55.74642848285121
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Outdoor 3D semantic scene generation produces realistic and semantically rich environments for applications such as urban simulation and autonomous driving. However, advances in this direction are constrained by the absence of publicly available, well-annotated datasets. We introduce SketchSem3D, the first large-scale benchmark for generating 3D outdoor semantic scenes from abstract freehand sketches and pseudo-labeled annotations of satellite images. SketchSem3D includes two subsets, Sketch-based SemanticKITTI and Sketch-based KITTI-360 (containing LiDAR voxels along with their corresponding sketches and annotated satellite images), to enable standardized, rigorous, and diverse evaluations. We also propose Cylinder Mamba Diffusion (CymbaDiff) that significantly enhances spatial coherence in outdoor 3D scene generation. CymbaDiff imposes structured spatial ordering, explicitly captures cylindrical continuity and vertical hierarchy, and preserves both physical neighborhood relationships and global context within the generated scenes. Extensive experiments on SketchSem3D demonstrate that CymbaDiff achieves superior semantic consistency, spatial realism, and cross-dataset generalization. The code and dataset will be available at https://github.com/Lillian-research-hub/CymbaDiff
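The abstract's "structured spatial ordering", "cylindrical continuity", and "vertical hierarchy" can be pictured as a serialization of the voxel grid before a Mamba-style sequence scan. The sketch below is a hypothetical illustration, not the authors' released code: it converts occupied voxel coordinates to cylindrical coordinates around a scene center and sorts them by height, then radius, then azimuth. The function name, the scene center, and the sort priority are assumptions made only for illustration.

```python
# Hypothetical sketch (not the authors' code): a cylinder-ordered
# serialization of occupied voxels, the kind of structured spatial ordering
# a Mamba-style sequence model could scan.
import numpy as np

def cylinder_order(voxel_coords: np.ndarray, center_xy=(0.0, 0.0)) -> np.ndarray:
    """Return indices that sort voxels by (height, radius, azimuth).

    voxel_coords: (N, 3) array of x, y, z voxel positions.
    The ordering keeps vertical layers together (vertical hierarchy) and,
    within a layer, walks outward ring by ring (cylindrical continuity).
    """
    x = voxel_coords[:, 0] - center_xy[0]
    y = voxel_coords[:, 1] - center_xy[1]
    z = voxel_coords[:, 2]
    radius = np.sqrt(x**2 + y**2)
    azimuth = np.mod(np.arctan2(y, x), 2 * np.pi)  # wrap to [0, 2*pi)
    # lexsort treats the last key as primary: sort by z, then radius, then azimuth
    return np.lexsort((azimuth, radius, z))

# Example: serialize a toy set of occupied voxels; a sequence model would
# then consume them in this order.
coords = np.array([[2, 0, 1], [0, 2, 0], [-1, -1, 1], [3, 3, 0]], dtype=float)
order = cylinder_order(coords)
print(coords[order])
```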
Related papers
- SceneLinker: Compositional 3D Scene Generation via Semantic Scene Graph from RGB Sequences [12.771171646896468]
We introduce SceneLinker, a framework that generates compositional 3D scenes via a semantic scene graph from RGB sequences. Our work enables users to generate consistent 3D spaces from their physical environments via scene graphs, allowing them to create spatial Mixed Reality (MR) content.
arXiv Detail & Related papers (2026-02-03T01:22:07Z) - Top2Ground: A Height-Aware Dual Conditioning Diffusion Model for Robust Aerial-to-Ground View Generation [14.377332218510743]
Top2Ground is a novel diffusion-based method that directly generates ground-view images from aerial input images. We condition the denoising process on a joint representation of VAE-encoded spatial features. Top2Ground can robustly handle both wide and narrow fields of view, highlighting its strong generalization capabilities.
arXiv Detail & Related papers (2025-11-11T13:53:07Z) - SPATIALGEN: Layout-guided 3D Indoor Scene Generation [37.30623176278608]
We present SpatialGen, a novel multi-view multi-modal diffusion model that generates realistic and semantically consistent 3D indoor scenes. Given a 3D layout and a reference image, our model synthesizes appearance (color image), geometry (scene coordinate map), and semantics (semantic segmentation map) from arbitrary viewpoints. We are open-sourcing our data and models to empower the community and advance the field of indoor scene understanding and generation.
arXiv Detail & Related papers (2025-09-18T14:12:32Z) - EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis [61.1662426227688]
Existing NeRF and 3DGS-based methods show promising results in achieving photorealistic renderings but require slow, per-scene optimization. We introduce EVolSplat, an efficient 3D Gaussian Splatting model for urban scenes that works in a feed-forward manner.
arXiv Detail & Related papers (2025-03-26T02:47:27Z) - GaussRender: Learning 3D Occupancy with Gaussian Rendering [86.89653628311565]
GaussRender is a module that improves 3D occupancy learning by enforcing projective consistency. Our method penalizes 3D configurations that produce inconsistent 2D projections, thereby enforcing a more coherent 3D structure.
arXiv Detail & Related papers (2025-02-07T16:07:51Z) - Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding [59.51535163599723]
FreeGS is an unsupervised semantic-embedded 3DGS framework that achieves view-consistent 3D scene understanding without the need for 2D labels. FreeGS performs comparably to state-of-the-art methods while avoiding the complex data preprocessing workload.
arXiv Detail & Related papers (2024-11-29T08:52:32Z) - HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting [53.6394928681237]
Holistic understanding of urban scenes based on RGB images is a challenging yet important problem.
Our main idea involves the joint optimization of geometry, appearance, semantics, and motion using a combination of static and dynamic 3D Gaussians.
Our approach offers the ability to render new viewpoints in real-time, yielding 2D and 3D semantic information with high accuracy.
arXiv Detail & Related papers (2024-03-19T13:39:05Z) - Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion [77.34078223594686]
We propose a novel architecture for direct 3D scene generation by introducing diffusion models into 3D sparse representations and combining them with neural rendering techniques.
Specifically, our approach first generates texture colors at the point level for a given geometry using a 3D diffusion model, and then transforms them into a scene representation in a feed-forward manner.
Experiments on two city-scale datasets show that our model demonstrates proficiency in generating photo-realistic street-view image sequences and cross-view urban scenes from satellite imagery.
arXiv Detail & Related papers (2024-01-19T16:15:37Z) - Generating Visual Spatial Description via Holistic 3D Scene Understanding [88.99773815159345]
Visual spatial description (VSD) aims to generate texts that describe the spatial relations of the given objects within images.
With an external 3D scene extractor, we obtain the 3D objects and scene features for input images.
We construct a target object-centered 3D spatial scene graph (Go3D-S2G), such that we model the spatial semantics of target objects within the holistic 3D scenes.
arXiv Detail & Related papers (2023-05-19T15:53:56Z)