Vision-based 3D Semantic Scene Completion via Capture Dynamic Representations
- URL: http://arxiv.org/abs/2503.06222v1
- Date: Sat, 08 Mar 2025 13:49:43 GMT
- Title: Vision-based 3D Semantic Scene Completion via Capture Dynamic Representations
- Authors: Meng Wang, Fan Wu, Yunchuan Qin, Ruihui Li, Zhuo Tang, Kenli Li,
- Abstract summary: We propose CDScene: Vision-based Robust Semantic Scene Completion via Capturing Dynamic Representations.<n>We leverage a multimodal large-scale model to extract 2D explicit semantics and align them into 3D space.<n>We exploit the characteristics of monocular and stereo depth to decouple scene information into dynamic and static features.
- Score: 37.61183525419993
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The vision-based semantic scene completion task aims to predict dense geometric and semantic 3D scene representations from 2D images. However, the presence of dynamic objects in the scene seriously affects the accuracy of the model inferring 3D structures from 2D images. Existing methods simply stack multiple frames of image input to increase dense scene semantic information, but ignore the fact that dynamic objects and non-texture areas violate multi-view consistency and matching reliability. To address these issues, we propose a novel method, CDScene: Vision-based Robust Semantic Scene Completion via Capturing Dynamic Representations. First, we leverage a multimodal large-scale model to extract 2D explicit semantics and align them into 3D space. Second, we exploit the characteristics of monocular and stereo depth to decouple scene information into dynamic and static features. The dynamic features contain structural relationships around dynamic objects, and the static features contain dense contextual spatial information. Finally, we design a dynamic-static adaptive fusion module to effectively extract and aggregate complementary features, achieving robust and accurate semantic scene completion in autonomous driving scenarios. Extensive experimental results on the SemanticKITTI, SSCBench-KITTI360, and SemanticKITTI-C datasets demonstrate the superiority and robustness of CDScene over existing state-of-the-art methods.
Related papers
- Visibility-Uncertainty-guided 3D Gaussian Inpainting via Scene Conceptional Learning [63.94919846010485]
3D Gaussian inpainting (3DGI) is challenging in effectively leveraging complementary visual and semantic cues from multiple input views.
We propose a method that measures the visibility uncertainties of 3D points across different input views and uses them to guide 3DGI.
We build a novel 3DGI framework, VISTA, by integrating VISibility-uncerTainty-guided 3DGI with scene conceptuAl learning.
arXiv Detail & Related papers (2025-04-23T06:21:11Z) - HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation [50.206100327643284]
HiScene is a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation.
We generate 3D content that aligns with 2D representations while maintaining compositional structure.
arXiv Detail & Related papers (2025-04-17T16:33:39Z) - DENSER: 3D Gaussians Splatting for Scene Reconstruction of Dynamic Urban Environments [0.0]
We propose DENSER, a framework that significantly enhances the representation of dynamic objects.
The proposed approach significantly outperforms state-of-the-art methods by a wide margin.
arXiv Detail & Related papers (2024-09-16T07:11:58Z) - Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering [57.895846642868904]
We present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning.<n>voxelization infers per-object occupancy probabilities at individual spatial locations.<n>Our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids.
arXiv Detail & Related papers (2024-07-30T15:33:58Z) - Shape of Motion: 4D Reconstruction from a Single Video [51.04575075620677]
We introduce a method capable of reconstructing generic dynamic scenes, featuring explicit, full-sequence-long 3D motion.
We exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE3 motion bases.
Our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes.
arXiv Detail & Related papers (2024-07-18T17:59:08Z) - HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting [53.6394928681237]
holistic understanding of urban scenes based on RGB images is a challenging yet important problem.
Our main idea involves the joint optimization of geometry, appearance, semantics, and motion using a combination of static and dynamic 3D Gaussians.
Our approach offers the ability to render new viewpoints in real-time, yielding 2D and 3D semantic information with high accuracy.
arXiv Detail & Related papers (2024-03-19T13:39:05Z) - AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z) - DynaVol: Unsupervised Learning for Dynamic Scenes through Object-Centric
Voxelization [67.85434518679382]
We present DynaVol, a 3D scene generative model that unifies geometric structures and object-centric learning.
The key idea is to perform object-centric voxelization to capture the 3D nature of the scene.
voxel features evolve over time through a canonical-space deformation function, forming the basis for global representation learning.
arXiv Detail & Related papers (2023-04-30T05:29:28Z) - Empty Cities: a Dynamic-Object-Invariant Space for Visual SLAM [6.693607456009373]
We present a data-driven approach to obtain the static image of a scene, eliminating dynamic objects that might have been present at the time of traversing the scene with a camera.
We introduce an end-to-end deep learning framework to turn images of an urban environment into realistic static frames suitable for localization and mapping.
arXiv Detail & Related papers (2020-10-15T10:31:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.