Symphonize 3D Semantic Scene Completion with Contextual Instance Queries
- URL: http://arxiv.org/abs/2306.15670v2
- Date: Wed, 22 Nov 2023 08:49:44 GMT
- Title: Symphonize 3D Semantic Scene Completion with Contextual Instance Queries
- Authors: Haoyi Jiang and Tianheng Cheng and Naiyu Gao and Haoyang Zhang and
Tianwei Lin and Wenyu Liu and Xinggang Wang
- Abstract summary: 3D Semantic Scene Completion (SSC) has emerged as a nascent and pivotal undertaking in autonomous driving.
We present a novel paradigm termed Symphonies (Scene-from-Insts), which delves into the integration of instance queries to orchestrate 2D-to-3D reconstruction and 3D scene modeling.
- Score: 49.604907627254434
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: 3D Semantic Scene Completion (SSC) has emerged as a nascent and pivotal
undertaking in autonomous driving, aiming to predict voxel occupancy within
volumetric scenes. However, prevailing methodologies primarily focus on
voxel-wise feature aggregation, while neglecting instance semantics and scene
context. In this paper, we present a novel paradigm termed Symphonies
(Scene-from-Insts), which delves into the integration of instance queries to
orchestrate 2D-to-3D reconstruction and 3D scene modeling. Leveraging our
proposed Serial Instance-Propagated Attentions, Symphonies dynamically encodes
instance-centric semantics, facilitating intricate interactions between
image-based and volumetric domains. Simultaneously, Symphonies enables holistic
scene comprehension by capturing context through the efficient fusion of
instance queries, alleviating geometric ambiguity such as occlusion and
perspective errors through contextual scene reasoning. Experimental results
demonstrate that Symphonies achieves state-of-the-art performance on
challenging benchmarks SemanticKITTI and SSCBench-KITTI-360, yielding
remarkable mIoU scores of 15.04 and 18.58, respectively. These results showcase
the paradigm's promising advancements. The code is available at
https://github.com/hustvl/Symphonies.
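The abstract only names the mechanism, so the following is a minimal sketch of what instance-query cross-attention between the image and volumetric domains could look like. All module names, shapes, and the two serial attention stages are illustrative assumptions, not the API of the released repository.

```python
# Hypothetical sketch of instance-query cross-attention for SSC.
# Names, shapes, and the serial two-stage flow are assumptions for
# illustration; the authors' implementation lives at
# https://github.com/hustvl/Symphonies.
import torch
import torch.nn as nn

class InstanceQuerySSC(nn.Module):
    def __init__(self, num_queries=100, dim=128, num_classes=20):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)  # learnable instance queries
        self.img_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.vox_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, img_feats, vox_feats):
        # img_feats: (B, N_img, dim) flattened 2D image features
        # vox_feats: (B, N_vox, dim) flattened 3D voxel features
        q = self.queries.weight.unsqueeze(0).expand(img_feats.size(0), -1, -1)
        # Stage 1: queries gather instance-centric semantics from the image.
        q, _ = self.img_attn(q, img_feats, img_feats)
        # Stage 2: voxels attend to the queries, injecting instance and
        # scene context into the volumetric domain.
        vox, _ = self.vox_attn(vox_feats, q, q)
        return self.head(vox)  # per-voxel semantic logits

model = InstanceQuerySSC()
logits = model(torch.randn(2, 1024, 128), torch.randn(2, 4096, 128))
print(logits.shape)  # torch.Size([2, 4096, 20])
```

Serializing the two attentions is the point of the sketch: the queries act as a compact bottleneck that carries image semantics into the 3D volume instead of aggregating features voxel by voxel.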
Related papers
- CAGS: Open-Vocabulary 3D Scene Understanding with Context-Aware Gaussian Splatting [18.581169318975046]
3D Gaussian Splatting (3DGS) offers a powerful representation for scene reconstruction, but suffers from cross-view granularity inconsistency.
We propose Context-Aware Gaussian Splatting (CAGS), a novel framework that addresses this challenge by incorporating spatial context into 3DGS.
CAGS significantly improves 3D instance segmentation and reduces fragmentation errors on datasets like LERF-OVS and ScanNet.
arXiv Detail & Related papers (2025-04-16T09:20:03Z)
- Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction [24.82894136068243]
PanopticRecon++ is an end-to-end method that formulates panoptic reconstruction through a novel cross-attention perspective.
This perspective models the relationship between 3D instances (as queries) and the scene's 3D embedding field (as keys) through their attention map.
PanopticRecon++ shows competitive performance in terms of 3D and 2D segmentation and reconstruction performance on both simulation and real-world datasets.
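The query-key attention map described above admits a compact sketch. The function name, temperature parameter, and soft-assignment readout below are hypothetical, included only to make the formulation concrete:

```python
# Hypothetical sketch: soft instance assignment from a query/key
# attention map, with 3D instances as queries and the scene's
# embedding field as keys. Symbols are illustrative assumptions.
import torch

def instance_attention_map(queries, field, tau=1.0):
    # queries: (K, D) one embedding per 3D instance
    # field:   (N, D) per-point embeddings of the scene's 3D field
    logits = queries @ field.T / (queries.size(1) ** 0.5)
    return torch.softmax(logits / tau, dim=0)  # (K, N) soft assignment

attn = instance_attention_map(torch.randn(8, 64), torch.randn(500, 64))
labels = attn.argmax(dim=0)  # hard instance label per field point
```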
arXiv Detail & Related papers (2025-01-02T07:37:09Z)
- Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding [59.51535163599723]
FreeGS is an unsupervised semantic-embedded 3DGS framework that achieves view-consistent 3D scene understanding without the need for 2D labels.
We show that FreeGS performs comparably to state-of-the-art methods while avoiding the complex data preprocessing workload.
arXiv Detail & Related papers (2024-11-29T08:52:32Z)
- InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception [17.530797215534456]
3D scene understanding has become an essential area of research with applications in autonomous driving, robotics, and augmented reality.
We propose InstanceGaussian, a method that jointly learns appearance and semantic features while adaptively aggregating instances.
Our approach achieves state-of-the-art performance in category-agnostic, open-vocabulary 3D point-level segmentation.
arXiv Detail & Related papers (2024-11-28T16:08:36Z)
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
- HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting [53.6394928681237]
Holistic understanding of urban scenes based on RGB images is a challenging yet important problem.
Our main idea involves the joint optimization of geometry, appearance, semantics, and motion using a combination of static and dynamic 3D Gaussians.
Our approach offers the ability to render new viewpoints in real-time, yielding 2D and 3D semantic information with high accuracy.
arXiv Detail & Related papers (2024-03-19T13:39:05Z)
- InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior [27.773451301040424]
InstructScene is a novel generative framework that integrates a semantic graph prior and a layout decoder.
We show that the proposed method surpasses existing state-of-the-art approaches by a large margin.
arXiv Detail & Related papers (2024-02-07T10:09:00Z)
- CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion [83.30168660888913]
We present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes.
Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes.
The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model.
arXiv Detail & Related papers (2023-05-25T17:39:13Z)
- Incremental 3D Semantic Scene Graph Prediction from RGB Sequences [86.77318031029404]
We propose a real-time framework that incrementally builds a consistent 3D semantic scene graph of a scene given an RGB image sequence.
Our method consists of a novel incremental entity estimation pipeline and a scene graph prediction network.
The proposed network estimates 3D semantic scene graphs with iterative message passing using multi-view and geometric features extracted from the scene entities.
arXiv Detail & Related papers (2023-05-04T11:32:16Z)
- Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion [45.171150395915056]
3D semantic scene completion (SSC) is an ill-posed perception task that requires inferring a dense 3D scene from limited observations.
Previous camera-based methods struggle to predict accurate semantic scenes due to inherent geometric ambiguity and incomplete observations.
We resort to stereo matching and bird's-eye-view (BEV) representation learning to address these issues in SSC.
arXiv Detail & Related papers (2023-03-24T12:33:44Z)
- Mix3D: Out-of-Context Data Augmentation for 3D Scenes [33.939743149673696]
We present Mix3D, a data augmentation technique for segmenting large-scale 3D scenes.
In experiments, we show that models trained with Mix3D profit from a significant performance boost on indoor (ScanNet, S3DIS) and outdoor datasets.
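As a rough illustration of the out-of-context idea, mixing can be as simple as centering two labeled scenes and merging them into one training sample; the sketch below is an assumption about the gist, not the authors' code.

```python
# Hypothetical sketch of out-of-context scene mixing in the spirit of
# Mix3D: two labeled point clouds are merged into a single training
# sample so objects appear outside their usual scene context.
import numpy as np

def mix_scenes(points_a, labels_a, points_b, labels_b):
    # points_*: (N, 3) xyz coordinates; labels_*: (N,) semantic ids
    # Center each scene so the two point clouds overlap once merged.
    points_a = points_a - points_a.mean(axis=0)
    points_b = points_b - points_b.mean(axis=0)
    points = np.concatenate([points_a, points_b], axis=0)
    labels = np.concatenate([labels_a, labels_b], axis=0)
    return points, labels

pts, lbl = mix_scenes(np.random.rand(100, 3), np.zeros(100, dtype=int),
                      np.random.rand(80, 3), np.ones(80, dtype=int))
print(pts.shape, lbl.shape)  # (180, 3) (180,)
```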
arXiv Detail & Related papers (2021-10-05T17:57:45Z)
- Semantic Scene Completion via Integrating Instances and Scene in-the-Loop [73.11401855935726]
Semantic Scene Completion aims at reconstructing a complete 3D scene with precise voxel-wise semantics from a single-view depth or RGBD image.
We present the Scene-Instance-Scene Network (SISNet), which takes advantage of both instance-level and scene-level semantic information.
Our method is capable of inferring fine-grained shape details as well as nearby objects whose semantic categories are easily mixed up.
arXiv Detail & Related papers (2021-04-08T09:50:30Z)
- Semantic Scene Completion using Local Deep Implicit Functions on LiDAR Data [4.355440821669468]
We propose a scene segmentation network based on local Deep Implicit Functions as a novel learning-based method for scene completion.
We show that this continuous representation is suitable to encode geometric and semantic properties of extensive outdoor scenes without the need for spatial discretization.
Our experiments verify that our method generates a powerful representation that can be decoded into a dense 3D description of a given scene.
arXiv Detail & Related papers (2020-11-18T07:39:13Z)