Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction
- URL: http://arxiv.org/abs/2501.01119v1
- Date: Thu, 02 Jan 2025 07:37:09 GMT
- Title: Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction
- Authors: Xuan Yu, Yuxuan Xie, Yili Liu, Haojian Lu, Rong Xiong, Yiyi Liao, Yue Wang
- Abstract summary: PanopticRecon++ is an end-to-end method that formulates panoptic reconstruction through a novel cross-attention perspective. This perspective models the relationship between 3D instances (as queries) and the scene's 3D embedding field (as keys) through their attention map. PanopticRecon++ shows competitive 3D and 2D segmentation and reconstruction performance on both simulation and real-world datasets.
- Score: 24.82894136068243
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary panoptic reconstruction offers comprehensive scene understanding, enabling advances in embodied robotics and photorealistic simulation. In this paper, we propose PanopticRecon++, an end-to-end method that formulates panoptic reconstruction through a novel cross-attention perspective. This perspective models the relationship between 3D instances (as queries) and the scene's 3D embedding field (as keys) through their attention map. Unlike existing methods that separate the optimization of queries and keys or overlook spatial proximity, PanopticRecon++ introduces learnable 3D Gaussians as instance queries. This formulation injects 3D spatial priors to preserve proximity while maintaining end-to-end optimizability. Moreover, this query formulation facilitates the alignment of 2D open-vocabulary instance IDs across frames by leveraging optimal linear assignment with instance masks rendered from the queries. Additionally, we ensure semantic-instance segmentation consistency by fusing query-based instance segmentation probabilities with semantic probabilities in a novel panoptic head supervised by a panoptic loss. During training, the number of instance query tokens dynamically adapts to match the number of objects. PanopticRecon++ shows competitive 3D and 2D segmentation and reconstruction performance on both simulation and real-world datasets, and demonstrates a use case as a robot simulator. Our project website is at: https://yuxuan1206.github.io/panopticrecon_pp/
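The abstract describes three concrete mechanisms: Gaussian instance queries attending over field keys, a panoptic head fusing instance and semantic probabilities, and optimal linear assignment for cross-frame instance-ID alignment. The sketch below is a minimal illustration of these ideas under our own assumptions, not the authors' implementation; the function names, tensor shapes, and isotropic initialization are hypothetical.

```python
# A minimal sketch (our own construction, NOT the PanopticRecon++ code) of
# the cross-attention formulation: N instance queries, each a learnable 3D
# Gaussian, attend over P sampled points of the scene's 3D embedding field.
import numpy as np
import torch
from scipy.optimize import linear_sum_assignment

def gaussian_query_logits(points, means, inv_covs):
    """Attention logits between P field points (keys) and N Gaussian
    instance queries; the negative Mahalanobis distance acts as the 3D
    spatial prior that keeps each query's attention local.

    points:   (P, 3) sampled positions in the 3D embedding field
    means:    (N, 3) learnable Gaussian centers (instance queries)
    inv_covs: (N, 3, 3) learnable inverse covariances
    returns:  (P, N) unnormalized logits
    """
    diff = points[:, None, :] - means[None, :, :]             # (P, N, 3)
    maha = torch.einsum('pni,nij,pnj->pn', diff, inv_covs, diff)
    return -0.5 * maha

def panoptic_probs(points, means, inv_covs, sem_logits):
    """Toy panoptic head: per-point instance probabilities from the
    attention map, alongside semantic probabilities from the field.
    A real head would fuse the two under a panoptic loss."""
    inst = torch.softmax(gaussian_query_logits(points, means, inv_covs), dim=-1)
    sem = torch.softmax(sem_logits, dim=-1)                   # (P, C)
    return inst, sem

def align_instance_ids(iou):
    """Align per-frame 2D open-vocabulary instance IDs to query-rendered
    masks by optimal linear assignment (maximize total mask IoU).

    iou: (N_query, N_frame) pairwise mask-IoU matrix as a NumPy array.
    returns: {query_id: frame_instance_id}
    """
    rows, cols = linear_sum_assignment(-iou)    # negate to maximize IoU
    return dict(zip(rows.tolist(), cols.tolist()))

if __name__ == "__main__":
    P, N, C = 1024, 8, 20
    points = torch.rand(P, 3)
    means = torch.nn.Parameter(torch.rand(N, 3))
    inv_covs = torch.eye(3).expand(N, 3, 3).clone()           # isotropic start
    inst, sem = panoptic_probs(points, means, inv_covs, torch.randn(P, C))
    ids = align_instance_ids(np.random.rand(N, N))
    print(inst.shape, sem.shape, ids)
```

Because the query logits are differentiable in the Gaussian parameters, the instance queries can be optimized end-to-end together with the field, which is the property the abstract contrasts against methods that optimize queries and keys separately.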
Related papers
- Segment then Splat: A Unified Approach for 3D Open-Vocabulary Segmentation based on Gaussian Splatting [11.186317340623807]
Open-vocabulary querying in 3D space is crucial for enabling more intelligent perception in applications such as robotics, autonomous systems, and augmented reality.
Most existing methods rely on 2D pixel-level parsing, leading to multi-view inconsistencies and poor 3D object retrieval.
We propose Segment then Splat, a 3D-aware open-vocabulary segmentation approach for both static and dynamic scenes.
arXiv Detail & Related papers (2025-03-28T07:36:51Z) - SliceOcc: Indoor 3D Semantic Occupancy Prediction with Vertical Slice Representation [50.420711084672966]
We present SliceOcc, an RGB camera-based model specifically tailored for indoor 3D semantic occupancy prediction.
Experimental results on the EmbodiedScan dataset demonstrate that SliceOcc achieves a mIoU of 15.45% across 81 indoor categories.
arXiv Detail & Related papers (2025-01-28T03:41:24Z) - DCSEG: Decoupled 3D Open-Set Segmentation using Gaussian Splatting [0.0]
We present a decoupled 3D segmentation pipeline to ensure modularity and adaptability to novel 3D representations.
We evaluate on synthetic and real-world indoor datasets, demonstrating improved performance over comparable NeRF-based pipelines.
arXiv Detail & Related papers (2024-12-14T21:26:44Z) - 3D Part Segmentation via Geometric Aggregation of 2D Visual Features [57.20161517451834]
Supervised 3D part segmentation models are tailored for a fixed set of objects and parts, limiting their transferability to open-set, real-world scenarios. Recent works have explored vision-language models (VLMs) as a promising alternative, using multi-view rendering and textual prompting to identify object parts. To address these limitations, we propose COPS, a COmprehensive model for Parts that blends semantics extracted from visual concepts and 3D geometry to effectively identify object parts.
arXiv Detail & Related papers (2024-12-05T15:27:58Z) - Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding [59.51535163599723]
FreeGS is an unsupervised semantic-embedded 3DGS framework that achieves view-consistent 3D scene understanding without the need for 2D labels.
We show that FreeGS performs comparably to state-of-the-art methods while avoiding the complex data preprocessing workload.
arXiv Detail & Related papers (2024-11-29T08:52:32Z) - Open-Vocabulary Octree-Graph for 3D Scene Understanding [54.11828083068082]
Octree-Graph is a novel scene representation for open-vocabulary 3D scene understanding.
An adaptive-octree structure is developed that stores semantics and depicts the occupancy of an object adjustably according to its shape.
arXiv Detail & Related papers (2024-11-25T10:14:10Z) - PanopticRecon: Leverage Open-vocabulary Instance Segmentation for Zero-shot Panoptic Reconstruction [23.798691661418253]
We propose a novel zero-shot panoptic reconstruction method from RGB-D images of scenes.
We tackle both challenges by propagating partial labels with the aid of dense generalized features.
Our method outperforms state-of-the-art methods on the indoor dataset ScanNet V2 and the outdoor dataset KITTI-360.
arXiv Detail & Related papers (2024-07-01T15:06:04Z) - Mixed Diffusion for 3D Indoor Scene Synthesis [55.94569112629208]
We present MiDiffusion, a novel mixed discrete-continuous diffusion model designed to synthesize plausible 3D indoor scenes. We show it outperforms state-of-the-art autoregressive and diffusion models in floor-conditioned 3D scene synthesis.
arXiv Detail & Related papers (2024-05-31T17:54:52Z) - View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields [52.08335264414515]
We learn a novel feature field within a Neural Radiance Field (NeRF) representing a 3D scene.
Our method takes view-inconsistent multi-granularity 2D segmentations as input and produces a hierarchy of 3D-consistent segmentations as output.
We evaluate our method and several baselines on synthetic datasets with multi-view images and multi-granular segmentation, showcasing improved accuracy and viewpoint-consistency.
arXiv Detail & Related papers (2024-05-30T04:14:58Z) - Fully Sparse 3D Occupancy Prediction [37.265473869812816]
Occupancy prediction plays a pivotal role in autonomous driving.
Previous methods typically construct dense 3D volumes, neglecting the inherent sparsity of the scene and suffering from high computational costs.
We introduce a novel fully sparse occupancy network, termed SparseOcc.
SparseOcc initially reconstructs a sparse 3D representation from camera-only inputs and subsequently predicts semantic/instance occupancy from the 3D sparse representation by sparse queries.
arXiv Detail & Related papers (2023-12-28T16:54:53Z) - Panoptic Lifting for 3D Scene Understanding with Neural Fields [32.59498558663363]
We propose a novel approach for learning panoptic 3D representations from images of in-the-wild scenes.
Our method requires only machine-generated 2D panoptic segmentation masks inferred from a pre-trained network.
Experimental results validate our approach on the challenging Hypersim, Replica, and ScanNet datasets.
arXiv Detail & Related papers (2022-12-19T19:15:36Z) - SCFusion: Real-time Incremental Scene Reconstruction with Semantic Completion [86.77318031029404]
We propose a framework that performs scene reconstruction and semantic scene completion jointly in an incremental and real-time manner.
Our framework relies on a novel neural architecture designed to process occupancy maps and leverages voxel states to accurately and efficiently fuse semantic completion with the 3D global model.
arXiv Detail & Related papers (2020-10-26T15:31:52Z)