AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting
- URL: http://arxiv.org/abs/2502.04981v2
- Date: Wed, 12 Mar 2025 03:12:18 GMT
- Title: AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting
- Authors: Xiaoyu Zhou, Jingqi Wang, Yongtao Wang, Yufei Wei, Nan Dong, Ming-Hsuan Yang
- Abstract summary: AutoOcc is a vision-centric automated pipeline for semantic occupancy annotation. We formulate the open-ended semantic occupancy reconstruction task to automatically generate scene occupancy. Our framework outperforms existing automated occupancy annotation methods without human labels.
- Score: 46.677120329555486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Obtaining high-quality 3D semantic occupancy from raw sensor data remains an essential yet challenging task, often requiring extensive manual labeling. In this work, we propose AutoOcc, a vision-centric automated pipeline for open-ended semantic occupancy annotation that integrates differentiable Gaussian splatting guided by vision-language models. We formulate the open-ended semantic occupancy reconstruction task to automatically generate scene occupancy by combining attention maps from vision-language models and foundation vision models. We devise semantic-aware Gaussians as intermediate geometric descriptors and propose a cumulative Gaussian-to-voxel splatting algorithm that enables effective and efficient occupancy annotation. Our framework outperforms existing automated occupancy annotation methods without human labels. AutoOcc also enables open-ended semantic occupancy auto-labeling, achieving robust performance in both static and dynamically complex scenarios. All source code and trained models will be released.
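The abstract only names the Gaussian-to-voxel step, so the following is a minimal, hypothetical sketch of the general idea, not the authors' algorithm. It assumes isotropic Gaussians, a regular voxel grid, and a simple opacity-weighted accumulation; the function name `splat_gaussians_to_voxels`, its parameters, and the threshold `occ_threshold` are all illustrative assumptions.

```python
import numpy as np

def splat_gaussians_to_voxels(means, scales, opacities, class_probs,
                              grid_min, voxel_size, grid_shape,
                              occ_threshold=0.5):
    """Accumulate semantic Gaussians into a voxel grid (illustrative only).

    means: (N, 3) centres, scales: (N,) isotropic std-devs (assumption),
    opacities: (N,), class_probs: (N, C) per-Gaussian class scores.
    Returns a boolean occupancy grid and an integer label grid (-1 = free).
    """
    grid_min = np.asarray(grid_min, dtype=float)
    grid_shape = tuple(grid_shape)
    num_classes = class_probs.shape[1]

    # Centres of all voxels, shape (X, Y, Z, 3).
    axes = [grid_min[d] + (np.arange(grid_shape[d]) + 0.5) * voxel_size
            for d in range(3)]
    centres = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)

    density = np.zeros(grid_shape)
    semantics = np.zeros(grid_shape + (num_classes,))
    for mu, sigma, alpha, probs in zip(means, scales, opacities, class_probs):
        # Only visit voxels within three standard deviations of the centre.
        lo = np.floor((mu - 3 * sigma - grid_min) / voxel_size).astype(int)
        hi = np.ceil((mu + 3 * sigma - grid_min) / voxel_size).astype(int)
        lo = np.clip(lo, 0, grid_shape)
        hi = np.clip(hi, 0, grid_shape)
        if np.any(hi <= lo):
            continue  # Gaussian lies entirely outside the grid.
        block = tuple(slice(l, h) for l, h in zip(lo, hi))
        sq_dist = np.sum((centres[block] - mu) ** 2, axis=-1)
        weight = alpha * np.exp(-0.5 * sq_dist / sigma ** 2)
        # Cumulative update: every Gaussian adds to density and class mass.
        density[block] += weight
        semantics[block] += weight[..., None] * probs

    occupied = density > occ_threshold
    labels = np.where(occupied, semantics.argmax(axis=-1), -1)
    return occupied, labels

# Toy usage: one Gaussian at the centre of a 3x3x3 grid, two classes.
occupied, labels = splat_gaussians_to_voxels(
    means=np.array([[1.5, 1.5, 1.5]]),
    scales=np.array([0.2]),
    opacities=np.array([0.9]),
    class_probs=np.array([[0.1, 0.9]]),
    grid_min=[0.0, 0.0, 0.0], voxel_size=1.0, grid_shape=(3, 3, 3))
```

In this toy setup only the voxel containing the Gaussian centre crosses the threshold. A real pipeline would likely use anisotropic covariances and a vectorised or GPU implementation; those details are not given in the abstract.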
Related papers
- SG-Reg: Generalizable and Efficient Scene Graph Registration [23.3853919684438]
We design a scene graph network to encode multiple modalities of semantic nodes.
In the back-end, we employ a robust pose estimator to decide transformation according to the correspondences.
Our method achieves a slightly higher registration recall while requiring only 52 KB of communication bandwidth for each query frame.
arXiv Detail & Related papers (2025-04-20T01:22:40Z) - EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting [108.15136508964011]
EgoSplat is a language-embedded 3D Gaussian Splatting framework for open-vocabulary egocentric scene understanding.
EgoSplat achieves state-of-the-art performance in both localization and segmentation tasks on two datasets.
arXiv Detail & Related papers (2025-03-14T12:21:26Z) - GeomGS: LiDAR-Guided Geometry-Aware Gaussian Splatting for Robot Localization [20.26969580492428]
We propose a novel 3DGS method called Geometry-Aware Gaussian Splatting (GeomGS). Our GeomGS demonstrates state-of-the-art geometric and localization performance across several benchmarks, while also improving photometric performance.
arXiv Detail & Related papers (2025-01-23T06:43:38Z) - EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding [63.99937807085461]
3D occupancy prediction provides a comprehensive description of the surrounding scenes. Most existing methods focus on offline perception from one or a few views. We formulate an embodied 3D occupancy prediction task to target this practical scenario and propose a Gaussian-based EmbodiedOcc framework to accomplish it.
arXiv Detail & Related papers (2024-12-05T17:57:09Z) - PF3plat: Pose-Free Feed-Forward 3D Gaussian Splatting [54.7468067660037]
PF3plat sets a new state-of-the-art across all benchmarks, supported by comprehensive ablation studies validating our design choices.
Our framework capitalizes on fast speed, scalability, and high-quality 3D reconstruction and view synthesis capabilities of 3DGS.
arXiv Detail & Related papers (2024-10-29T15:28:15Z) - Lift, Splat, Map: Lifting Foundation Masks for Label-Free Semantic Scene Completion [7.781799395896687]
We propose LSMap, a method to predict a continuous, open-set semantic and elevation-aware representation in bird's eye view.
Our model only requires a single RGBD image, does not require human labels, and operates in real time.
We show that our pre-trained representation outperforms existing visual foundation models at unsupervised semantic scene completion.
arXiv Detail & Related papers (2024-07-03T18:08:05Z) - Effective Rank Analysis and Regularization for Enhanced 3D Gaussian Splatting [33.01987451251659]
3D Gaussian Splatting (3DGS) has emerged as a promising technique capable of real-time rendering with high-quality 3D reconstruction. Despite its potential, 3DGS encounters challenges such as needle-like artifacts, suboptimal geometries, and inaccurate normals. We introduce the effective rank as a regularization, which constrains the structure of the Gaussians.
arXiv Detail & Related papers (2024-06-17T15:51:59Z) - Trim 3D Gaussian Splatting for Accurate Geometry Representation [72.00970038074493]
We introduce Trim 3D Gaussian Splatting (TrimGS) to reconstruct accurate 3D geometry from images.
Our experimental and theoretical analyses reveal that a relatively small Gaussian scale is a non-negligible factor in representing and optimizing the intricate details.
When combined with the original 3DGS and the state-of-the-art 2DGS, TrimGS consistently yields more accurate geometry and higher perceptual quality.
arXiv Detail & Related papers (2024-06-11T17:34:46Z) - Learning Semantic Traversability with Egocentric Video and Automated Annotation Strategy [3.713586225621126]
A robot must have the ability to identify semantically traversable terrains in the image based on the semantic understanding of the scene.
This reasoning ability is based on semantic traversability, which is frequently achieved using semantic segmentation models fine-tuned on the testing domain.
We present an effective methodology for training a semantic traversability estimator using egocentric videos and an automated annotation process.
arXiv Detail & Related papers (2024-06-05T06:40:04Z) - GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction [70.65250036489128]
3D semantic occupancy prediction aims to obtain 3D fine-grained geometry and semantics of the surrounding scene.
We propose an object-centric representation to describe 3D scenes with sparse 3D semantic Gaussians.
GaussianFormer achieves comparable performance with state-of-the-art methods with only 17.8% - 24.8% of their memory consumption.
arXiv Detail & Related papers (2024-05-27T17:59:51Z) - Label-efficient Semantic Scene Completion with Scribble Annotations [29.88371368606911]
We build a new label-efficient benchmark, named ScribbleSC, where the sparse scribble-based semantic labels are combined with dense geometric labels for semantic scene completion.
Our method consists of geometric-aware auto-labelers construction and online model training with an offline-to-online distillation module to enhance the performance.
arXiv Detail & Related papers (2024-05-24T03:09:50Z) - CLIP-GS: CLIP-Informed Gaussian Splatting for Real-time and View-consistent 3D Semantic Understanding [32.76277160013881]
We present CLIP-GS, which integrates semantics from Contrastive Language-Image Pre-Training (CLIP) into Gaussian Splatting.
SAC exploits the inherent unified semantics within objects to learn compact yet effective semantic representations of 3D Gaussians.
We also introduce a 3D Coherent Self-training (3DCS) strategy, resorting to the multi-view consistency originated from the 3D model.
arXiv Detail & Related papers (2024-04-22T15:01:32Z) - Agent-driven Generative Semantic Communication with Cross-Modality and Prediction [57.335922373309074]
We propose a novel agent-driven generative semantic communication framework based on reinforcement learning.
In this work, we develop an agent-assisted semantic encoder with cross-modality capability, which can track semantic changes and channel conditions to perform adaptive semantic extraction and sampling.
The effectiveness of the designed models has been verified using the UA-DETRAC dataset, demonstrating the performance gains of the overall A-GSC framework.
arXiv Detail & Related papers (2024-04-10T13:24:27Z) - Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z) - OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception [73.05425657479704]
We propose OpenOccupancy, which is the first surrounding semantic occupancy perception benchmark.
We extend the large-scale nuScenes dataset with dense semantic occupancy annotations.
Considering the complexity of surrounding occupancy perception, we propose the Cascade Occupancy Network (CONet) to refine the coarse prediction.
arXiv Detail & Related papers (2023-03-07T15:43:39Z) - Probabilistic Semantic Mapping for Urban Autonomous Driving Applications [1.181206257787103]
We propose to fuse image and pre-built point cloud map information to perform automatic and accurate labeling of static landmarks such as roads, sidewalks, crosswalks, and lanes.
The method performs semantic segmentation on 2D images, associates the semantic labels with point cloud maps to accurately localize them in the world, and leverages the confusion matrix formulation to construct a probabilistic semantic map in bird's eye view from semantic point clouds.
arXiv Detail & Related papers (2020-06-08T19:29:09Z) - SideInfNet: A Deep Neural Network for Semi-Automatic Semantic Segmentation with Side Information [83.03179580646324]
This paper proposes a novel deep neural network architecture, namely SideInfNet.
It integrates features learnt from images with side information extracted from user annotations.
To evaluate our method, we applied the proposed network to three semantic segmentation tasks and conducted extensive experiments on benchmark datasets.
arXiv Detail & Related papers (2020-02-07T06:10:54Z)