LangOcc: Self-Supervised Open Vocabulary Occupancy Estimation via Volume Rendering
- URL: http://arxiv.org/abs/2407.17310v2
- Date: Thu, 25 Jul 2024 07:47:11 GMT
- Title: LangOcc: Self-Supervised Open Vocabulary Occupancy Estimation via Volume Rendering
- Authors: Simon Boeder, Fabian Gigengack, Benjamin Risse
- Abstract summary: LangOcc is a novel approach for open vocabulary occupancy estimation.
It is trained only via camera images and can detect arbitrary semantics via vision-language alignment.
We achieve state-of-the-art results in self-supervised semantic occupancy estimation on the Occ3D-nuScenes dataset.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The 3D occupancy estimation task has recently become an important challenge in vision-based autonomous driving. However, most existing camera-based methods rely on costly 3D voxel labels or LiDAR scans for training, limiting their practicality and scalability. Moreover, most methods are tied to a predefined set of classes which they can detect. In this work we present a novel approach for open vocabulary occupancy estimation called LangOcc, which is trained only via camera images and can detect arbitrary semantics via vision-language alignment. In particular, we distill the knowledge of the strong vision-language-aligned encoder CLIP into a 3D occupancy model via differentiable volume rendering. Our model estimates vision-language-aligned features in a 3D voxel grid using only images. It is trained in a self-supervised manner by rendering our estimations back to 2D space, where ground-truth features can be computed. This training mechanism automatically supervises the scene geometry, allowing for a straightforward and powerful training method without any explicit geometry supervision. LangOcc outperforms LiDAR-supervised competitors in open vocabulary occupancy by a large margin, relying solely on vision-based training. We also achieve state-of-the-art results in self-supervised semantic occupancy estimation on the Occ3D-nuScenes dataset, despite not being limited to a specific set of categories, thus demonstrating the effectiveness of our proposed vision-language training.
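The core training signal described in the abstract can be made concrete with a short sketch: language-aligned features stored in a voxel grid are volume-rendered along camera rays and compared against CLIP-derived 2D features. The snippet below is a minimal illustration under assumed simplifications (PyTorch, a dense trilinearly sampled grid, a fixed step size, a cosine distillation loss); all names and shapes are hypothetical and not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def render_features(density, features, ray_pts):
    """Volume-render per-ray features from a voxel grid.

    density:  (1, 1, D, H, W) raw density logits.
    features: (1, C, D, H, W) language-aligned feature volume.
    ray_pts:  (R, S, 3) sample points per ray in normalized [-1, 1] coords.
    Returns (R, C) rendered features, one per camera ray.
    """
    R, S, _ = ray_pts.shape
    grid = ray_pts.view(1, R, S, 1, 3)  # layout expected by 3D grid_sample
    sigma = F.grid_sample(density, grid, align_corners=True).view(R, S)
    feat = F.grid_sample(features, grid, align_corners=True)    # (1, C, R, S, 1)
    feat = feat.view(features.shape[1], R, S).permute(1, 2, 0)  # (R, S, C)

    # Standard alpha compositing; the fixed 0.05 step size is an assumption.
    alpha = 1.0 - torch.exp(-F.softplus(sigma) * 0.05)
    ones = torch.ones_like(alpha[:, :1])
    trans = torch.cumprod(torch.cat([ones, 1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans                                     # (R, S)
    return (weights.unsqueeze(-1) * feat).sum(dim=1)            # (R, C)

def distillation_loss(rendered, clip_targets):
    """Cosine distance between rendered features and per-pixel CLIP features;
    minimizing it supervises semantics and, implicitly, scene geometry."""
    return (1.0 - F.cosine_similarity(rendered, clip_targets, dim=-1)).mean()
```

Because the rendering weights depend on the predicted densities, matching the 2D features also supervises geometry, which is the "no explicit geometry supervision" property the abstract highlights. At inference, open-vocabulary queries can then be answered by comparing voxel features against CLIP text embeddings.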
Related papers
- Open3DTrack: Towards Open-Vocabulary 3D Multi-Object Tracking [73.05477052645885]
We introduce open-vocabulary 3D tracking, which extends the scope of 3D tracking to include objects beyond predefined categories.
We propose a novel approach that integrates open-vocabulary capabilities into a 3D tracking framework, allowing for generalization to unseen object classes.
arXiv Detail & Related papers (2024-10-02T15:48:42Z)
- Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image [70.02187124865627]
Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene.
We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes.
We demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection.
arXiv Detail & Related papers (2024-07-07T04:50:04Z)
- OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments [77.0399450848749]
We propose OccNeRF, a method for training occupancy networks without 3D supervision.
We parameterize the reconstructed occupancy fields and reorganize the sampling strategy to align with the cameras' infinite perceptive range (one common such parameterization is sketched after this entry).
For semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model.
arXiv Detail & Related papers (2023-12-14T18:58:52Z)
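On the "infinite perceptive range" point in the OccNeRF entry above: a common way to parameterize an unbounded scene is an invertible contraction that squashes all of R^3 into a bounded volume, so that a finite voxel grid can cover it. The sketch below shows the well-known mip-NeRF 360-style contraction as one concrete possibility; it is illustrative and not necessarily OccNeRF's exact formula.

```python
import torch

def contract(x: torch.Tensor) -> torch.Tensor:
    """Map points x (..., 3) from all of R^3 into the radius-2 ball.
    Points inside the unit ball are left unchanged; radii in (1, inf) are
    compressed into (1, 2), so a finite grid covers an infinite range."""
    norm = x.norm(dim=-1, keepdim=True).clamp(min=1e-6)
    return torch.where(norm <= 1.0, x, (2.0 - 1.0 / norm) * (x / norm))
```

Ray samples are then drawn in the contracted space, so far-away regions occupy a bounded slice of the grid.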
- Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning [55.517000360348725]
This work presents a framework for 3D scene understanding when labeled scenes are scarce.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
Experiments with both indoor and outdoor scenes demonstrated the effectiveness of our approach in both data-efficient learning and open-world few-shot learning.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- S4C: Self-Supervised Semantic Scene Completion with Neural Fields [54.35865716337547]
3D semantic scene understanding is a fundamental challenge in computer vision.
Current methods for semantic scene completion (SSC) are generally trained on 3D ground truth derived from aggregated LiDAR scans.
Our work presents the first self-supervised approach to SSC called S4C that does not rely on 3D ground truth data.
arXiv Detail & Related papers (2023-10-11T14:19:05Z)
- Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving [39.70689418558153]
We present a multi-modal auto-labeling pipeline capable of generating amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels.
Our pipeline exploits motion cues inherent in point cloud sequences in combination with the freely available 2D image-text pairs to identify and track all traffic participants.
arXiv Detail & Related papers (2023-09-25T19:33:52Z)
- View-to-Label: Multi-View Consistency for Self-Supervised 3D Object Detection [46.077668660248534]
We propose a novel approach to self-supervised 3D object detection from RGB sequences alone (a generic multi-view consistency check is sketched after this entry).
Our experiments on KITTI 3D dataset demonstrate performance on par with state-of-the-art self-supervised methods.
arXiv Detail & Related papers (2023-05-29T09:30:39Z)
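The multi-view consistency idea in the View-to-Label entry above is typically realized by reprojecting predicted 3D quantities (e.g., box centers) into another view using the known relative camera pose and penalizing the disagreement. The sketch below is a generic version of such a check under an assumed pinhole camera model; the names and the simple L1 penalty are illustrative, not the paper's formulation.

```python
import torch

def project(K: torch.Tensor, T: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
    """Pinhole projection of world points X (N, 3) with world-to-camera
    pose T (4, 4) and intrinsics K (3, 3); returns pixel coords (N, 2)."""
    X_h = torch.cat([X, torch.ones_like(X[:, :1])], dim=-1)  # homogeneous
    X_cam = (T @ X_h.T).T[:, :3]
    uv = (K @ X_cam.T).T
    return uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)

def consistency_loss(centers_3d, K2, T2, detections_2d):
    """L1 distance between reprojected 3D box centers and the corresponding
    2D detections in a second view, encouraging geometrically consistent
    predictions across views."""
    return (project(K2, T2, centers_3d) - detections_2d).abs().mean()
```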
- Visual Reinforcement Learning with Self-Supervised 3D Representations [15.991546692872841]
We present a unified framework for self-supervised learning of 3D representations for motor control.
Our method enjoys improved sample efficiency in simulated manipulation tasks compared to 2D representation learning methods.
arXiv Detail & Related papers (2022-10-13T17:59:55Z)
- Unsupervised Learning of Efficient Geometry-Aware Neural Articulated Representations [89.1388369229542]
We propose an unsupervised method for 3D geometry-aware representation learning of articulated objects.
We obviate the need for direct supervision by learning the representations with GAN training.
Experiments demonstrate the efficiency of our method and show that GAN-based training enables learning of controllable 3D representations without supervision.
arXiv Detail & Related papers (2022-04-19T12:10:18Z)