RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering
Assisted Distillation
- URL: http://arxiv.org/abs/2312.11829v1
- Date: Tue, 19 Dec 2023 03:39:56 GMT
- Title: RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering
Assisted Distillation
- Authors: Haiming Zhang, Xu Yan, Dongfeng Bai, Jiantao Gao, Pan Wang, Bingbing
Liu, Shuguang Cui, Zhen Li
- Abstract summary: 3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
- Score: 50.35403070279804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D occupancy prediction is an emerging task that aims to estimate the
occupancy states and semantics of 3D scenes using multi-view images. However,
image-based scene perception encounters significant challenges in achieving
accurate prediction due to the absence of geometric priors. In this paper, we
address this issue by exploring cross-modal knowledge distillation in this
task, i.e., we leverage a stronger multi-modal model to guide the visual model
during training. In practice, we observe that directly applying features or
logits alignment, proposed and widely used in bird's-eyeview (BEV) perception,
does not yield satisfactory results. To overcome this problem, we introduce
RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
By employing differentiable volume rendering, we generate depth and semantic
maps in perspective views and propose two novel consistency criteria between
the rendered outputs of teacher and student models. Specifically, the depth
consistency loss aligns the termination distributions of the rendered rays,
while the semantic consistency loss mimics the intra-segment similarity guided
by vision foundation models (VLMs). Experimental results on the nuScenes
dataset demonstrate the effectiveness of our proposed method in improving
various 3D occupancy prediction approaches, e.g., our proposed methodology
enhances our baseline by 2.2% in the metric of mIoU and achieves 50% in Occ3D
benchmark.
Related papers
- ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks.
We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation.
Our purelytemporalal architecture framework, named ALOcc, achieves an optimal tradeoff between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z) - SAM-Guided Masked Token Prediction for 3D Scene Understanding [20.257222696422215]
Foundation models have significantly enhanced 2D task performance, and recent works like Bridge3D have successfully applied these models to improve 3D scene understanding.
However, challenges such as the misalignment between 2D and 3D representations and the persistent long-tail distribution in 3D datasets still restrict the effectiveness of knowledge distillation.
We introduce a novel SAM-guided tokenization method that seamlessly aligns 3D transformer structures with region-level knowledge distillation.
arXiv Detail & Related papers (2024-10-16T01:38:59Z) - Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments robustly display our method's consistent superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z) - Exploring Latent Cross-Channel Embedding for Accurate 3D Human Pose
Reconstruction in a Diffusion Framework [6.669850111205944]
Monocular 3D human pose estimation poses significant challenges due to inherent depth ambiguities that arise during the reprojection process from 2D to 3D.
Recent advancements in diffusion models have shown promise in incorporating structural priors to address reprojection ambiguities.
We propose a novel cross-channel embedding framework that aims to fully explore the correlation between joint-level features of 3D coordinates and their 2D projections.
arXiv Detail & Related papers (2024-01-18T09:53:03Z) - Parametric Depth Based Feature Representation Learning for Object
Detection and Segmentation in Bird's Eye View [44.78243406441798]
This paper focuses on leveraging geometry information, such as depth, to model such feature transformation.
We first lift the 2D image features to the 3D space defined for the ego vehicle via a predicted parametric depth distribution for each pixel in each view.
We then aggregate the 3D feature volume based on the 3D space occupancy derived from depth to the BEV frame.
arXiv Detail & Related papers (2023-07-09T06:07:22Z) - Cross-Dimensional Refined Learning for Real-Time 3D Visual Perception
from Monocular Video [2.2299983745857896]
We present a novel real-time capable learning method that jointly perceives a 3D scene's geometry structure and semantic labels.
We propose an end-to-end cross-dimensional refinement neural network (CDRNet) to extract both 3D mesh and 3D semantic labeling in real time.
arXiv Detail & Related papers (2023-03-16T11:53:29Z) - A Dual-Cycled Cross-View Transformer Network for Unified Road Layout
Estimation and 3D Object Detection in the Bird's-Eye-View [4.251500966181852]
We propose a unified model for road layout estimation and 3D object detection inspired by the transformer architecture and the CycleGAN learning framework.
We set up extensive learning scenarios to study the effect of multi-class learning for road layout estimation in various situations.
Experiment results attest the effectiveness of our model; we achieve state-of-the-art performance in both the road layout estimation and 3D object detection tasks.
arXiv Detail & Related papers (2022-09-19T08:43:38Z) - RiCS: A 2D Self-Occlusion Map for Harmonizing Volumetric Objects [68.85305626324694]
Ray-marching in Camera Space (RiCS) is a new method to represent the self-occlusions of foreground objects in 3D into a 2D self-occlusion map.
We show that our representation map not only allows us to enhance the image quality but also to model temporally coherent complex shadow effects.
arXiv Detail & Related papers (2022-05-14T05:35:35Z) - On Triangulation as a Form of Self-Supervision for 3D Human Pose
Estimation [57.766049538913926]
Supervised approaches to 3D pose estimation from single images are remarkably effective when labeled data is abundant.
Much of the recent attention has shifted towards semi and (or) weakly supervised learning.
We propose to impose multi-view geometrical constraints by means of a differentiable triangulation and to use it as form of self-supervision during training when no labels are available.
arXiv Detail & Related papers (2022-03-29T19:11:54Z) - Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose
Estimation [70.32536356351706]
We introduce MRP-Net that constitutes a common deep network backbone with two output heads subscribing to two diverse configurations.
We derive suitable measures to quantify prediction uncertainty at both pose and joint level.
We present a comprehensive evaluation of the proposed approach and demonstrate state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2022-03-29T07:14:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.