ET-Former: Efficient Triplane Deformable Attention for 3D Semantic Scene Completion From Monocular Camera
- URL: http://arxiv.org/abs/2410.11019v2
- Date: Sat, 01 Mar 2025 18:48:48 GMT
- Title: ET-Former: Efficient Triplane Deformable Attention for 3D Semantic Scene Completion From Monocular Camera
- Authors: Jing Liang, He Yin, Xuewei Qi, Jong Jin Park, Min Sun, Rajasimman Madhivanan, Dinesh Manocha,
- Abstract summary: We introduce ET-Former, a novel end-to-end algorithm for semantic scene completion using a single monocular camera. Our approach generates a semantic occupancy map from a single RGB observation while simultaneously providing uncertainty estimates for semantic predictions.
- Score: 53.20087549782785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce ET-Former, a novel end-to-end algorithm for semantic scene completion using a single monocular camera. Our approach generates a semantic occupancy map from a single RGB observation while simultaneously providing uncertainty estimates for semantic predictions. By designing a triplane-based deformable attention mechanism, our approach achieves a better geometric understanding of the scene than other SOTA approaches and reduces noise in semantic predictions. Additionally, through the use of a Conditional Variational AutoEncoder (CVAE), we estimate the uncertainties of these predictions. The generated semantic and uncertainty maps will help formulate navigation strategies that facilitate safe and permissible decision making in the future. Evaluated on the SemanticKITTI dataset, ET-Former achieves the highest Intersection over Union (IoU) and mean IoU (mIoU) scores while maintaining the lowest GPU memory usage, surpassing state-of-the-art (SOTA) methods. It improves the SOTA scores of IoU from 44.71 to 51.49 and mIoU from 15.04 to 16.30 on the SemanticKITTI test set, with a notably low training memory consumption of 10.9 GB. Project page: https://github.com/jingGM/ET-Former.git.
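Since the triplane-based deformable attention is the paper's central mechanism, here is a minimal PyTorch sketch of how such a layer could look; the module structure, names, and hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of triplane deformable attention: each 3D query is projected
# onto the XY, XZ, and YZ feature planes, learned offsets pick sampling points
# on each plane, and bilinearly sampled features are fused with learned
# attention weights. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneDeformableAttention(nn.Module):
    def __init__(self, dim=128, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offsets = nn.Linear(dim, 3 * n_points * 2)  # (x, y) offsets per plane
        self.weights = nn.Linear(dim, 3 * n_points)      # attention weight per sample
        self.proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_xyz, planes):
        # queries: (B, Q, C); ref_xyz: (B, Q, 3) normalized to [-1, 1];
        # planes: list of three feature maps, each (B, C, H, W).
        B, Q, C = queries.shape
        off = self.offsets(queries).view(B, Q, 3, self.n_points, 2)
        w = self.weights(queries).view(B, Q, -1).softmax(-1).view(B, Q, 3, self.n_points)
        # Project the 3D reference point onto the three orthogonal planes.
        refs = [ref_xyz[..., [0, 1]], ref_xyz[..., [0, 2]], ref_xyz[..., [1, 2]]]
        out = 0.0
        for p, (plane, ref) in enumerate(zip(planes, refs)):
            loc = (ref[:, :, None, :] + off[:, :, p]).clamp(-1, 1)   # (B, Q, P, 2)
            feat = F.grid_sample(plane, loc, align_corners=False)    # (B, C, Q, P)
            out = out + (feat.permute(0, 2, 3, 1) * w[:, :, p, :, None]).sum(2)
        return self.proj(out)  # fused per-query features, (B, Q, C)
```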
Related papers
- MetaOcc: Surround-View 4D Radar and Camera Fusion Framework for 3D Occupancy Prediction with Dual Training Strategies [10.662778683303726]
We propose MetaOcc, a novel multi-modal occupancy prediction framework.
We first design a height self-attention module (one possible form is sketched below) for effective 3D feature extraction from sparse radar points.
Finally, we develop a semi-supervised training procedure that leverages an open-set segmentor and geometric constraints for pseudo-label generation.
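One plausible reading of this height self-attention module, sketched below in PyTorch, is self-attention applied along the vertical axis of a voxelized radar feature volume; this is an assumption rather than MetaOcc's released code.

```python
# Hedged sketch: treat the height axis of a (B, C, Z, H, W) radar feature
# volume as a sequence so voxels at different heights exchange information.
import torch
import torch.nn as nn

class HeightSelfAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vox):
        B, C, Z, H, W = vox.shape
        seq = vox.permute(0, 3, 4, 2, 1).reshape(B * H * W, Z, C)  # height as sequence
        out, _ = self.attn(seq, seq, seq)
        return out.reshape(B, H, W, Z, C).permute(0, 4, 3, 1, 2)   # back to (B, C, Z, H, W)
```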
arXiv Detail & Related papers (2025-01-26T03:51:56Z)
- ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks.
We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation.
Our architecture, named ALOcc, achieves an optimal trade-off between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z)
- Context-Conditioned Spatio-Temporal Predictive Learning for Reliable V2V Channel Prediction [25.688521281119037]
Vehicle-to-Vehicle (V2V) channel state information (CSI) prediction is challenging and crucial for optimizing downstream tasks.
Traditional prediction approaches focus on four-dimensional (4D) CSI, which includes predictions over time, bandwidth, and antenna (TX and RX) space.
We propose a novel context-conditioned spatio-temporal predictive learning method to capture dependencies within 4D CSI data.
arXiv Detail & Related papers (2024-09-16T04:15:36Z)
- One Homography is All You Need: IMM-based Joint Homography and Multiple Object State Estimation [2.09942566943801]
IMM Joint Homography State Estimation (IMM-JHSE) is proposed.
IMM-JHSE uses an initial homography estimate as the only additional 3D information.
IMM-JHSE offers competitive performance on the MOT17, MOT20 and KITTI-car datasets.
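IMM-JHSE builds on the Interacting Multiple Model (IMM) filter; the NumPy sketch below shows the textbook IMM mixing step such a method starts from (the setup and variable names are generic, not the paper's).

```python
# Hedged sketch of the standard IMM mixing step, not IMM-JHSE itself.
import numpy as np

def imm_mix(mu, trans, states, covs):
    """mu: (M,) model probabilities; trans: (M, M) Markov transition matrix;
    states: (M, n) per-model state means; covs: (M, n, n) covariances."""
    c = trans.T @ mu                        # predicted model probabilities
    mix = (trans * mu[:, None]) / c[None]   # mix[i, j] = P(model i | model j)
    x0 = np.einsum('ij,in->jn', mix, states)
    P0 = np.zeros_like(covs)
    for j in range(len(mu)):
        for i in range(len(mu)):
            d = states[i] - x0[j]
            P0[j] += mix[i, j] * (covs[i] + np.outer(d, d))
    return c, x0, P0  # feed (x0[j], P0[j]) to model j's Kalman filter
```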
arXiv Detail & Related papers (2024-09-04T09:29:24Z)
- Self-supervised Multi-future Occupancy Forecasting for Autonomous Driving [45.886941596233974]
LiDAR-generated occupancy grid maps (L-OGMs) offer a robust bird's-eye-view representation of the scene.
Our proposed framework performs L-OGM prediction in the latent space of a generative architecture.
We decode predictions using either a single-step decoder, which provides high-quality predictions in real-time, or a diffusion-based batch decoder.
arXiv Detail & Related papers (2024-07-30T18:37:59Z)
- RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z)
- OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments [77.0399450848749]
We propose OccNeRF, a method for training occupancy networks without 3D supervision.
We parameterize the reconstructed occupancy fields and reorganize the sampling strategy to align with the cameras' infinite perceptive range.
For semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model.
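One common way to align sampling with an unbounded perceptive range, shown below, is a scene contraction in the style of mip-NeRF 360 that maps unbounded coordinates into a bounded ball; OccNeRF's exact parameterization may differ.

```python
# Hedged sketch of an unbounded-scene contraction: identity inside the unit
# ball, radius 2 - 1/||x|| outside, so points at infinity land on radius 2.
import torch

def contract(x, eps=1e-6):
    r = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.where(r <= 1.0, x, (2.0 - 1.0 / r) * x / r)
```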
arXiv Detail & Related papers (2023-12-14T18:58:52Z)
- Camera-based 3D Semantic Scene Completion with Sparse Guidance Network [18.415854443539786]
We propose a camera-based semantic scene completion framework called SGN.
SGN propagates semantics from semantic-aware seed voxels to the whole scene based on spatial geometry cues.
Our experimental results demonstrate the superiority of our SGN over existing state-of-the-art methods.
arXiv Detail & Related papers (2023-12-10T04:17:27Z)
- Exploiting Diffusion Prior for Generalizable Dense Prediction [85.4563592053464]
Content generated by recent advanced Text-to-Image (T2I) diffusion models is sometimes too imaginative for existing off-the-shelf dense predictors to estimate.
We introduce DMP, a pipeline utilizing pre-trained T2I models as a prior for dense prediction tasks.
Despite limited-domain training data, the approach yields faithful estimations for arbitrary images, surpassing existing state-of-the-art algorithms.
arXiv Detail & Related papers (2023-11-30T18:59:44Z)
- Volumetric Semantically Consistent 3D Panoptic Mapping [77.13446499924977]
We introduce an online 2D-to-3D semantic instance mapping algorithm aimed at generating semantic 3D maps suitable for autonomous agents in unstructured environments.
It introduces novel ways of integrating semantic prediction confidence during mapping, producing semantic and instance-consistent 3D regions.
The proposed method achieves accuracy superior to the state of the art on public large-scale datasets, improving on a number of widely used metrics.
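A typical way to integrate per-frame semantic confidence into a voxel map, sketched below, is to accumulate log class probabilities per voxel; this is a generic fusion scheme, not necessarily the paper's.

```python
# Hedged sketch of confidence-weighted semantic fusion into a voxel map.
import numpy as np

def fuse_semantics(voxel_logp, voxel_ids, frame_probs, eps=1e-9):
    """voxel_logp: (V, K) accumulated per-voxel log class probabilities;
    voxel_ids: (N,) voxel index of each observed point;
    frame_probs: (N, K) per-point class confidences from the 2D segmentor."""
    np.add.at(voxel_logp, voxel_ids, np.log(frame_probs + eps))
    return voxel_logp.argmax(axis=1)  # current per-voxel semantic labels
```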
arXiv Detail & Related papers (2023-09-26T08:03:10Z)
- Real-time 3D Semantic Scene Completion Via Feature Aggregation and Conditioned Prediction [17.54862035445157]
We propose a real-time semantic scene completion method with a feature aggregation strategy and conditioned prediction module.
Our method achieves competitive performance at a speed of 110 FPS on one GTX 1080 Ti GPU.
arXiv Detail & Related papers (2023-03-20T09:41:50Z)
- SoK: Vehicle Orientation Representations for Deep Rotation Estimation [2.052323405257355]
We study the accuracy performance of various existing orientation representations using the KITTI 3D object detection dataset.
We propose a new form of orientation representation: Tricosine.
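Assuming Tricosine encodes the heading angle as cosines at three phase offsets of 2π/3, the encoding and its closed-form decoding would look like the sketch below; the exact construction should be checked against the paper.

```python
# Hedged sketch of a "Tricosine" orientation encoding and its inverse.
import numpy as np

def tricosine_encode(theta):
    # Three cosines at phase offsets of 2*pi/3.
    return np.stack([np.cos(theta + k * 2 * np.pi / 3) for k in range(3)], axis=-1)

def tricosine_decode(c):
    # cos(theta) = c[0]; sin(theta) = (c[2] - c[1]) / sqrt(3).
    return np.arctan2((c[..., 2] - c[..., 1]) / np.sqrt(3), c[..., 0])
```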
arXiv Detail & Related papers (2021-12-08T17:12:54Z)
- Progressive Coordinate Transforms for Monocular 3D Object Detection [52.00071336733109]
We propose a novel and lightweight approach, dubbed Progressive Coordinate Transforms (PCT), to facilitate learning coordinate representations.
arXiv Detail & Related papers (2021-08-12T15:22:33Z)
- CodeVIO: Visual-Inertial Odometry with Learned Optimizable Dense Depth [83.77839773394106]
We present a lightweight, tightly-coupled deep depth network and visual-inertial odometry system.
We provide the network with previously marginalized sparse features from VIO to increase the accuracy of initial depth prediction.
We show that it can run in real-time with single-thread execution while utilizing GPU acceleration only for the network and code Jacobian.
arXiv Detail & Related papers (2020-12-18T09:42:54Z)
- 3D IoU-Net: IoU Guided 3D Object Detector for Point Clouds [68.44740333471792]
We add a 3D IoU prediction branch to the regular classification and regression branches.
We propose a 3D IoU-Net with IoU sensitive feature learning and an IoU alignment operation.
The experimental results on the KITTI car detection benchmark show that 3D IoU-Net with IoU perception achieves state-of-the-art performance.
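A common use of a predicted 3D IoU, sketched below, is to rescore detections before NMS so that ranking reflects localization quality; the blending rule here is an illustrative assumption.

```python
# Hedged sketch: blend classification confidence with predicted IoU.
import numpy as np

def iou_guided_scores(cls_scores, pred_ious, alpha=0.5):
    # Geometric interpolation; alpha trades classification vs. localization.
    return cls_scores ** (1 - alpha) * np.clip(pred_ious, 0, 1) ** alpha
```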
arXiv Detail & Related papers (2020-04-10T09:24:29Z)