Learning Ego 3D Representation as Ray Tracing
- URL: http://arxiv.org/abs/2206.04042v1
- Date: Wed, 8 Jun 2022 17:55:50 GMT
- Title: Learning Ego 3D Representation as Ray Tracing
- Authors: Jiachen Lu, Zheyuan Zhou, Xiatian Zhu, Hang Xu, Li Zhang
- Abstract summary: We present a novel end-to-end architecture for ego 3D representation learning from unconstrained camera views.
Inspired by the ray tracing principle, we design a polarized grid of "imaginary eyes" as the learnable ego 3D representation.
We show that our model outperforms all state-of-the-art alternatives significantly.
- Score: 42.400505280851114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A self-driving perception model aims to extract 3D semantic representations
from multiple cameras collectively into the bird's-eye-view (BEV) coordinate
frame of the ego car in order to ground downstream planner. Existing perception
methods often rely on error-prone depth estimation of the whole scene or
learning sparse virtual 3D representations without the target geometry
structure, both of which remain limited in performance and/or capability. In
this paper, we present a novel end-to-end architecture for ego 3D
representation learning from an arbitrary number of unconstrained camera views.
Inspired by the ray tracing principle, we design a polarized grid of "imaginary
eyes" as the learnable ego 3D representation and formulate the learning process
with the adaptive attention mechanism in conjunction with the 3D-to-2D
projection. Critically, this formulation allows extracting rich 3D
representation from 2D images without any depth supervision, and with the
built-in geometry structure consistent w.r.t. BEV. Despite its simplicity and
versatility, extensive experiments on standard BEV visual tasks (e.g.,
camera-based 3D object detection and BEV segmentation) show that our model
outperforms all state-of-the-art alternatives significantly, with an extra
advantage in computational efficiency from multi-task learning.
Related papers
- BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence [11.91274849875519]
We introduce a novel image-centric 3D perception model, BIP3D, to overcome the limitations of point-centric methods.
We leverage pre-trained 2D vision foundation models to enhance semantic understanding, and introduce a spatial enhancer module to improve spatial understanding.
In our experiments, BIP3D outperforms current state-of-the-art results on the EmbodiedScan benchmark, achieving improvements of 5.69% in the 3D detection task and 15.25% in the 3D visual grounding task.
arXiv Detail & Related papers (2024-11-22T11:35:42Z) - SUGAR: Pre-training 3D Visual Representations for Robotics [85.55534363501131]
We introduce a novel 3D pre-training framework for robotics named SUGAR.
SUGAR captures semantic, geometric and affordance properties of objects through 3D point clouds.
We show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.
arXiv Detail & Related papers (2024-04-01T21:23:03Z) - Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z) - PonderV2: Pave the Way for 3D Foundation Model with A Universal
Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z) - CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World
Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$2$) to learn the transferable 3D point cloud representation in realistic scenarios.
Specifically, we exploit naturally-existed correspondences in 2D and 3D scenarios, and build well-aligned and instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z) - Voxel-based 3D Detection and Reconstruction of Multiple Objects from a
Single Image [22.037472446683765]
We learn a regular grid of 3D voxel features from the input image which is aligned with 3D scene space via a 3D feature lifting operator.
Based on the 3D voxel features, our novel CenterNet-3D detection head formulates the 3D detection as keypoint detection in the 3D space.
We devise an efficient coarse-to-fine reconstruction module, including coarse-level voxelization and a novel local PCA-SDF shape representation.
arXiv Detail & Related papers (2021-11-04T18:30:37Z) - GRF: Learning a General Radiance Field for 3D Representation and
Rendering [4.709764624933227]
We present a simple yet powerful neural network that implicitly represents and renders 3D objects and scenes only from 2D observations.
The network models 3D geometries as a general radiance field, which takes a set of 2D images with camera poses and intrinsics as input.
Our method can generate high-quality and realistic novel views for novel objects, unseen categories and challenging real-world scenes.
arXiv Detail & Related papers (2020-10-09T14:21:43Z) - Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled
Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2d detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.