MIM4D: Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning
- URL: http://arxiv.org/abs/2403.08760v1
- Date: Wed, 13 Mar 2024 17:58:00 GMT
- Title: MIM4D: Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning
- Authors: Jialv Zou, Bencheng Liao, Qian Zhang, Wenyu Liu, Xinggang Wang
- Abstract summary: MIM4D is a novel pre-training paradigm based on dual masked image modeling (MIM).
It constructs pseudo-3D features using continuous scene flow and projects them onto a 2D plane for supervision.
It achieves state-of-the-art performance on the nuScenes dataset for visual representation learning in autonomous driving.
- Score: 38.6654451726187
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning robust and scalable visual representations from massive multi-view
video data remains a challenge in computer vision and autonomous driving.
Existing pre-training methods either rely on expensive supervised learning with
3D annotations, limiting the scalability, or focus on single-frame or monocular
inputs, neglecting the temporal information. We propose MIM4D, a novel
pre-training paradigm based on dual masked image modeling (MIM). MIM4D
leverages both spatial and temporal relations by training on masked multi-view
video inputs. It constructs pseudo-3D features using continuous scene flow and
projects them onto a 2D plane for supervision. To address the lack of dense 3D
supervision, MIM4D reconstructs pixels by employing 3D volumetric differentiable
rendering to learn geometric representations. We demonstrate that MIM4D
achieves state-of-the-art performance on the nuScenes dataset for visual
representation learning in autonomous driving. It significantly improves
existing methods on multiple downstream tasks, including BEV segmentation (8.7%
IoU), 3D object detection (3.5% mAP), and HD map construction (1.4% mAP). Our
work offers a new choice for learning representation at scale in autonomous
driving. Code and models are released at https://github.com/hustvl/MIM4D
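The pipeline described in the abstract can be summarized as: mask patches of the multi-view video, encode the visible content, lift the 2D features into a pseudo-3D volume, and supervise by rendering pixels back onto the 2D plane. Below is a minimal, illustrative PyTorch sketch of that dual masked-modeling idea; the module names, tensor shapes, patch size, and the naive depth-averaging stand-in for volumetric rendering are all assumptions made for clarity, not the released implementation (see https://github.com/hustvl/MIM4D for the authors' code).

```python
# Hypothetical sketch in the spirit of MIM4D: mask multi-view video frames,
# encode them, lift features to a pseudo-3D volume, and supervise by projecting
# the volume back to 2D pixels. Everything here is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F


def random_mask(x: torch.Tensor, mask_ratio: float = 0.75):
    """Zero out a random subset of patches in each frame.

    x: (B, N_cam, T, C, H, W) multi-view video clip. Returns the masked clip
    and a binary mask (1 = masked) upsampled to pixel resolution.
    """
    B, N, T, C, H, W = x.shape
    p = 16  # assumed patch size
    mask = (torch.rand(B, N, T, 1, H // p, W // p, device=x.device) < mask_ratio).float()
    mask = F.interpolate(mask.view(B * N * T, 1, H // p, W // p), scale_factor=p, mode="nearest")
    mask = mask.view(B, N, T, 1, H, W)
    return x * (1.0 - mask), mask


class ToyMIM4D(nn.Module):
    """Illustrative pipeline: 2D encoder -> pseudo-3D voxel volume -> rendered pixels."""

    def __init__(self, feat_dim: int = 64, voxel: int = 16):
        super().__init__()
        self.encoder = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)  # stand-in image encoder
        self.lift = nn.Linear(feat_dim, voxel * feat_dim)                 # naive 2D -> 3D lifting
        self.render_head = nn.Conv2d(feat_dim, 3, kernel_size=1)          # project volume back to RGB
        self.voxel = voxel

    def forward(self, frames_masked: torch.Tensor) -> torch.Tensor:
        B, N, T, C, H, W = frames_masked.shape
        x = frames_masked.view(B * N * T, C, H, W)
        feat = self.encoder(x)                                 # (B*N*T, D, H/16, W/16)
        D, h, w = feat.shape[1], feat.shape[2], feat.shape[3]
        vol = self.lift(feat.permute(0, 2, 3, 1))              # (B*N*T, h, w, voxel*D)
        vol = vol.view(-1, h, w, self.voxel, D)
        # "Render" by averaging over the depth axis; the real method would use
        # differentiable volumetric rendering with predicted densities.
        proj = vol.mean(dim=3).permute(0, 3, 1, 2)             # (B*N*T, D, h, w)
        rgb = self.render_head(proj)
        rgb = F.interpolate(rgb, size=(H, W), mode="bilinear", align_corners=False)
        return rgb.view(B, N, T, 3, H, W)


# Usage: reconstruct only the masked pixels, as in masked image modeling.
clip = torch.rand(2, 6, 4, 3, 64, 64)          # 2 samples, 6 cameras, 4 frames
clip_masked, mask = random_mask(clip)
model = ToyMIM4D()
pred = model(clip_masked)
loss = (F.mse_loss(pred, clip, reduction="none") * mask).sum() / mask.sum().clamp(min=1)
loss.backward()
```

In the actual method, the rendering step would predict densities and colors along camera rays and composite them differentiably; the depth-axis averaging above merely keeps the sketch self-contained and runnable.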
Related papers
- EmbodiedSAM: Online Segment Any 3D Thing in Real Time [61.2321497708998]
Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration.
An online, real-time, fine-grained and highly-generalized 3D perception model is desperately needed.
arXiv Detail & Related papers (2024-08-21T17:57:06Z) - Animate3D: Animating Any 3D Model with Multi-view Video Diffusion [47.05131487114018]
Animate3D is a novel framework for animating any static 3D model.
We introduce a framework combining reconstruction and 4D Score Distillation Sampling (4D-SDS) to leverage the multi-view video diffusion priors for animating 3D objects.
arXiv Detail & Related papers (2024-07-16T05:35:57Z) - DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving [67.46481099962088]
Current vision-centric pre-training typically relies on either 2D or 3D pre-text tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task.
We introduce DriveWorld, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion.
DriveWorld delivers promising results on various autonomous driving tasks.
arXiv Detail & Related papers (2024-05-07T15:14:20Z) - Point Cloud Self-supervised Learning via 3D to Multi-view Masked Autoencoder [21.73287941143304]
Multi-Modality Masked AutoEncoders (MAE) methods leverage both 2D images and 3D point clouds for pre-training.
We introduce a novel approach employing a 3D to multi-view masked autoencoder to fully harness the multi-modal attributes of 3D point clouds.
Our method outperforms state-of-the-art counterparts by a large margin in a variety of downstream tasks.
arXiv Detail & Related papers (2023-11-17T22:10:03Z) - SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving [98.74706005223685]
3D scene understanding plays a vital role in vision-based autonomous driving.
We propose a SurroundOcc method to predict the 3D occupancy with multi-camera images.
arXiv Detail & Related papers (2023-03-16T17:59:08Z) - 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z) - PerMO: Perceiving More at Once from a Single Image for Autonomous Driving [76.35684439949094]
We present a novel approach to detect, segment, and reconstruct complete textured 3D models of vehicles from a single image.
Our approach combines the strengths of deep learning and the elegance of traditional techniques.
We have integrated these algorithms with an autonomous driving system.
arXiv Detail & Related papers (2020-07-16T05:02:45Z) - Multi-View Matching (MVM): Facilitating Multi-Person 3D Pose Estimation Learning with Action-Frozen People Video [38.63662549684785]
The MVM method generates reliable 3D human poses from a large-scale video dataset.
We train a neural network that takes a single image as the input for multi-person 3D pose estimation.
arXiv Detail & Related papers (2020-04-11T01:09:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.