MUVO: A Multimodal Generative World Model for Autonomous Driving with Geometric Representations
- URL: http://arxiv.org/abs/2311.11762v2
- Date: Thu, 23 Nov 2023 17:26:53 GMT
- Title: MUVO: A Multimodal Generative World Model for Autonomous Driving with Geometric Representations
- Authors: Daniel Bogdoll, Yitian Yang, J. Marius Zöllner
- Abstract summary: Unsupervised world models for autonomous driving have the potential to dramatically improve the reasoning capabilities of today's systems.
We propose MUVO, a MUltimodal World Model with Geometric VOxel Representations to address this challenge.
- Score: 13.07537039737708
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning unsupervised world models for autonomous driving has the potential to dramatically improve the reasoning capabilities of today's systems. However, most work neglects the physical attributes of the world and focuses on sensor data alone. We propose MUVO, a MUltimodal World Model with Geometric VOxel Representations, to address this challenge. We utilize raw camera and lidar data to learn a sensor-agnostic geometric representation of the world, which can be used directly by downstream tasks such as planning. We demonstrate multimodal future predictions and show that our geometric representation improves the prediction quality of both camera images and lidar point clouds.
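To make the pipeline the abstract describes concrete, here is a minimal sketch: camera and lidar streams are encoded into a shared latent, which is decoded into a 3D occupancy voxel grid that a planner could consume directly. All module shapes, layer choices, and names below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the authors' code): fuse camera and lidar
# features into one latent, then decode that latent into a 3D
# occupancy voxel grid usable by downstream tasks such as planning.
class MultimodalVoxelWorldModel(nn.Module):
    def __init__(self, latent_dim=512, voxel_grid=(64, 64, 8)):
        super().__init__()
        self.voxel_grid = voxel_grid
        self.cam_encoder = nn.Sequential(  # image -> feature vector
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, latent_dim),
        )
        self.lidar_encoder = nn.Sequential(  # per-point MLP on (N, 3) points
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, latent_dim),
        )
        # decode the fused latent into per-voxel occupancy logits
        self.voxel_decoder = nn.Linear(
            2 * latent_dim, voxel_grid[0] * voxel_grid[1] * voxel_grid[2]
        )

    def forward(self, image, points):
        cam = self.cam_encoder(image)  # (B, latent_dim)
        # max-pool over points for a permutation-invariant lidar feature
        lidar = self.lidar_encoder(points).max(dim=1).values
        fused = torch.cat([cam, lidar], dim=-1)
        logits = self.voxel_decoder(fused)
        return logits.view(-1, *self.voxel_grid)  # (B, X, Y, Z) occupancy logits


model = MultimodalVoxelWorldModel()
occ = model(torch.randn(2, 3, 128, 256), torch.randn(2, 1024, 3))
print(occ.shape)  # torch.Size([2, 64, 64, 8])
```

The occupancy grid, rather than any single sensor's raw output, is what makes the representation sensor-agnostic in the sense the abstract emphasizes.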
Related papers
- Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond [101.15395503285804]
General world models represent a crucial pathway toward achieving Artificial General Intelligence (AGI).
In this survey, we embark on a comprehensive exploration of the latest advancements in world models.
We examine challenges and limitations of world models, and discuss their potential future directions.
arXiv Detail & Related papers (2024-05-06T14:37:07Z)
- Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving [56.381918362410175]
Drive-WM is the first driving world model compatible with existing end-to-end planning models.
Our model generates high-fidelity multiview videos in driving scenes.
arXiv Detail & Related papers (2023-11-29T18:59:47Z)
- ADriver-I: A General World Model for Autonomous Driving [23.22507419707926]
We introduce the concept of interleaved vision-action pair, which unifies the format of visual features and control signals.
Based on the vision-action pairs, we construct a general world model based on MLLM and diffusion model for autonomous driving, termed ADriver-I.
It takes the vision-action pairs as inputs and autoregressively predicts the control signal of the current frame.
arXiv Detail & Related papers (2023-11-22T17:44:29Z)
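As a rough illustration of the interleaved vision-action format described in this entry, the sketch below flattens per-frame vision features and control signals into a single alternating token sequence that an autoregressive model could consume. Shapes and function names are assumptions, not ADriver-I's code.

```python
import torch

# Illustrative sketch of the "interleaved vision-action pair" idea:
# vision features and control signals share one token sequence
# (v_1, a_1, v_2, a_2, ...) so a single autoregressive model can
# predict the next action from everything seen and done so far.
def interleave_vision_action(vision_tokens, action_tokens):
    """vision_tokens: (T, D), action_tokens: (T, D) -> (2*T, D)."""
    T, D = vision_tokens.shape
    seq = torch.empty(2 * T, D, dtype=vision_tokens.dtype)
    seq[0::2] = vision_tokens  # even positions: what the model saw
    seq[1::2] = action_tokens  # odd positions: what the driver did
    return seq

seq = interleave_vision_action(torch.randn(8, 256), torch.randn(8, 256))
print(seq.shape)  # torch.Size([16, 256])
```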
- Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion [36.321494200830244]
Copilot4D is a novel world modeling approach that first tokenizes sensor observations with a VQ-VAE, then predicts the future via discrete diffusion.
Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotics.
arXiv Detail & Related papers (2023-11-02T06:21:56Z)
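A minimal sketch of the two stages this summary names: nearest-codebook tokenization, followed by an absorbing-state corruption step that a denoiser would be trained to invert. All shapes and the MASK-token convention are assumptions, not Copilot4D's implementation.

```python
import torch

# (1) VQ-style tokenization: map each continuous feature to the index
#     of its nearest codebook vector.
# (2) Discrete diffusion (absorbing-state form): corrupt a fraction of
#     tokens to a MASK id; a denoiser learns to restore the originals.
def tokenize(features, codebook):
    """features: (N, D), codebook: (K, D) -> (N,) integer tokens."""
    dists = torch.cdist(features, codebook)  # (N, K) pairwise distances
    return dists.argmin(dim=1)

def corrupt(tokens, mask_id, p=0.3):
    """Replace roughly a fraction p of tokens with the MASK id."""
    mask = torch.rand(tokens.shape) < p
    return torch.where(mask, torch.full_like(tokens, mask_id), tokens), mask

codebook = torch.randn(1024, 64)  # K=1024 codes, 64-dim each
tokens = tokenize(torch.randn(50, 64), codebook)
noisy, mask = corrupt(tokens, mask_id=1024)
# a denoiser (e.g. a Transformer) would be trained to predict
# tokens[mask] from `noisy`, analogous to masked-token modeling
```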
- GAIA-1: A Generative World Model for Autonomous Driving [9.578453700755318]
We introduce GAIA-1 ('Generative AI for Autonomy'), a generative world model that produces realistic driving scenarios.
Emerging properties from our model include learning high-level structures and scene dynamics, contextual awareness, generalization, and understanding of geometry.
arXiv Detail & Related papers (2023-09-29T09:20:37Z)
- Neural World Models for Computer Vision [2.741266294612776]
We present a framework to train a world model and a policy, parameterised by deep neural networks.
We leverage important computer vision concepts such as geometry, semantics, and motion to scale world models to complex urban driving scenes.
Our model can jointly predict static scene, dynamic scene, and ego-behaviour in an urban driving environment.
arXiv Detail & Related papers (2023-06-15T14:58:21Z)
- Policy Pre-training for End-to-end Autonomous Driving via Self-supervised Geometric Modeling [96.31941517446859]
We propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward, fully self-supervised framework for policy pretraining in visuomotor driving.
We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos.
In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input.
In the second stage, the visual encoder learns a driving policy representation by predicting future ego-motion and optimizing the photometric error based on the current visual observation only.
arXiv Detail & Related papers (2023-01-03T08:52:49Z)
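The photometric objective in PPGeo's second stage follows the standard self-supervised depth-and-pose recipe: back-project the target frame with the predicted depth, move the points with the predicted pose, reproject, and compare. Below is a condensed sketch under assumed shapes and a pinhole intrinsics matrix K, not PPGeo's exact implementation.

```python
import torch
import torch.nn.functional as F

# Condensed sketch of a photometric reprojection loss: warp the source
# frame into the target view using predicted depth and relative pose,
# then penalize the L1 difference against the target frame.
def photometric_loss(target, source, depth, pose, K):
    """target/source: (B,3,H,W), depth: (B,1,H,W), pose: (B,4,4), K: (3,3)."""
    B, _, H, W = target.shape
    # pixel grid in homogeneous coordinates: (3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).float().reshape(3, -1)
    # back-project to 3D with predicted depth, move into the source frame
    cam = torch.linalg.inv(K) @ pix                # camera rays, (3, H*W)
    pts = depth.reshape(B, 1, -1) * cam            # (B, 3, H*W)
    pts = pose[:, :3, :3] @ pts + pose[:, :3, 3:]  # rigid transform
    # project back to source pixel coordinates
    proj = K @ pts
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)  # (B, 2, H*W)
    # normalize to [-1, 1] for grid_sample and resample the source image
    uv = uv.reshape(B, 2, H, W).permute(0, 2, 3, 1)
    uv[..., 0] = uv[..., 0] / (W - 1) * 2 - 1
    uv[..., 1] = uv[..., 1] / (H - 1) * 2 - 1
    warped = F.grid_sample(source, uv, align_corners=True)
    return (warped - target).abs().mean()          # L1 photometric error

loss = photometric_loss(
    torch.rand(1, 3, 96, 320), torch.rand(1, 3, 96, 320),
    torch.rand(1, 1, 96, 320) + 0.5, torch.eye(4).unsqueeze(0),
    torch.tensor([[200.0, 0, 160], [0, 200.0, 48], [0, 0, 1]]),
)
```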
- HUM3DIL: Semi-supervised Multi-modal 3D Human Pose Estimation for Autonomous Driving [95.42203932627102]
3D human pose estimation is an emerging technology that can enable autonomous vehicles to perceive and understand the subtle and complex behaviors of pedestrians.
Our method efficiently makes use of the complementary camera and LiDAR signals in a semi-supervised fashion and outperforms existing methods by a large margin. Specifically, we embed LiDAR points into pixel-aligned multi-modal features, which we pass through a sequence of Transformer refinement stages.
arXiv Detail & Related papers (2022-12-15T11:15:14Z)
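As a toy illustration of the pixel-aligned multi-modal features this summary mentions, the sketch below projects each LiDAR point into the image with an assumed intrinsics matrix K, bilinearly samples the image feature map at that pixel, and concatenates the result with the raw point. Names and shapes are illustrative, not HUM3DIL's code.

```python
import torch
import torch.nn.functional as F

# Project LiDAR points into the image plane, sample image features at
# the projected pixels, and fuse them with the 3D coordinates.
def pixel_aligned_features(points, feat_map, K):
    """points: (N,3) in camera frame, feat_map: (1,C,H,W), K: (3,3)."""
    _, C, H, W = feat_map.shape
    proj = K @ points.T                        # (3, N)
    uv = proj[:2] / proj[2:].clamp(min=1e-6)   # pixel coords, (2, N)
    # normalize to [-1, 1] and sample features bilinearly per point
    grid = torch.stack(
        [uv[0] / (W - 1) * 2 - 1, uv[1] / (H - 1) * 2 - 1], dim=-1
    ).view(1, 1, -1, 2)                        # (1, 1, N, 2)
    sampled = F.grid_sample(feat_map, grid, align_corners=True)
    sampled = sampled.view(C, -1).T            # (N, C) image features
    return torch.cat([points, sampled], dim=-1)  # (N, 3 + C) fused tokens

K = torch.tensor([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
fused = pixel_aligned_features(
    torch.rand(100, 3) + torch.tensor([0.0, 0.0, 5.0]),  # points in front of camera
    torch.randn(1, 64, 480, 640), K,
)
print(fused.shape)  # torch.Size([100, 67])
```

The fused per-point tokens are what a stack of Transformer refinement stages would then operate on.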
- SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation [101.55622133406446]
We propose SurroundDepth, a method that incorporates information from multiple surrounding views to predict depth maps across cameras.
Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.
In experiments, our method achieves state-of-the-art performance on challenging multi-camera depth estimation datasets.
arXiv Detail & Related papers (2022-04-07T17:58:47Z)
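A bare-bones sketch of the cross-view fusion idea in this entry: tokens from all surrounding cameras are flattened into one sequence so attention can exchange information across views. Layer sizes are assumed, and the paper's cross-view transformer is more elaborate than this.

```python
import torch
import torch.nn as nn

# Flatten per-camera feature tokens into one sequence so self-attention
# lets every view attend to every other view (cross-view fusion).
class CrossViewFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, view_tokens):
        """view_tokens: (B, V, N, D) -> fused tokens, same shape."""
        B, V, N, D = view_tokens.shape
        seq = view_tokens.reshape(B, V * N, D)  # all views in one sequence
        fused, _ = self.attn(seq, seq, seq)     # cross-view self-attention
        return fused.reshape(B, V, N, D)

fusion = CrossViewFusion()
out = fusion(torch.randn(2, 6, 100, 256))  # 6 surrounding cameras
print(out.shape)  # torch.Size([2, 6, 100, 256])
```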
- VISTA 2.0: An Open, Data-driven Simulator for Multimodal Sensing and Policy Learning for Autonomous Vehicles [131.2240621036954]
We present VISTA, an open-source, data-driven simulator that integrates multiple types of sensors for autonomous vehicles.
Using high-fidelity, real-world datasets, VISTA represents and simulates RGB cameras, 3D LiDAR, and event-based cameras.
We demonstrate the ability to train and test perception-to-control policies across each of the sensor types and showcase the power of this approach via deployment on a full-scale autonomous vehicle.
arXiv Detail & Related papers (2021-11-23T18:58:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.