Drive Anywhere: Generalizable End-to-end Autonomous Driving with
Multi-modal Foundation Models
- URL: http://arxiv.org/abs/2310.17642v1
- Date: Thu, 26 Oct 2023 17:56:35 GMT
- Title: Drive Anywhere: Generalizable End-to-end Autonomous Driving with
Multi-modal Foundation Models
- Authors: Tsun-Hsuan Wang and Alaa Maalouf and Wei Xiao and Yutong Ban and
Alexander Amini and Guy Rosman and Sertac Karaman and Daniela Rus
- Abstract summary: We present an approach to apply end-to-end open-set (any environment/scene) autonomous driving that is capable of providing driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
- Score: 114.69732301904419
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As autonomous driving technology matures, end-to-end methodologies have
emerged as a leading strategy, promising seamless integration from perception
to control via deep learning. However, existing systems grapple with challenges
such as unexpected open set environments and the complexity of black-box
models. At the same time, the evolution of deep learning introduces larger,
multimodal foundational models, offering multi-modal visual and textual
understanding. In this paper, we harness these multimodal foundation models to
enhance the robustness and adaptability of autonomous driving systems, enabling
out-of-distribution, end-to-end, multimodal, and more explainable autonomy.
Specifically, we present an approach to apply end-to-end open-set (any
environment/scene) autonomous driving that is capable of providing driving
decisions from representations queryable by image and text. To do so, we
introduce a method to extract nuanced spatial (pixel/patch-aligned) features
from transformers to enable the encapsulation of both spatial and semantic
features. Our approach (i) demonstrates unparalleled results in diverse tests
while achieving significantly greater robustness in out-of-distribution
situations, and (ii) allows the incorporation of latent space simulation (via
text) for improved training (data augmentation via text) and policy debugging.
We encourage the reader to check our explainer video at
https://www.youtube.com/watch?v=4n-DJf8vXxo&feature=youtu.be and to view the
code and demos on our project webpage at https://drive-anywhere.github.io/.
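As a concrete illustration of what "pixel/patch-aligned, text-queryable" features can look like, the sketch below pulls patch tokens from an off-the-shelf CLIP ViT-B/32 via Hugging Face transformers, scores them against text queries, and swaps matched patches with a substitute concept's text embedding as a crude stand-in for text-driven latent-space simulation. The backbone choice, the projection trick, and the threshold are assumptions for illustration only, not the authors' implementation (see the project webpage for the actual code).

```python
# Illustrative sketch (not the paper's implementation): patch-aligned features
# from a CLIP vision transformer, scored against text queries, plus a
# hypothetical text-driven patch swap for latent augmentation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"   # assumed backbone choice
model = CLIPModel.from_pretrained(model_name).eval()
processor = CLIPProcessor.from_pretrained(model_name)

image = Image.new("RGB", (224, 224))          # stand-in for a camera frame
queries = ["a tree", "a pedestrian", "a building"]
inputs = processor(text=queries, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # Patch tokens from the vision tower (drop the CLS token), projected into
    # the shared image-text space. CLIP's projection is trained on the pooled
    # token only, so treating per-patch projections as dense features is an
    # approximation.
    vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
    patches = vision_out.last_hidden_state[:, 1:, :]             # (1, 49, 768)
    patches = model.visual_projection(
        model.vision_model.post_layernorm(patches))              # (1, 49, 512)
    patches = patches / patches.norm(dim=-1, keepdim=True)

    text = model.get_text_features(input_ids=inputs["input_ids"],
                                   attention_mask=inputs["attention_mask"])
    text = text / text.norm(dim=-1, keepdim=True)                # (3, 512)

    # Per-patch similarity to each text query: a 7x7 "queryable" feature map.
    sim = patches @ text.T                                       # (1, 49, 3)
    tree_map = sim[0, :, 0].reshape(7, 7)

    # Hypothetical text-driven latent augmentation: replace patches matching
    # "a tree" with the embedding of "a building" before feeding the feature
    # map to a downstream driving policy. The 0.25 threshold is arbitrary.
    augmented = patches.clone()
    augmented[0, sim[0, :, 0] > 0.25] = text[2]
```

A downstream driving policy would consume the (possibly augmented) per-patch features rather than raw pixels, which is what makes the representation both spatially grounded and queryable by open-set text concepts.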
Related papers
- Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives [56.2139730920855]
We present a systematic analysis of MM-VUFMs specifically designed for road scenes.
Our objective is to provide a comprehensive overview of common practices, covering task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques.
We provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
arXiv Detail & Related papers (2024-02-05T12:47:09Z)
- LLM4Drive: A Survey of Large Language Models for Autonomous Driving [62.10344445241105]
Large language models (LLMs) have demonstrated abilities including context understanding, logical reasoning, and answer generation.
In this paper, we systematically review the research line of Large Language Models for Autonomous Driving (LLM4AD).
arXiv Detail & Related papers (2023-11-02T07:23:33Z)
- End-to-end Autonomous Driving: Challenges and Frontiers [45.391430626264764]
We provide a comprehensive analysis of more than 270 papers, covering the motivation, roadmap, methodology, challenges, and future trends in end-to-end autonomous driving.
We delve into several critical challenges, including multi-modality, interpretability, causal confusion, robustness, and world models, amongst others.
We discuss current advancements in foundation models and visual pre-training, as well as how to incorporate these techniques within the end-to-end driving framework.
arXiv Detail & Related papers (2023-06-29T14:17:24Z)
- Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
- MultiPath++: Efficient Information Fusion and Trajectory Aggregation for Behavior Prediction [42.563865078323204]
We present MultiPath++, a future prediction model that achieves state-of-the-art performance on popular benchmarks, including the Argoverse Motion Forecasting Competition and the Open Motion Prediction Challenge.
arXiv Detail & Related papers (2021-11-29T21:36:53Z)
- Towards Optimal Strategies for Training Self-Driving Perception Models in Simulation [98.51313127382937]
We focus on the use of labels in the synthetic domain alone.
Our approach introduces both a way to learn neural-invariant representations and a theoretically inspired view on how to sample the data from the simulator.
We showcase our approach on the bird's-eye-view vehicle segmentation task with multi-sensor data.
arXiv Detail & Related papers (2021-11-15T18:37:43Z)
- ML-PersRef: A Machine Learning-based Personalized Multimodal Fusion Approach for Referencing Outside Objects From a Moving Vehicle [0.0]
We propose a learning-based multimodal fusion approach for referencing outside-the-vehicle objects while maintaining a long driving route in a simulated environment.
We also demonstrate possible ways to exploit behavioral differences between users when completing the referencing task to realize an adaptable personalized system for each driver.
arXiv Detail & Related papers (2021-11-03T16:22:17Z)
- SMARTS: Scalable Multi-Agent Reinforcement Learning Training School for Autonomous Driving [96.50297622371457]
Multi-agent interaction is a fundamental aspect of autonomous driving in the real world.
Despite more than a decade of research and development, the problem of how to interact with diverse road users in diverse scenarios remains largely unsolved.
We develop a dedicated simulation platform called SMARTS that generates diverse and competent driving interactions.
arXiv Detail & Related papers (2020-10-19T18:26:10Z)
- Interpretable End-to-end Urban Autonomous Driving with Latent Deep Reinforcement Learning [32.97789225998642]
We propose an interpretable deep reinforcement learning method for end-to-end autonomous driving.
A sequential latent environment model is introduced and learned jointly with the reinforcement learning process.
Our method is able to provide a better explanation of how the car reasons about the driving environment.
arXiv Detail & Related papers (2020-01-23T18:36:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site makes no guarantees about the quality of this information and is not responsible for any consequences of its use.