DriveMM: All-in-One Large Multimodal Model for Autonomous Driving
- URL: http://arxiv.org/abs/2412.07689v3
- Date: Fri, 13 Dec 2024 08:13:44 GMT
- Title: DriveMM: All-in-One Large Multimodal Model for Autonomous Driving
- Authors: Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, Lin Ma,
- Abstract summary: DriveMM is a large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of autonomous driving tasks.<n>We conduct evaluations on six public benchmarks and undertake zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks.
- Score: 63.882827922267666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite the advancements, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting their overall capabilities and ability to generalize. To bridge these gaps, we propose DriveMM, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD-related datasets to fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To assess the general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks. We hope DriveMM as a promising solution for future end-to-end autonomous driving applications in the real world. Project page with code: https://github.com/zhijian11/DriveMM.
Related papers
- LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving [9.447298958886265]
Vision-Language Models (VLMs) have demonstrated significant potential for end-to-end autonomous driving.
We introduce LightEMMA, a Lightweight End-to-End Multimodal Model for Autonomous driving.
We construct twelve autonomous driving agents using various VLMs and evaluate their performance on the nuScenes prediction task.
arXiv Detail & Related papers (2025-05-01T04:12:41Z) - DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding [76.3876070043663]
We propose DriveLMM-o1, a dataset and benchmark designed to advance step-wise visual reasoning for autonomous driving.
Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning.
Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model.
arXiv Detail & Related papers (2025-03-13T17:59:01Z) - The Role of World Models in Shaping Autonomous Driving: A Comprehensive Survey [50.62538723793247]
Driving World Model (DWM) focuses on predicting scene evolution during the driving process.
DWM methods enable autonomous driving systems to better perceive, understand, and interact with dynamic driving environments.
arXiv Detail & Related papers (2025-02-14T18:43:15Z) - OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving [9.052643884249113]
We propose OpenEMMA, an open-source end-to-end framework based on Multimodal Large Language Models (MLLMs)
By incorporating the Chain-of-Thought reasoning process, OpenEMMA achieves significant improvements compared to the baseline.
OpenEMMA demonstrates effectiveness, generalizability, and robustness across a variety of challenging driving scenarios.
arXiv Detail & Related papers (2024-12-19T18:59:40Z) - GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control [122.65089441381741]
We present GEM, a Generalizable Ego-vision Multimodal world model.<n>It predicts future frames using a reference frame, sparse features, human poses, and ego-trajectories.<n>Our dataset is comprised of 4000+ hours of multimodal data across domains like autonomous driving, egocentric human activities, and drone flights.
arXiv Detail & Related papers (2024-12-15T14:21:19Z) - EMMA: End-to-End Multimodal Model for Autonomous Driving [56.972452552944056]
We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving.
Built on a multi-modal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs.
arXiv Detail & Related papers (2024-10-30T17:46:31Z) - Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving [0.0]
We develop an efficient, lightweight, multi-frame vision language model which performs Visual Question Answering for autonomous driving.
In comparison to previous approaches, EM-VLM4AD requires at least 10 times less memory and floating point operations.
arXiv Detail & Related papers (2024-03-28T21:18:33Z) - GenAD: Generalized Predictive Model for Autonomous Driving [75.39517472462089]
We introduce the first large-scale video prediction model in the autonomous driving discipline.
Our model, dubbed GenAD, handles the challenging dynamics in driving scenes with novel temporal reasoning blocks.
It can be adapted into an action-conditioned prediction model or a motion planner, holding great potential for real-world driving applications.
arXiv Detail & Related papers (2024-03-14T17:58:33Z) - Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z) - An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents.
Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction.
We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
arXiv Detail & Related papers (2024-02-08T18:58:02Z) - Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives [56.2139730920855]
We present a systematic analysis of MM-VUFMs specifically designed for road scenes.
Our objective is to provide a comprehensive overview of common practices, referring to task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques.
We provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
arXiv Detail & Related papers (2024-02-05T12:47:09Z) - DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral
Planning States for Autonomous Driving [69.82743399946371]
DriveMLM is a framework that can perform close-loop autonomous driving in realistic simulators.
We employ a multi-modal LLM (MLLM) to model the behavior planning module of a module AD system.
This model can plug-and-play in existing AD systems such as Apollo for close-loop driving.
arXiv Detail & Related papers (2023-12-14T18:59:05Z) - ADriver-I: A General World Model for Autonomous Driving [23.22507419707926]
We introduce the concept of interleaved vision-action pair, which unifies the format of visual features and control signals.
Based on the vision-action pairs, we construct a general world model based on MLLM and diffusion model for autonomous driving, termed ADriver-I.
It takes the vision-action pairs as inputs and autoregressively predicts the control signal of the current frame.
arXiv Detail & Related papers (2023-11-22T17:44:29Z) - CERBERUS: Simple and Effective All-In-One Automotive Perception Model
with Multi Task Learning [4.622165486890318]
In-vehicle embedded computing platforms cannot cope with the computational effort required to run a heavy model for each individual task.
We present CERBERUS, a lightweight model that leverages a multitask-learning approach to enable the execution of multiple perception tasks at the cost of a single inference.
arXiv Detail & Related papers (2022-10-03T08:17:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.