DriveMM: All-in-One Large Multimodal Model for Autonomous Driving
- URL: http://arxiv.org/abs/2412.07689v3
- Date: Fri, 13 Dec 2024 08:13:44 GMT
- Title: DriveMM: All-in-One Large Multimodal Model for Autonomous Driving
- Authors: Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, Lin Ma
- Abstract summary: DriveMM is a large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of autonomous driving tasks. We conduct evaluations on six public benchmarks and undertake zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks.
- Score: 63.882827922267666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite these advances, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting their overall capabilities and ability to generalize. To bridge these gaps, we propose DriveMM, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD-related datasets to fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To assess its general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks. We hope DriveMM will serve as a promising solution for future end-to-end autonomous driving applications in the real world. Project page with code: https://github.com/zhijian11/DriveMM.
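The abstract's two-stage recipe (curriculum pre-training on varied visual signals, then all-in-one fine-tuning on augmented, standardized AD datasets) can be pictured roughly as below. This is a minimal sketch, not the released code: the model interface, dataset names, batch size, and learning rates are all assumptions.

```python
# Minimal sketch of a two-stage curriculum: (1) pre-training on general visual
# comprehension data, (2) all-in-one fine-tuning on pooled, standardized AD tasks.
# Model interface and dataset names below are hypothetical placeholders.
import torch
from torch.utils.data import ConcatDataset, DataLoader

def run_stage(model, datasets, epochs, lr):
    """Train the LMM on a mixture of datasets for one curriculum stage."""
    loader = DataLoader(ConcatDataset(datasets), batch_size=8, shuffle=True)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:                # batch: dict of prompts + visual inputs (assumed)
            loss = model(**batch).loss      # assumes an HF-style forward returning .loss
            loss.backward()
            optim.step()
            optim.zero_grad()
    return model

# Stage 1: basic visual comprehension/perception on images and multi-view videos.
# model = run_stage(model, [image_caption_ds, video_qa_ds], epochs=1, lr=2e-5)
# Stage 2: all-in-one fine-tuning on standardized perception/prediction/planning data.
# model = run_stage(model, [drive_qa_ds, multiview_planning_ds], epochs=1, lr=1e-5)
```

In this reading, only the data mixture changes between stages, which is what lets a single LMM cover perception, prediction, and planning across benchmarks.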
Related papers
- LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving [9.447298958886265]
Vision-Language Models (VLMs) have demonstrated significant potential for end-to-end autonomous driving.
We introduce LightEMMA, a Lightweight End-to-End Multimodal Model for Autonomous driving.
We construct twelve autonomous driving agents using various VLMs and evaluate their performance on the nuScenes prediction task.
arXiv Detail & Related papers (2025-05-01T04:12:41Z)
- DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding [76.3876070043663]
We propose DriveLMM-o1, a dataset and benchmark designed to advance step-wise visual reasoning for autonomous driving.
Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning.
Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model.
arXiv Detail & Related papers (2025-03-13T17:59:01Z)
- The Role of World Models in Shaping Autonomous Driving: A Comprehensive Survey [50.62538723793247]
Driving World Model (DWM) focuses on predicting scene evolution during the driving process.
DWM methods enable autonomous driving systems to better perceive, understand, and interact with dynamic driving environments.
arXiv Detail & Related papers (2025-02-14T18:43:15Z)
- OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving [9.052643884249113]
We propose OpenEMMA, an open-source end-to-end framework based on Multimodal Large Language Models (MLLMs).
By incorporating the Chain-of-Thought reasoning process, OpenEMMA achieves significant improvements compared to the baseline.
OpenEMMA demonstrates effectiveness, generalizability, and robustness across a variety of challenging driving scenarios.
arXiv Detail & Related papers (2024-12-19T18:59:40Z)
- EMMA: End-to-End Multimodal Model for Autonomous Driving [56.972452552944056]
We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving.
Built on a multi-modal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs.
arXiv Detail & Related papers (2024-10-30T17:46:31Z)
- Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving [0.0]
We develop EM-VLM4AD, an efficient, lightweight, multi-frame vision-language model that performs Visual Question Answering for autonomous driving.
Compared to previous approaches, EM-VLM4AD requires at least 10 times less memory and fewer floating-point operations.
arXiv Detail & Related papers (2024-03-28T21:18:33Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which map visual features to probability distributions over Large Multi-modal Models' vocabulary (see the sketch after this list).
We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives [56.2139730920855]
We present a systematic analysis of MM-VUFMs specifically designed for road scenes.
Our objective is to provide a comprehensive overview of common practices, referring to task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques.
We provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
arXiv Detail & Related papers (2024-02-05T12:47:09Z)
- DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving [69.82743399946371]
DriveMLM is a framework that can perform closed-loop autonomous driving in realistic simulators.
We employ a multi-modal LLM (MLLM) to model the behavior planning module of a modular AD system.
The model can be plugged into existing AD systems such as Apollo for closed-loop driving.
arXiv Detail & Related papers (2023-12-14T18:59:05Z)
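The "visual tokens" idea from the Multi-modal Auto-regressive Modeling via Visual Words entry above (mapping visual features to probability distributions over an LMM's vocabulary) can be illustrated with a small projection-plus-softmax head. This is an illustrative sketch only, not that paper's implementation; the feature dimension and vocabulary size are placeholders.

```python
import torch
import torch.nn as nn

class VisualWordHead(nn.Module):
    """Illustrative only: project visual features into a probability
    distribution over an LMM's text vocabulary (hypothetical sizes)."""
    def __init__(self, vis_dim=1024, vocab_size=32000):
        super().__init__()
        self.proj = nn.Linear(vis_dim, vocab_size)

    def forward(self, vis_feats):                 # vis_feats: (batch, tokens, vis_dim)
        logits = self.proj(vis_feats)
        return logits.softmax(dim=-1)             # distribution over the vocabulary

# probs = VisualWordHead()(torch.randn(1, 576, 1024))  # e.g. 576 patch features
```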
This list is automatically generated from the titles and abstracts of the papers on this site.