OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving
- URL: http://arxiv.org/abs/2412.15208v2
- Date: Fri, 14 Feb 2025 23:22:47 GMT
- Title: OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving
- Authors: Shuo Xing, Chengyuan Qian, Yuping Wang, Hongyuan Hua, Kexin Tian, Yang Zhou, Zhengzhong Tu
- Abstract summary: We propose OpenEMMA, an open-source end-to-end framework based on Multimodal Large Language Models (MLLMs).
By incorporating the Chain-of-Thought reasoning process, OpenEMMA achieves significant improvements compared to the baseline.
OpenEMMA demonstrates effectiveness, generalizability, and robustness across a variety of challenging driving scenarios.
- Score: 9.052643884249113
- License:
- Abstract: Since their advent, Multimodal Large Language Models (MLLMs) have made a significant impact across a wide range of real-world applications, particularly in Autonomous Driving (AD). Their ability to process complex visual data and reason about intricate driving scenarios has paved the way for a new paradigm in end-to-end AD systems. However, progress on developing end-to-end models for AD has been slow, as existing fine-tuning methods demand substantial resources, including extensive computational power, large-scale datasets, and significant funding. Drawing inspiration from recent advancements in inference computing, we propose OpenEMMA, an open-source end-to-end framework based on MLLMs. By incorporating the Chain-of-Thought reasoning process, OpenEMMA achieves significant improvements over the baseline when leveraging a diverse range of MLLMs. Furthermore, OpenEMMA demonstrates effectiveness, generalizability, and robustness across a variety of challenging driving scenarios, offering a more efficient and effective approach to autonomous driving. We release all code at https://github.com/taco-group/OpenEMMA.
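The abstract's core recipe is to prompt an off-the-shelf MLLM with a Chain-of-Thought instruction about the driving scene and decode a drivable trajectory from its textual answer. The sketch below illustrates how such a loop could look; the prompt wording, the `query_mllm` placeholder, the speed/curvature output format, and the unicycle-style waypoint integration are illustrative assumptions rather than the repository's actual interface (see the linked GitHub for the real implementation).

```python
import math
from typing import Callable, List, Tuple

# Hypothetical chain-of-thought driving prompt: the MLLM is asked to reason
# about the scene before emitting future speed/curvature pairs. The wording
# and output format are illustrative assumptions, not OpenEMMA's exact prompt.
COT_PROMPT = (
    "Given the front-camera image and the ego vehicle's recent speeds and "
    "curvatures, first describe the scene, then list critical objects and "
    "their likely motion, then state your driving intent, and finally output "
    "future (speed m/s, curvature 1/m) pairs for the next few seconds, one "
    "pair per line formatted as: speed, curvature"
)

def parse_speed_curvature(answer: str) -> List[Tuple[float, float]]:
    """Pull 'speed, curvature' pairs out of the model's free-form answer."""
    pairs = []
    for line in answer.splitlines():
        parts = line.replace("(", "").replace(")", "").split(",")
        if len(parts) != 2:
            continue  # skip the reasoning text, keep only numeric pairs
        try:
            pairs.append((float(parts[0]), float(parts[1])))
        except ValueError:
            continue
    return pairs

def integrate_trajectory(pairs, dt: float = 0.5) -> List[Tuple[float, float]]:
    """Roll speed/curvature out into ego-frame waypoints (unicycle model)."""
    x = y = heading = 0.0
    waypoints = []
    for speed, curvature in pairs:
        heading += speed * curvature * dt
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
        waypoints.append((x, y))
    return waypoints

def plan(image_path: str, ego_history: str,
         query_mllm: Callable[[str, str], str]) -> List[Tuple[float, float]]:
    """query_mllm(image_path, prompt) -> text; a placeholder for any MLLM backend."""
    answer = query_mllm(image_path, f"{COT_PROMPT}\nEgo history: {ego_history}")
    return integrate_trajectory(parse_speed_curvature(answer))
```

Under these assumptions, swapping backbone MLLMs only changes the `query_mllm` callable, which mirrors the abstract's claim that the Chain-of-Thought gains hold across a diverse range of MLLMs.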
Related papers
- Application of Multimodal Large Language Models in Autonomous Driving [1.8181868280594944]
We conduct an in-depth study on implementing Multi-modal Large Language Models.
We address the poor performance of MLLMs on Autonomous Driving.
We then break the AD decision-making process down into scene understanding, prediction, and decision-making.
arXiv Detail & Related papers (2024-12-21T00:09:52Z)
- DriveMM: All-in-One Large Multimodal Model for Autonomous Driving [63.882827922267666]
DriveMM is a large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of autonomous driving tasks.
We conduct evaluations on six public benchmarks and undertake zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks.
arXiv Detail & Related papers (2024-12-10T17:27:32Z)
- LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z)
- Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision [9.03028904066824]
We introduce a novel approach to address Multimodal Open-Set Domain Generalization for the first time, utilizing self-supervision.
We propose two innovative multimodal self-supervised pretext tasks: Masked Cross-modal Translation and Multimodal Jigsaw Puzzles.
We extend our approach to also tackle the Multimodal Open-Set Domain Adaptation problem, especially in scenarios where unlabeled data from the target domain is available.
arXiv Detail & Related papers (2024-07-01T17:59:09Z)
- CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.
We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z)
- MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples [63.78384552789171]
This paper introduces Multi-Modal In-Context Tuning (MMICT), a novel multi-modal fine-tuning paradigm.
We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives.
Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on the textual-guided visual features.
arXiv Detail & Related papers (2023-12-11T13:11:04Z)
- A Survey on Multimodal Large Language Models for Autonomous Driving [31.614730391949657]
Multimodal AI systems benefiting from large models have the potential to perceive the real world, make decisions, and control tools as humans do.
Despite this immense potential, there is still a lack of comprehensive understanding of the key challenges, opportunities, and future directions for applying Multimodal Large Language Models to driving systems.
arXiv Detail & Related papers (2023-11-21T03:32:01Z)
- Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models [114.69732301904419]
We present an approach to end-to-end, open-set (any environment/scene) autonomous driving that is capable of providing driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z)
- OmniForce: On Human-Centered, Large Model Empowered and Cloud-Edge Collaborative AutoML System [85.8338446357469]
We introduce OmniForce, a human-centered AutoML system that yields both human-assisted ML and ML-assisted human techniques.
We show how OmniForce can put an AutoML system into practice and build adaptive AI in open-environment scenarios.
arXiv Detail & Related papers (2023-03-01T13:35:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.