DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving
- URL: http://arxiv.org/abs/2411.13112v2
- Date: Tue, 26 Nov 2024 07:24:04 GMT
- Title: DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving
- Authors: Xianda Guo, Ruijun Zhang, Yiqun Duan, Yuhang He, Chenming Zhang, Shuai Liu, Long Chen
- Abstract summary: We introduce DriveMLLM, a benchmark designed to evaluate the spatial understanding capabilities of multimodal large language models (MLLMs) in autonomous driving.
DriveMLLM includes 880 front-facing camera images and introduces both absolute and relative spatial reasoning tasks, accompanied by linguistically diverse natural language questions.
We evaluate several state-of-the-art MLLMs on DriveMLLM, and our results reveal the limitations of current models in understanding complex spatial relationships in driving contexts.
- Score: 13.115027801151484
- Abstract: Autonomous driving requires a comprehensive understanding of 3D environments to facilitate high-level tasks such as motion prediction, planning, and mapping. In this paper, we introduce DriveMLLM, a benchmark specifically designed to evaluate the spatial understanding capabilities of multimodal large language models (MLLMs) in autonomous driving. DriveMLLM includes 880 front-facing camera images and introduces both absolute and relative spatial reasoning tasks, accompanied by linguistically diverse natural language questions. To measure MLLMs' performance, we propose novel evaluation metrics focusing on spatial understanding. We evaluate several state-of-the-art MLLMs on DriveMLLM, and our results reveal the limitations of current models in understanding complex spatial relationships in driving contexts. We believe these findings underscore the need for more advanced MLLM-based spatial reasoning methods and highlight the potential for DriveMLLM to drive further research in autonomous driving. Code will be available at https://github.com/XiandaGuo/Drive-MLLM.
Related papers
- LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from a large-scale MLLM (l-MLLM) to a small-scale MLLM (s-MLLM).
Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM.
We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
arXiv Detail & Related papers (2024-10-21T17:41:28Z)
- Large Language Models for Autonomous Driving (LLM4AD): Concept, Benchmark, Simulation, and Real-Vehicle Experiment [15.52530518623987]
Large Language Models (LLMs) have the potential to enhance various aspects of autonomous driving systems.
This paper introduces novel concepts and approaches to designing LLMs for autonomous driving (LLM4AD).
arXiv Detail & Related papers (2024-10-20T04:36:19Z)
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z)
- Probing Multimodal LLMs as World Models for Driving [72.18727651074563]
We look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving.
Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored.
arXiv Detail & Related papers (2024-05-09T17:52:42Z)
- Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models [76.99140362751787]
We present NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks.
We also present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View features.
arXiv Detail & Related papers (2024-01-02T01:54:22Z)
- DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving [69.82743399946371]
DriveMLM is a framework that can perform closed-loop autonomous driving in realistic simulators.
We employ a multi-modal LLM (MLLM) to model the behavior planning module of a modular AD system.
This model can be plugged into existing AD systems such as Apollo for closed-loop driving.
arXiv Detail & Related papers (2023-12-14T18:59:05Z)
- Evaluation of Large Language Models for Decision Making in Autonomous Driving [4.271294502084542]
One strategy of using Large Language Models (LLMs) for autonomous driving involves inputting surrounding objects as text prompts to the LLMs.
When using LLMs for such purposes, capabilities such as spatial recognition and planning are essential.
This study quantitatively evaluated these two abilities of LLMs in the context of autonomous driving.
arXiv Detail & Related papers (2023-12-11T12:56:40Z)
- Enhancing the Spatial Awareness Capability of Multi-Modal Large Language Model [25.86351431223383]
The Multi-Modal Large Language Model (MLLM) is an extension of the Large Language Model (LLM) equipped with the capability to receive and infer multi-modal data.
This paper proposes using more precise spatial position information between objects to guide MLLM in providing more accurate responses to user-related inquiries.
arXiv Detail & Related papers (2023-10-31T10:57:35Z)
- LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving [87.1164964709168]
This work employs Large Language Models (LLMs) as a decision-making component for complex autonomous driving scenarios.
Extensive experiments demonstrate that our proposed method not only consistently surpasses baseline approaches in single-vehicle tasks, but also helps handle complex driving behaviors, including multi-vehicle coordination.
arXiv Detail & Related papers (2023-10-04T17:59:49Z)
- Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving [6.728693243652425]
Large Language Models (LLMs) have shown promise in the autonomous driving sector, particularly in generalization and interpretability.
We introduce a unique object-level multimodal LLM architecture that merges vectorized numeric modalities with a pre-trained LLM to improve context understanding in driving situations.
arXiv Detail & Related papers (2023-10-03T11:05:14Z)