HiLM-D: Towards High-Resolution Understanding in Multimodal Large
Language Models for Autonomous Driving
- URL: http://arxiv.org/abs/2309.05186v1
- Date: Mon, 11 Sep 2023 01:24:13 GMT
- Title: HiLM-D: Towards High-Resolution Understanding in Multimodal Large
Language Models for Autonomous Driving
- Authors: Xinpeng Ding, Jianhua Han, Hang Xu, Wei Zhang, Xiaomeng Li
- Abstract summary: HiLM-D is an efficient method to incorporate HR information into MLLMs for the ROLISP task.
Our experiments reveal HiLM-D's notable advantage over leading MLLMs, with improvements of 4.8% in BLEU-4 for captioning and 17.2% in mIoU for detection.
- Score: 47.274696401306514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous driving systems generally employ separate models for different
tasks, resulting in intricate designs. For the first time, we leverage a single
multimodal large language model (MLLM) to consolidate multiple autonomous
driving tasks from videos, i.e., the Risk Object Localization and Intention and
Suggestion Prediction (ROLISP) task. ROLISP uses natural language to
simultaneously identify and interpret risk objects, understand ego-vehicle
intentions, and provide motion suggestions, eliminating the necessity for
task-specific architectures. However, lacking high-resolution (HR) information,
existing MLLMs often miss small objects (e.g., traffic cones) and overly focus
on salient ones (e.g., large trucks) when applied to ROLISP. We propose HiLM-D
(Towards High-Resolution Understanding in MLLMs for Autonomous Driving), an
efficient method to incorporate HR information into MLLMs for the ROLISP task.
Specifically, HiLM-D integrates two branches: (i) the low-resolution reasoning
branch, which can be any MLLM, processes low-resolution videos to caption risk
objects and discern ego-vehicle intentions/suggestions; (ii) the
high-resolution perception branch (HR-PB), specific to HiLM-D, ingests HR
images to enhance detection by capturing vision-specific HR feature maps and
prioritizing all potential risks over merely salient objects. Our HR-PB serves
as a plug-and-play module, seamlessly fitting into current MLLMs. Experiments
on the ROLISP benchmark reveal HiLM-D's notable advantage over leading MLLMs,
with improvements of 4.8% in BLEU-4 for captioning and 17.2% in mIoU for
detection.
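
The two-branch design described in the abstract can be pictured with a short PyTorch sketch: a low-resolution reasoning branch that tokenizes a low-resolution clip (standing in for any MLLM's vision front end), and a high-resolution perception branch that extracts an HR feature map and injects it into the LR tokens. This is only a minimal illustration of the stated idea; all class names, patch sizes, and the cross-attention fusion below are assumptions, not the released HiLM-D implementation.

```python
# Minimal sketch of a dual-branch (LR reasoning + HR perception) layout.
# Module names, shapes, and fusion strategy are illustrative assumptions.
import torch
import torch.nn as nn


class LowResReasoningBranch(nn.Module):
    """Stand-in for 'any MLLM' vision front end: encodes low-resolution
    video frames into tokens a language model would reason over."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16),  # patchify 224x224 frames
            nn.Flatten(2),                                 # (B*T, dim, N_patches)
        )

    def forward(self, video_lr: torch.Tensor) -> torch.Tensor:
        # video_lr: (B, T, 3, 224, 224) low-resolution clip
        b, t = video_lr.shape[:2]
        feats = self.frame_encoder(video_lr.flatten(0, 1))           # (B*T, dim, N)
        return feats.transpose(1, 2).reshape(b, -1, feats.size(1))   # (B, T*N, dim)


class HighResPerceptionBranch(nn.Module):
    """Plug-and-play branch: extracts a vision-specific feature map from one
    HR frame so small objects are not lost at low resolution."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=32, stride=32),  # coarse HR feature map
            nn.GELU(),
        )
        # Cross-attention lets LR tokens query HR detail (one possible fusion).
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, lr_tokens: torch.Tensor, image_hr: torch.Tensor) -> torch.Tensor:
        hr_map = self.backbone(image_hr)                   # (B, dim, H', W')
        hr_tokens = hr_map.flatten(2).transpose(1, 2)      # (B, H'*W', dim)
        enhanced, _ = self.fuse(lr_tokens, hr_tokens, hr_tokens)
        return lr_tokens + enhanced                        # residual enhancement


if __name__ == "__main__":
    lr_branch, hr_branch = LowResReasoningBranch(), HighResPerceptionBranch()
    video_lr = torch.randn(1, 4, 3, 224, 224)   # 4 low-resolution frames
    image_hr = torch.randn(1, 3, 1024, 1024)    # one high-resolution frame
    tokens = hr_branch(lr_branch(video_lr), image_hr)
    print(tokens.shape)  # (1, 784, 256) -- tokens an MLLM head could consume
```

In this reading, the HR-PB is "plug-and-play" because it only enriches whatever token sequence the host MLLM already produces, leaving the reasoning branch unchanged.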
Related papers
- LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from l-MLLM to s-MLLM.
Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM.
We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
arXiv Detail & Related papers (2024-10-21T17:41:28Z)
- A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically sorts out the applications of MLLM in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z)
- Tell Me Where You Are: Multimodal LLMs Meet Place Recognition [11.421492098416538]
We introduce multimodal large language models (MLLMs) to visual place recognition (VPR).
Our key design is to use vision-based retrieval to propose several candidates and then leverage language-based reasoning to carefully inspect each candidate for a final decision.
Our results on three datasets demonstrate that integrating the general-purpose visual features from VFMs with the reasoning capabilities of MLLMs already provides an effective place recognition solution.
arXiv Detail & Related papers (2024-06-25T12:59:46Z)
- Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z)
- Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models [76.99140362751787]
We present NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks.
We also present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View features.
arXiv Detail & Related papers (2024-01-02T01:54:22Z)
- Enhancing the Spatial Awareness Capability of Multi-Modal Large Language Model [25.86351431223383]
The Multi-Modal Large Language Model (MLLM) is an extension of the Large Language Model (LLM) equipped with the capability to receive and infer multi-modal data.
This paper proposes using more precise spatial position information between objects to guide MLLM in providing more accurate responses to user-related inquiries.
arXiv Detail & Related papers (2023-10-31T10:57:35Z)
- LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving [87.1164964709168]
This work employs Large Language Models (LLMs) as a decision-making component for complex autonomous driving scenarios.
Extensive experiments demonstrate that our proposed method not only consistently surpasses baseline approaches in single-vehicle tasks, but also helps handle complex driving behaviors, including multi-vehicle coordination.
arXiv Detail & Related papers (2023-10-04T17:59:49Z)