HiLM-D: Towards High-Resolution Understanding in Multimodal Large
Language Models for Autonomous Driving
- URL: http://arxiv.org/abs/2309.05186v1
- Date: Mon, 11 Sep 2023 01:24:13 GMT
- Title: HiLM-D: Towards High-Resolution Understanding in Multimodal Large
Language Models for Autonomous Driving
- Authors: Xinpeng Ding, Jianhua Han, Hang Xu, Wei Zhang, Xiaomeng Li
- Abstract summary: HiLM-D is an efficient method to incorporate HR information into MLLMs for the ROLISP task.
Our experiments reveal HiLM-D's notable advantage over leading MLLMs, with improvements of 4.8% in BLEU-4 for captioning and 17.2% in mIoU for detection.
- Score: 47.274696401306514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous driving systems generally employ separate models for different
tasks, resulting in intricate designs. For the first time, we leverage a single
multimodal large language model (MLLM) to consolidate multiple autonomous
driving tasks from videos, i.e., the Risk Object Localization and Intention and
Suggestion Prediction (ROLISP) task. ROLISP uses natural language to
simultaneously identify and interpret risk objects, understand ego-vehicle
intentions, and provide motion suggestions, eliminating the necessity for
task-specific architectures. However, lacking high-resolution (HR) information,
existing MLLMs often miss small objects (e.g., traffic cones) and overly focus
on salient ones (e.g., large trucks) when applied to ROLISP. We propose HiLM-D
(Towards High-Resolution Understanding in MLLMs for Autonomous Driving), an
efficient method to incorporate HR information into MLLMs for the ROLISP task.
Specifically, HiLM-D integrates two branches: (i) the low-resolution reasoning
branch, which can be any MLLM, processes low-resolution videos to caption risk
objects and discern ego-vehicle intentions/suggestions; (ii) the
high-resolution perception branch (HR-PB), unique to HiLM-D, ingests HR
images to enhance detection by capturing vision-specific HR feature maps and
prioritizing all potential risks over merely salient objects. Our HR-PB serves
as a plug-and-play module, seamlessly fitting into current MLLMs. Experiments
on the ROLISP benchmark reveal HiLM-D's notable advantage over leading MLLMs,
with improvements of 4.8% in BLEU-4 for captioning and 17.2% in mIoU for
detection.
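The two-branch layout described in the abstract can be sketched roughly as follows. This is a hypothetical illustration only, not the paper's implementation: the conv stem, the cross-attention fusion, and all names (`HRPerceptionBranch`, `HiLMDSketch`, `fuse`) are assumptions standing in for the paper's actual HR feature extractor and integration mechanism.

```python
import torch
import torch.nn as nn

class HRPerceptionBranch(nn.Module):
    """Hypothetical sketch of an HR perception branch (HR-PB): extracts
    vision-specific feature maps from a high-resolution frame that are
    later fused into the MLLM's low-resolution reasoning stream."""
    def __init__(self, dim: int = 256):
        super().__init__()
        # A small conv stem stands in for the paper's HR feature extractor.
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, hr_frame: torch.Tensor) -> torch.Tensor:
        # hr_frame: (B, 3, H, W) high-resolution image
        feats = self.stem(hr_frame)              # (B, dim, H/8, W/8)
        return feats.flatten(2).transpose(1, 2)  # (B, tokens, dim)

class HiLMDSketch(nn.Module):
    """Two-branch layout: any MLLM encoder handles low-resolution reasoning,
    while the plug-and-play HR-PB supplies extra high-resolution tokens."""
    def __init__(self, mllm_encoder: nn.Module, dim: int = 256):
        super().__init__()
        self.mllm_encoder = mllm_encoder  # low-resolution reasoning branch
        self.hr_branch = HRPerceptionBranch(dim)
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, lr_video_tokens: torch.Tensor, hr_frame: torch.Tensor):
        lr = self.mllm_encoder(lr_video_tokens)  # (B, T, dim) reasoning tokens
        hr = self.hr_branch(hr_frame)            # (B, N, dim) HR tokens
        # Cross-attend reasoning tokens to HR features; this fusion choice
        # is an assumption of the sketch, not taken from the paper.
        fused, _ = self.fuse(lr, hr, hr)
        return lr + fused
```

The plug-and-play aspect shows up in the constructor: `mllm_encoder` is an arbitrary module, so the HR branch can in principle be bolted onto any existing MLLM without modifying it.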
Related papers
- MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics? [33.573056018368504]
This study introduces MMRo, the first benchmark for evaluating whether Multimodal LLMs are eligible to serve as the brain for in-home robotics.
We identify four essential capabilities (perception, task planning, visual reasoning, and safety measurement) that MLLMs must possess to qualify as the robot's central processing unit.
Our findings indicate that no single model excels in all areas, suggesting that current MLLMs are not yet trustworthy enough to serve as the cognitive core for robots.
arXiv Detail & Related papers (2024-06-28T07:09:06Z)
- Tell Me Where You Are: Multimodal LLMs Meet Place Recognition [11.421492098416538]
We introduce multimodal large language models (MLLMs) to visual place recognition (VPR).
Our key design is to use vision-based retrieval to propose several candidates and then leverage language-based reasoning to carefully inspect each candidate for a final decision.
Our results on three datasets demonstrate that integrating the general-purpose visual features from VFMs with the reasoning capabilities of MLLMs already provides an effective place recognition solution.
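The retrieve-then-reason design summarized above can be sketched as a two-stage function. This is a hypothetical illustration, not the paper's code: the cosine-similarity retrieval and the `llm_judge` callable are stand-ins for the vision-foundation-model features and the MLLM inspection step.

```python
from typing import Callable, List, Tuple

def recognize_place(
    query_feat: List[float],
    database: List[Tuple[str, List[float]]],
    llm_judge: Callable[[str], float],
    top_k: int = 3,
) -> str:
    """Hypothetical retrieve-then-reason VPR pipeline: visual features
    propose top-k candidates by similarity, then an MLLM-style judge
    re-scores each candidate to make the final decision."""
    def cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb + 1e-12)

    # Stage 1: vision-based retrieval proposes several candidates.
    ranked = sorted(database, key=lambda kv: cosine(query_feat, kv[1]),
                    reverse=True)
    candidates = ranked[:top_k]
    # Stage 2: language-based reasoning inspects each candidate.
    return max(candidates, key=lambda kv: llm_judge(kv[0]))[0]
```

The key design choice this illustrates is the division of labor: cheap similarity search narrows the database to a handful of candidates so that the expensive per-candidate reasoning call is only made `top_k` times.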
arXiv Detail & Related papers (2024-06-25T12:59:46Z)
- Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs' decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
- Meta Reasoning for Large Language Models [58.87183757029041]
We introduce Meta-Reasoning Prompting (MRP), a novel and efficient system prompting method for large language models (LLMs).
MRP guides LLMs to dynamically select and apply different reasoning methods based on the specific requirements of each task.
We evaluate the effectiveness of MRP through comprehensive benchmarks.
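The dynamic method selection that MRP describes can be sketched as a simple dispatch step. This is a hypothetical illustration of the idea only: the method names, instruction strings, and the `classify` callable are assumptions, not taken from the paper.

```python
from typing import Callable

# Hypothetical catalog of reasoning styles the model can choose between.
REASONING_METHODS = {
    "arithmetic": "Solve step by step, showing each calculation.",
    "planning": "Break the goal into ordered subtasks before answering.",
    "default": "Answer directly and concisely.",
}

def meta_reason(task: str, classify: Callable[[str], str]) -> str:
    """Pick a reasoning method for the task, then prepend the matching
    instruction to form the final prompt (MRP-style dispatch sketch)."""
    method = classify(task)  # in MRP this selection is itself done by the LLM
    instruction = REASONING_METHODS.get(method, REASONING_METHODS["default"])
    return f"{instruction}\n\nTask: {task}"
```

In the paper the selection step is performed by the LLM itself; the toy `classify` callable here just marks where that decision plugs in.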
arXiv Detail & Related papers (2024-06-17T16:14:11Z)
- Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z)
- Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models [76.99140362751787]
We present NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks.
We also present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View features.
arXiv Detail & Related papers (2024-01-02T01:54:22Z)
- Enhancing the Spatial Awareness Capability of Multi-Modal Large Language Model [25.86351431223383]
The Multi-Modal Large Language Model (MLLM) is an extension of the Large Language Model (LLM) equipped with the capability to receive and infer multi-modal data.
This paper proposes using more precise spatial position information between objects to guide MLLM in providing more accurate responses to user-related inquiries.
arXiv Detail & Related papers (2023-10-31T10:57:35Z)
- LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving [87.1164964709168]
This work employs Large Language Models (LLMs) as a decision-making component for complex autonomous driving scenarios.
Extensive experiments demonstrate that our proposed method not only consistently surpasses baseline approaches in single-vehicle tasks, but also helps handle complex driving behaviors, even multi-vehicle coordination.
arXiv Detail & Related papers (2023-10-04T17:59:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences.