DriVLM: Domain Adaptation of Vision-Language Models in Autonomous Driving
- URL: http://arxiv.org/abs/2501.05081v1
- Date: Thu, 09 Jan 2025 09:02:41 GMT
- Title: DriVLM: Domain Adaptation of Vision-Language Models in Autonomous Driving
- Authors: Xuran Zheng, Chang D. Yoo
- Abstract summary: Multimodal large language models (MLLMs) can combine multiple modalities such as images, video, audio, and text.
Most MLLMs require very high computational resources, which is a major challenge for most researchers and developers.
In this paper, we explore the utility of small-scale MLLMs and apply them to the field of autonomous driving.
- Score: 20.644133177870852
- License:
- Abstract: In recent years, large language models have achieved very impressive performance, which has greatly contributed to the development and application of artificial intelligence, and both the parameter counts and the capabilities of these models continue to grow rapidly. In particular, multimodal large language models (MLLMs) can combine multiple modalities such as images, video, audio, and text, and have great potential across a wide range of tasks. However, most MLLMs require very high computational resources, which is a major challenge for most researchers and developers. In this paper, we explore the utility of small-scale MLLMs and apply them to the field of autonomous driving. We hope that this work will advance the application of MLLMs in real-world scenarios.
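The listing does not include code, but the core idea of applying a small-scale MLLM to driving imagery can be illustrated with a minimal sketch. The checkpoint (llava-hf/llava-1.5-7b-hf), the prompt template, and the image path below are assumptions for illustration only, not artifacts of DriVLM.

```python
# Minimal sketch: visual question answering on a driving-scene image with a
# small open MLLM via Hugging Face Transformers. Checkpoint, prompt template,
# and image path are illustrative assumptions, not DriVLM's actual setup.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # assumed small-scale MLLM checkpoint

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("front_camera_frame.jpg")  # hypothetical driving frame
question = "Is it safe for the ego vehicle to change into the left lane? Explain briefly."
prompt = f"USER: <image>\n{question} ASSISTANT:"  # LLaVA-1.5 chat format

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

The title points to domain adaptation (e.g., fine-tuning a small MLLM on driving data); the sketch only shows zero-shot inference, since the listing above does not describe the actual training recipe.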
Related papers
- Benchmarking Large and Small MLLMs [71.78055760441256]
Large multimodal language models (MLLMs) have achieved remarkable advancements in understanding and generating multimodal content.
However, their deployment faces significant challenges, including slow inference, high computational cost, and impracticality for on-device applications.
Small MLLMs, exemplified by the LLaVA-series models and Phi-3-Vision, offer promising alternatives with faster inference, reduced deployment costs, and the ability to handle domain-specific scenarios.
arXiv Detail & Related papers (2025-01-04T07:44:49Z)
- DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving [13.115027801151484]
We introduce DriveMLLM, a benchmark designed to evaluate the spatial understanding capabilities of multimodal large language models (MLLMs) in autonomous driving.
DriveMLLM includes 880 front-facing camera images and introduces both absolute and relative spatial reasoning tasks, accompanied by linguistically diverse natural language questions.
We evaluate several state-of-the-art MLLMs on DriveMLLM, and our results reveal the limitations of current models in understanding complex spatial relationships in driving contexts (a generic evaluation-loop sketch for this style of benchmark appears at the end of this page).
arXiv Detail & Related papers (2024-11-20T08:14:01Z)
- Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance [78.48606021719206]
Mini-InternVL is a series of MLLMs with parameters ranging from 1B to 4B, which achieve 90% of the performance of their much larger counterparts with only 5% of the parameters.
We develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks.
arXiv Detail & Related papers (2024-10-21T17:58:20Z)
- A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically surveys the applications of MLLMs in multimodal tasks spanning natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z)
- The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective [53.48484062444108]
We find that model development and data development are not two separate paths but are interconnected.
On the one hand, larger and higher-quality data contribute to better MLLM performance; on the other hand, MLLMs can in turn facilitate the development of data.
To promote data-model co-development for the MLLM community, we systematically review existing work on MLLMs from the data-model co-development perspective.
arXiv Detail & Related papers (2024-07-11T15:08:11Z)
- Efficient Multimodal Large Language Models: A Survey [60.7614299984182]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding and reasoning.
However, their large model sizes and high training and inference costs have hindered the widespread application of MLLMs in academia and industry.
This survey provides a comprehensive and systematic review of the current state of efficient MLLMs.
arXiv Detail & Related papers (2024-05-17T12:37:10Z)
- A Review of Multi-Modal Large Language and Vision Models [1.9685736810241874]
Large Language Models (LLMs) have emerged as a focal point of research and application.
Recently, LLMs have been extended into multi-modal large language models (MM-LLMs).
This paper provides an extensive review of the current state of LLMs with multi-modal capabilities as well as the most recent MM-LLMs.
arXiv Detail & Related papers (2024-03-28T15:53:45Z)
- Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models [76.99140362751787]
We present NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks.
We also present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View features.
arXiv Detail & Related papers (2024-01-02T01:54:22Z)
- A Survey on Multimodal Large Language Models for Autonomous Driving [31.614730391949657]
Multimodal AI systems that benefit from large models have the potential to perceive the real world, make decisions, and control tools much as humans do.
Despite this immense potential, there is still a lack of comprehensive understanding of the key challenges, opportunities, and future directions for applying Multimodal Large Language Models to driving systems.
arXiv Detail & Related papers (2023-11-21T03:32:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
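As referenced in the DriveMLLM entry above, benchmarks of that kind pair camera images with natural-language spatial questions and score model answers. The sketch below is a generic, hypothetical evaluation loop: the JSONL schema ("image", "question", "answer"), the answer normalization, and the exact-match accuracy metric are assumptions for illustration, not DriveMLLM's actual protocol or data format.

```python
# Hypothetical evaluation loop for a spatial-reasoning VQA benchmark.
# Schema, normalization, and exact-match accuracy are illustrative
# assumptions, not the official DriveMLLM protocol.
import json
from typing import Callable

def normalize(text: str) -> str:
    """Lowercase and drop punctuation so answers compare as crude strings."""
    return "".join(ch for ch in text.lower().strip() if ch.isalnum() or ch.isspace()).strip()

def evaluate(jsonl_path: str, answer_fn: Callable[[str, str], str]) -> float:
    """answer_fn(image_path, question) -> model answer; returns exact-match accuracy."""
    correct = total = 0
    with open(jsonl_path) as f:
        for line in f:
            example = json.loads(line)  # assumed: {"image": ..., "question": ..., "answer": ...}
            prediction = answer_fn(example["image"], example["question"])
            correct += int(normalize(prediction) == normalize(example["answer"]))
            total += 1
    return correct / max(total, 1)

# Usage: plug in any MLLM wrapper, e.g. the LLaVA inference sketch above
# wrapped as a function answer_fn(image_path, question).
```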