VLP: Vision Language Planning for Autonomous Driving
- URL: http://arxiv.org/abs/2401.05577v3
- Date: Sat, 9 Mar 2024 20:22:04 GMT
- Title: VLP: Vision Language Planning for Autonomous Driving
- Authors: Chenbin Pan, Burhaneddin Yaman, Tommaso Nesti, Abhirup Mallik,
Alessandro G Allievi, Senem Velipasalar, Liu Ren
- Abstract summary: This paper presents a novel Vision-Language-Planning framework that exploits language models to bridge the gap between linguistic understanding and autonomous driving.
It achieves state-of-the-art end-to-end planning performance on the NuScenes dataset, reducing average L2 error and collision rate by 35.9% and 60.5%, respectively, relative to the previous best method.
- Score: 54.907602890752045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous driving is a complex and challenging task that aims at safe motion
planning through scene understanding and reasoning. While vision-only
autonomous driving methods have recently achieved notable performance, through
enhanced scene understanding, several key issues, including lack of reasoning,
low generalization performance and long-tail scenarios, still need to be
addressed. In this paper, we present VLP, a novel Vision-Language-Planning
framework that exploits language models to bridge the gap between linguistic
understanding and autonomous driving. VLP enhances autonomous driving systems
by strengthening both the source memory foundation and the self-driving car's
contextual understanding. VLP achieves state-of-the-art end-to-end planning
performance on the challenging NuScenes dataset by achieving 35.9% and 60.5%
reduction in terms of average L2 error and collision rates, respectively,
compared to the previous best method. Moreover, VLP shows improved performance
in challenging long-tail scenarios and strong generalization capabilities when
faced with new urban environments.
Related papers
- Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving [43.156632952193966]
Traditional end-to-end driving models suffer from long-tail events due to rare or unseen inputs within their training distributions.
We propose TOKEN, a novel Multi-Modal Large Language Model (MM-LLM) that tokenizes the world into object-level knowledge.
TOKEN effectively alleviates data scarcity and inefficient tokenization by leveraging a traditional end-to-end driving model.
arXiv Detail & Related papers (2024-07-01T04:34:50Z)
- DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models [31.552397390480525]
We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs).
DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning.
We propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline.
arXiv Detail & Related papers (2024-02-19T17:04:04Z)
- RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model [22.25903116720301]
Explainability plays a critical role in trustworthy autonomous decision-making.
Recent advancements in Multi-Modal Large Language Models (MLLMs) have shown promising potential for enhancing the explainability of a driving agent.
We present RAG-Driver, a novel retrieval-augmented multi-modal large language model that leverages in-context learning for high-performance, explainable, and generalisable autonomous driving.
arXiv Detail & Related papers (2024-02-16T16:57:18Z)
- LLM-Assist: Enhancing Closed-Loop Planning with Language-Based Reasoning [65.86754998249224]
We develop a novel hybrid planner that leverages a conventional rule-based planner in conjunction with an LLM-based planner.
Our approach navigates complex scenarios that existing planners struggle with and produces well-reasoned outputs while remaining grounded by working alongside the rule-based approach.
arXiv Detail & Related papers (2023-12-30T02:53:45Z)
- Receive, Reason, and React: Drive as You Say with Large Language Models in Autonomous Vehicles [13.102404404559428]
We propose a novel framework that leverages Large Language Models (LLMs) to enhance the decision-making process in autonomous vehicles.
Our research includes experiments in HighwayEnv, a collection of environments for autonomous driving and tactical decision-making tasks.
We also examine real-time personalization, demonstrating how LLMs can influence driving behaviors based on verbal commands.
arXiv Detail & Related papers (2023-10-12T04:56:01Z)
- GPT-Driver: Learning to Drive with GPT [47.14350537515685]
We present a simple yet effective approach that can transform the OpenAI GPT-3.5 model into a reliable motion planner for autonomous vehicles.
We capitalize on the strong reasoning capabilities and generalization potential inherent to Large Language Models (LLMs).
We evaluate our approach on the large-scale nuScenes dataset, and extensive experiments substantiate the effectiveness, generalization ability, and interpretability of our GPT-based motion planner.
arXiv Detail & Related papers (2023-10-02T17:59:57Z)
- DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model [84.29836263441136]
This study introduces DriveGPT4, a novel interpretable end-to-end autonomous driving system based on multimodal large language models (MLLMs).
DriveGPT4 facilitates the interpretation of vehicle actions, offers pertinent reasoning, and effectively addresses a diverse range of questions posed by users.
Evaluations conducted on the BDD-X dataset showcase the superior qualitative and quantitative performance of DriveGPT4.
arXiv Detail & Related papers (2023-10-02T17:59:52Z)
- EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone [67.13773226242242]
Video-language pre-training can generalize to various vision and language tasks.
Video-language pre-training frameworks utilize separate video and language encoders and learn task-specific cross-modal information only during fine-tuning.
A new generation of egocentric video-language pre-training incorporates cross-modal fusion directly into the video and language backbones.
arXiv Detail & Related papers (2023-07-11T17:50:15Z)
- PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding.
arXiv Detail & Related papers (2022-05-23T10:17:53Z)
- Connecting Language and Vision for Natural Language-Based Vehicle Retrieval [77.88818029640977]
In this paper, we apply a new modality, i.e., the language description, to search for the vehicle of interest.
To connect language and vision, we propose to jointly train the state-of-the-art vision models with the transformer-based language model.
Our proposed method achieved 1st place in the 5th AI City Challenge, with a competitive 18.69% MRR accuracy.
arXiv Detail & Related papers (2021-05-31T11:42:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.