VLP: Vision Language Planning for Autonomous Driving
- URL: http://arxiv.org/abs/2401.05577v4
- Date: Sat, 23 Nov 2024 18:49:18 GMT
- Title: VLP: Vision Language Planning for Autonomous Driving
- Authors: Chenbin Pan, Burhaneddin Yaman, Tommaso Nesti, Abhirup Mallik, Alessandro G Allievi, Senem Velipasalar, Liu Ren
- Abstract summary: This paper presents a novel Vision-Language-Planning framework that exploits language models to bridge the gap between linguistic understanding and autonomous driving.
It achieves state-of-the-art end-to-end planning performance on the nuScenes dataset, reducing the average L2 error and collision rate by 35.9% and 60.5%, respectively.
- Score: 52.640371249017335
- Abstract: Autonomous driving is a complex and challenging task that aims at safe motion planning through scene understanding and reasoning. While vision-only autonomous driving methods have recently achieved notable performance through enhanced scene understanding, several key issues remain, including limited reasoning ability, weak generalization, and poor handling of long-tail scenarios. In this paper, we present VLP, a novel Vision-Language-Planning framework that exploits language models to bridge the gap between linguistic understanding and autonomous driving. VLP enhances autonomous driving systems by strengthening both the source memory foundation and the self-driving car's contextual understanding. VLP achieves state-of-the-art end-to-end planning performance on the challenging nuScenes dataset, reducing the average L2 error and collision rate by 35.9% and 60.5%, respectively, compared to the previous best method. Moreover, VLP shows improved performance in challenging long-tail scenarios and strong generalization capabilities when faced with new urban environments.
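For concreteness, the two quoted metrics are the standard open-loop planning measures on nuScenes: the average L2 distance between predicted and ground-truth ego waypoints, and the fraction of predicted waypoints that land in occupied space. Below is a minimal sketch of how such metrics are typically computed; the array shapes, the occupancy-grid representation, and the `to_grid` helper are illustrative assumptions, not VLP's evaluation code.

```python
import numpy as np

def average_l2_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: (T, 2) waypoints in meters; returns the mean L2 distance."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def collision_rate(preds, occupancy_grids, to_grid) -> float:
    """Fraction of predicted waypoints that land on occupied cells.

    preds: list of (T, 2) trajectories; occupancy_grids: list of (H, W)
    boolean arrays; to_grid: callable mapping (x, y) in meters to (row, col).
    """
    hits, total = 0, 0
    for traj, grid in zip(preds, occupancy_grids):
        for x, y in traj:
            r, c = to_grid(x, y)
            hits += int(grid[r, c])
            total += 1
    return hits / max(total, 1)
```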
Related papers
- VLM-Assisted Continual learning for Visual Question Answering in Self-Driving [26.413685340816436]
We propose a novel approach for solving the Visual Question Answering (VQA) task in autonomous driving.
In autonomous driving, VQA plays a vital role in enabling the system to understand and reason about its surroundings.
We present a novel continual learning framework that combines Vision-Language Models with selective memory replay and knowledge distillation (both ingredients are sketched after this entry).
arXiv Detail & Related papers (2025-02-02T16:27:44Z)
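The two ingredients named in the entry above, selective memory replay and knowledge distillation, can be sketched as follows. This is a hedged illustration under assumed interfaces, not the authors' implementation: the buffer's scoring rule, its capacity, and the temperature `T` are placeholders.

```python
import random
import torch.nn.functional as F

class ReplayBuffer:
    """Selective replay: keep only the highest-scoring past samples."""

    def __init__(self, capacity: int = 1000):
        self.capacity, self.items = capacity, []

    def add(self, sample, score: float):
        # The score could be model uncertainty, loss, or rarity (an assumption).
        self.items.append((score, sample))
        self.items.sort(key=lambda p: p[0], reverse=True)
        del self.items[self.capacity:]

    def draw(self, k: int):
        return [s for _, s in random.sample(self.items, min(k, len(self.items)))]

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """KL divergence between the current model and a frozen previous-task model."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```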
- Distilling Multi-modal Large Language Models for Autonomous Driving [64.63127269187814]
Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events.
We propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM.
Training with DiMA results in a 37% reduction in the L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory-error reduction in long-tail scenarios (a distillation sketch follows this entry).
arXiv Detail & Related papers (2025-01-16T18:59:53Z)
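DiMA's central idea, transferring an LLM planner's knowledge into an efficient vision-based planner, can be illustrated with a simple feature-alignment loss. The projection head, tensor shapes, and dimensions below are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class FeatureDistiller(nn.Module):
    """Aligns compact planner features with LLM hidden states (training only)."""

    def __init__(self, planner_dim: int = 256, llm_dim: int = 4096):
        super().__init__()
        # Project planner features into the (larger) LLM embedding space.
        self.proj = nn.Linear(planner_dim, llm_dim)

    def forward(self, planner_feats: torch.Tensor, llm_feats: torch.Tensor):
        # planner_feats: (B, N, planner_dim); llm_feats: (B, N, llm_dim)
        return nn.functional.mse_loss(self.proj(planner_feats), llm_feats)
```

At inference time only the vision-based planner would run; the LLM teacher and this distillation head are dropped, which is how such a design preserves the planner's efficiency.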
- Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving [20.33096710167997]
A generative planning model with 3D-vision language pre-training, named GPVL, is proposed for end-to-end autonomous driving.
A cross-modal language model is introduced to generate holistic driving decisions and fine-grained trajectories.
The effective, robust, and efficient performance of GPVL is believed to be crucial for the practical application of future autonomous driving systems.
arXiv Detail & Related papers (2025-01-15T15:20:46Z)
- Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving [2.0122032639916485]
We analyze effective knowledge distillation of semantic labels to smaller vision networks.
This can be used for the semantic representation of complex scenes in downstream decision-making for planning and control.
arXiv Detail & Related papers (2025-01-12T01:31:07Z)
- VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision [20.43366384946928]
VLM-AD leverages vision-language models (VLMs) as teachers to enhance training (see the sketch after this entry).
VLM-AD achieves significant improvements in planning accuracy and reduced collision rates on the nuScenes dataset.
arXiv Detail & Related papers (2024-12-19T01:53:36Z)
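A hedged sketch of the VLM-as-teacher supervision described above: an auxiliary head on the driving model is trained to match text features produced offline by a frozen VLM, and is discarded at inference. The head design, feature dimensions, and cosine-alignment loss are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn as nn
import torch.nn.functional as F

class TextAlignmentHead(nn.Module):
    """Auxiliary head matching frozen-VLM text features; dropped at inference."""

    def __init__(self, plan_dim: int = 256, text_dim: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(plan_dim, text_dim), nn.GELU(), nn.Linear(text_dim, text_dim)
        )

    def forward(self, plan_feats, vlm_text_feats):
        # Both inputs pooled to (B, D); the loss is 1 - cosine similarity.
        pred = F.normalize(self.head(plan_feats), dim=-1)
        target = F.normalize(vlm_text_feats, dim=-1)
        return (1.0 - (pred * target).sum(dim=-1)).mean()
```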
- CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving [1.727597257312416]
The CoVLA (Comprehensive Vision-Language-Action) dataset comprises real-world driving videos spanning more than 80 hours.
This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems.
arXiv Detail & Related papers (2024-08-19T09:53:49Z)
- DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model [84.29836263441136]
This study introduces DriveGPT4, a novel interpretable end-to-end autonomous driving system based on multimodal large language models (MLLMs).
DriveGPT4 facilitates the interpretation of vehicle actions, offers pertinent reasoning, and effectively addresses a diverse range of questions posed by users.
arXiv Detail & Related papers (2023-10-02T17:59:52Z)
- EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone [67.13773226242242]
Video-language pre-training can generalize to various vision and language tasks.
Previous video-language pre-training frameworks utilize separate video and language encoders and learn task-specific cross-modal information only during fine-tuning.
A new generation of egocentric video-language pre-training instead incorporates cross-modal fusion directly into the video and language backbones, as sketched after this entry.
arXiv Detail & Related papers (2023-07-11T17:50:15Z)
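The phrase "fusion in the backbone" in the entry above can be made concrete with a toy encoder layer in which video tokens attend to language tokens via cross-attention, rather than fusing modalities only in a separate head after encoding. The layer below is an illustrative sketch under assumed dimensions, not EgoVLPv2's architecture.

```python
import torch.nn as nn

class FusedEncoderLayer(nn.Module):
    """Encoder layer with cross-attention to language tokens inside the backbone."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, video_tokens, text_tokens):
        # Standard self-attention over video tokens ...
        v = self.norm1(video_tokens)
        video_tokens = video_tokens + self.self_attn(v, v, v)[0]
        # ... then cross-attention from video (queries) to text (keys/values).
        v = self.norm2(video_tokens)
        video_tokens = video_tokens + self.cross_attn(v, text_tokens, text_tokens)[0]
        return video_tokens
```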
- PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding (see the sketch after this entry).
arXiv Detail & Related papers (2022-05-23T10:17:53Z)
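PEVL's explicit position modeling (entry above) amounts to discretizing box coordinates into a fixed number of bins and emitting them as extra tokens in the text stream. The sketch below assumes a `<pos_k>` token format and 512 bins purely for illustration; the paper's actual tokenization may differ.

```python
def box_to_position_tokens(box, image_w, image_h, num_bins=512):
    """box: (x1, y1, x2, y2) in pixels -> four position-token strings."""
    x1, y1, x2, y2 = box
    normalized = (x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h)
    bins = [min(int(v * num_bins), num_bins - 1) for v in normalized]
    return [f"<pos_{b}>" for b in bins]

# The tokens can be spliced into a caption next to the phrase they ground,
# letting a language model treat positions as ordinary vocabulary.
print(box_to_position_tokens((40, 600, 310, 940), image_w=1600, image_h=1000))
```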
- Connecting Language and Vision for Natural Language-Based Vehicle Retrieval [77.88818029640977]
In this paper, we apply a new modality, i.e., the language description, to search for the vehicle of interest.
To connect language and vision, we propose to jointly train state-of-the-art vision models with a transformer-based language model.
Our proposed method achieved 1st place in the 5th AI City Challenge, yielding a competitive 18.69% MRR (this metric is sketched after this entry).
arXiv Detail & Related papers (2021-05-31T11:42:03Z)
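The MRR figure quoted above is mean reciprocal rank: for each language query, take the reciprocal of the rank at which the true vehicle track appears in the returned list, then average over queries. A minimal sketch, assuming plain lists of candidate IDs as inputs:

```python
def mean_reciprocal_rank(ranked_ids_per_query, true_id_per_query):
    """ranked_ids_per_query: one best-first candidate-ID list per query."""
    total = 0.0
    for ranked, true_id in zip(ranked_ids_per_query, true_id_per_query):
        if true_id in ranked:
            total += 1.0 / (ranked.index(true_id) + 1)
    return total / len(true_id_per_query)

# Example: ranks 1, 4, and a miss give (1 + 0.25 + 0) / 3 ≈ 0.417.
```

A perfect system scores 1.0; an MRR of 18.69% loosely corresponds to the correct track appearing around rank five on average.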
This list is automatically generated from the titles and abstracts of the papers on this site.