VLP: Vision Language Planning for Autonomous Driving
- URL: http://arxiv.org/abs/2401.05577v4
- Date: Sat, 23 Nov 2024 18:49:18 GMT
- Title: VLP: Vision Language Planning for Autonomous Driving
- Authors: Chenbin Pan, Burhaneddin Yaman, Tommaso Nesti, Abhirup Mallik, Alessandro G Allievi, Senem Velipasalar, Liu Ren
- Abstract summary: This paper presents a novel Vision-Language-Planning framework that exploits language models to bridge the gap between linguistic understanding and autonomous driving.
It achieves state-of-the-art end-to-end planning performance on the nuScenes dataset, reducing the average L2 error and collision rate by 35.9% and 60.5%, respectively.
- Score: 52.640371249017335
- Abstract: Autonomous driving is a complex and challenging task that aims at safe motion planning through scene understanding and reasoning. While vision-only autonomous driving methods have recently achieved notable performance through enhanced scene understanding, several key issues remain, including limited reasoning ability, weak generalization, and poor handling of long-tail scenarios. In this paper, we present VLP, a novel Vision-Language-Planning framework that exploits language models to bridge the gap between linguistic understanding and autonomous driving. VLP enhances autonomous driving systems by strengthening both the source memory foundation and the self-driving car's contextual understanding. VLP achieves state-of-the-art end-to-end planning performance on the challenging nuScenes dataset, reducing the average L2 error and collision rate by 35.9% and 60.5%, respectively, compared to the previous best method. Moreover, VLP shows improved performance in challenging long-tail scenarios and strong generalization capabilities when faced with new urban environments.
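For concreteness, the two quoted metrics are the standard open-loop planning measures on nuScenes: the average L2 distance between predicted and ground-truth ego waypoints, and the fraction of predicted waypoints that land in occupied space. Below is a minimal sketch of how such metrics are typically computed; the array shapes, the occupancy-grid representation, and the `to_grid` helper are illustrative assumptions, not VLP's evaluation code.

```python
import numpy as np

def average_l2_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: (T, 2) waypoints in meters; returns the mean L2 distance."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def collision_rate(preds, occupancy_grids, to_grid) -> float:
    """Fraction of predicted waypoints that land on occupied cells.

    preds: list of (T, 2) trajectories; occupancy_grids: list of (H, W)
    boolean arrays; to_grid: callable mapping (x, y) in meters to (row, col).
    """
    hits, total = 0, 0
    for traj, grid in zip(preds, occupancy_grids):
        for x, y in traj:
            r, c = to_grid(x, y)
            hits += int(grid[r, c])
            total += 1
    return hits / max(total, 1)
```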
Related papers
- VLM-Assisted Continual learning for Visual Question Answering in Self-Driving [26.413685340816436]
We propose a novel approach for solving the Visual Question Answering (VQA) task in autonomous driving.
In autonomous driving, VQA plays a vital role in enabling the system to understand and reason about its surroundings.
We present a novel continual learning framework that combines Vision-Language Models with selective memory replay and knowledge distillation (both ingredients are sketched after this entry).
arXiv Detail & Related papers (2025-02-02T16:27:44Z)
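The two ingredients named in the entry above, selective memory replay and knowledge distillation, can be sketched as follows. This is a hedged illustration under assumed interfaces, not the authors' implementation: the buffer's scoring rule, its capacity, and the temperature `T` are placeholders.

```python
import random
import torch.nn.functional as F

class ReplayBuffer:
    """Selective replay: keep only the highest-scoring past samples."""

    def __init__(self, capacity: int = 1000):
        self.capacity, self.items = capacity, []

    def add(self, sample, score: float):
        # The score could be model uncertainty, loss, or rarity (an assumption).
        self.items.append((score, sample))
        self.items.sort(key=lambda p: p[0], reverse=True)
        del self.items[self.capacity:]

    def draw(self, k: int):
        return [s for _, s in random.sample(self.items, min(k, len(self.items)))]

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """KL divergence between the current model and a frozen previous-task model."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```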
- Distilling Multi-modal Large Language Models for Autonomous Driving [64.63127269187814]
Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events.
We propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM.
Training with DiMA results in a 37% reduction in the L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory-error reduction in long-tail scenarios (a distillation sketch follows this entry).
arXiv Detail & Related papers (2025-01-16T18:59:53Z)
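DiMA's central idea, transferring an LLM planner's knowledge into an efficient vision-based planner, can be illustrated with a simple feature-alignment loss. The projection head, tensor shapes, and dimensions below are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class FeatureDistiller(nn.Module):
    """Aligns compact planner features with LLM hidden states (training only)."""

    def __init__(self, planner_dim: int = 256, llm_dim: int = 4096):
        super().__init__()
        # Project planner features into the (larger) LLM embedding space.
        self.proj = nn.Linear(planner_dim, llm_dim)

    def forward(self, planner_feats: torch.Tensor, llm_feats: torch.Tensor):
        # planner_feats: (B, N, planner_dim); llm_feats: (B, N, llm_dim)
        return nn.functional.mse_loss(self.proj(planner_feats), llm_feats)
```

At inference time only the vision-based planner would run; the LLM teacher and this distillation head are dropped, which is how such a design preserves the planner's efficiency.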
- Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving [20.33096710167997]
A generative planning model with 3D-vision language pre-training, named GPVL, is proposed for end-to-end autonomous driving.
A cross-modal language model is introduced to generate holistic driving decisions and fine-grained trajectories.
The effective, robust, and efficient performance of GPVL is believed to be crucial for the practical application of future autonomous driving systems.
arXiv Detail & Related papers (2025-01-15T15:20:46Z)
- Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving [2.0122032639916485]
We analyze effective knowledge distillation of semantic labels to smaller vision networks.
This can be used for the semantic representation of complex scenes in downstream decision-making for planning and control.
arXiv Detail & Related papers (2025-01-12T01:31:07Z)
- VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision [20.43366384946928]
VLM-AD leverages vision-language models (VLMs) as teachers to enhance training (see the sketch after this entry).
VLM-AD achieves significant improvements in planning accuracy and reduced collision rates on the nuScenes dataset.
arXiv Detail & Related papers (2024-12-19T01:53:36Z)
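A hedged sketch of the VLM-as-teacher supervision described above: an auxiliary head on the driving model is trained to match text features produced offline by a frozen VLM, and is discarded at inference. The head design, feature dimensions, and cosine-alignment loss are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn as nn
import torch.nn.functional as F

class TextAlignmentHead(nn.Module):
    """Auxiliary head matching frozen-VLM text features; dropped at inference."""

    def __init__(self, plan_dim: int = 256, text_dim: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(plan_dim, text_dim), nn.GELU(), nn.Linear(text_dim, text_dim)
        )

    def forward(self, plan_feats, vlm_text_feats):
        # Both inputs pooled to (B, D); the loss is 1 - cosine similarity.
        pred = F.normalize(self.head(plan_feats), dim=-1)
        target = F.normalize(vlm_text_feats, dim=-1)
        return (1.0 - (pred * target).sum(dim=-1)).mean()
```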
- CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving [1.727597257312416]
The CoVLA (Comprehensive Vision-Language-Action) dataset comprises real-world driving videos spanning more than 80 hours.
This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems.
arXiv Detail & Related papers (2024-08-19T09:53:49Z)
- DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model [84.29836263441136]
This study introduces DriveGPT4, a novel interpretable end-to-end autonomous driving system based on multimodal large language models (MLLMs).
DriveGPT4 facilitates the interpretation of vehicle actions, offers pertinent reasoning, and effectively addresses a diverse range of questions posed by users.
arXiv Detail & Related papers (2023-10-02T17:59:52Z)
- EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone [67.13773226242242]
Video-language pre-training can generalize to various vision and language tasks.
Previous video-language pre-training frameworks utilize separate video and language encoders and learn task-specific cross-modal information only during fine-tuning.
A new generation of egocentric video-language pre-training instead incorporates cross-modal fusion directly into the video and language backbones, as sketched after this entry.
arXiv Detail & Related papers (2023-07-11T17:50:15Z)
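The phrase "fusion in the backbone" in the entry above can be made concrete with a toy encoder layer in which video tokens attend to language tokens via cross-attention, rather than fusing modalities only in a separate head after encoding. The layer below is an illustrative sketch under assumed dimensions, not EgoVLPv2's architecture.

```python
import torch.nn as nn

class FusedEncoderLayer(nn.Module):
    """Encoder layer with cross-attention to language tokens inside the backbone."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, video_tokens, text_tokens):
        # Standard self-attention over video tokens ...
        v = self.norm1(video_tokens)
        video_tokens = video_tokens + self.self_attn(v, v, v)[0]
        # ... then cross-attention from video (queries) to text (keys/values).
        v = self.norm2(video_tokens)
        video_tokens = video_tokens + self.cross_attn(v, text_tokens, text_tokens)[0]
        return video_tokens
```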
- PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding (see the sketch after this entry).
arXiv Detail & Related papers (2022-05-23T10:17:53Z)
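PEVL's explicit position modeling (entry above) amounts to discretizing box coordinates into a fixed number of bins and emitting them as extra tokens in the text stream. The sketch below assumes a `<pos_k>` token format and 512 bins purely for illustration; the paper's actual tokenization may differ.

```python
def box_to_position_tokens(box, image_w, image_h, num_bins=512):
    """box: (x1, y1, x2, y2) in pixels -> four position-token strings."""
    x1, y1, x2, y2 = box
    normalized = (x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h)
    bins = [min(int(v * num_bins), num_bins - 1) for v in normalized]
    return [f"<pos_{b}>" for b in bins]

# The tokens can be spliced into a caption next to the phrase they ground,
# letting a language model treat positions as ordinary vocabulary.
print(box_to_position_tokens((40, 600, 310, 940), image_w=1600, image_h=1000))
```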
- Connecting Language and Vision for Natural Language-Based Vehicle Retrieval [77.88818029640977]
In this paper, we apply a new modality, i.e., the language description, to search for the vehicle of interest.
To connect language and vision, we propose to jointly train state-of-the-art vision models with a transformer-based language model.
Our proposed method achieved 1st place in the 5th AI City Challenge, yielding a competitive 18.69% MRR (this metric is sketched after this entry).
arXiv Detail & Related papers (2021-05-31T11:42:03Z)
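The MRR figure quoted above is mean reciprocal rank: for each language query, take the reciprocal of the rank at which the true vehicle track appears in the returned list, then average over queries. A minimal sketch, assuming plain lists of candidate IDs as inputs:

```python
def mean_reciprocal_rank(ranked_ids_per_query, true_id_per_query):
    """ranked_ids_per_query: one best-first candidate-ID list per query."""
    total = 0.0
    for ranked, true_id in zip(ranked_ids_per_query, true_id_per_query):
        if true_id in ranked:
            total += 1.0 / (ranked.index(true_id) + 1)
    return total / len(true_id_per_query)

# Example: ranks 1, 4, and a miss give (1 + 0.25 + 0) / 3 ≈ 0.417.
```

A perfect system scores 1.0; an MRR of 18.69% loosely corresponds to the correct track appearing around rank five on average.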
This list is automatically generated from the titles and abstracts of the papers on this site.