V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models
- URL: http://arxiv.org/abs/2408.09251v3
- Date: Thu, 19 Jun 2025 05:02:13 GMT
- Title: V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models
- Authors: Junwei You, Haotian Shi, Zhuoyu Jiang, Zilin Huang, Rui Gan, Keshu Wu, Xi Cheng, Xiaopeng Li, Bin Ran
- Abstract summary: Vehicle-to-everything (V2X) cooperation has emerged as a promising paradigm to overcome the perception limitations of classical autonomous driving. This paper introduces V2X-VLM, a novel end-to-end (E2E) cooperative autonomous driving framework based on vision-language models (VLMs). V2X-VLM integrates multi-perspective camera views from vehicles and infrastructure with text-based scene descriptions to enable a more comprehensive understanding of driving environments.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vehicle-to-everything (V2X) cooperation has emerged as a promising paradigm to overcome the perception limitations of classical autonomous driving by leveraging information from both ego-vehicle and infrastructure sensors. However, effectively fusing heterogeneous visual and semantic information while ensuring robust trajectory planning remains a significant challenge. This paper introduces V2X-VLM, a novel end-to-end (E2E) cooperative autonomous driving framework based on vision-language models (VLMs). V2X-VLM integrates multi-perspective camera views from vehicles and infrastructure with text-based scene descriptions to enable a more comprehensive understanding of driving environments. Specifically, we propose a contrastive learning-based mechanism to reinforce the alignment of heterogeneous visual and textual features, which enhances the semantic understanding of complex driving scenarios, and we employ a knowledge distillation strategy to stabilize training. Experiments on a large real-world dataset demonstrate that V2X-VLM achieves state-of-the-art trajectory planning accuracy, significantly reducing L2 error and collision rate compared to existing cooperative autonomous driving baselines. Ablation studies validate the contributions of each component. Moreover, evaluations of robustness and efficiency highlight the practicality of V2X-VLM for real-world deployment to enhance overall autonomous driving safety and decision-making.
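The abstract does not spell out these two auxiliary objectives, but a CLIP-style contrastive alignment between fused camera features and text embeddings, plus a soft-label distillation term against a frozen teacher, is one plausible reading. The PyTorch sketch below is a minimal illustration under that assumption; the function names, temperature, and loss weights are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE between image and text embeddings (CLIP-style).

    img_emb, txt_emb: (B, D) projections of the fused multi-view camera
    features and the scene-description text; in-batch pairing is assumed.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Matched (image, text) pairs sit on the diagonal.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def distillation_loss(student_logits, teacher_logits, tau=2.0):
    """Soft-label KD (Hinton-style), here assumed as a training stabilizer."""
    s = F.log_softmax(student_logits / tau, dim=-1)
    t = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau ** 2

# Assumed total objective: planning loss plus weighted auxiliary terms, e.g.
# loss = l2_planning_loss + 0.1 * contrastive_alignment_loss(i, t) \
#        + 0.5 * distillation_loss(student_logits, teacher_logits)
```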
Related papers
- Research Challenges and Progress in the End-to-End V2X Cooperative Autonomous Driving Competition
Vehicle-to-everything (V2X) communication has emerged as a key enabler for extending perception range and enhancing driving safety. We organized the End-to-End Autonomous Driving through V2X Cooperation Challenge, which features two tracks: cooperative temporal perception and cooperative end-to-end planning. This paper describes the design and outcomes of the challenge and highlights key research problems, including bandwidth-aware fusion, robust multi-agent planning, and heterogeneous sensor integration.
arXiv Detail & Related papers (2025-07-29T09:06:40Z)
- V2X-UniPool: Unifying Multimodal Perception and Knowledge Reasoning for Autonomous Driving
V2X-UniPool is a unified framework that integrates multimodal Vehicle-to-Everything (V2X) data into a time-indexed, language-based knowledge pool. Our system enables autonomous driving systems to perform accurate, temporally consistent reasoning over both the static environment and dynamic traffic context; a toy sketch of such a pool follows this entry.
arXiv Detail & Related papers (2025-06-03T08:00:57Z)
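The abstract only says the V2X data is unified into a "time-indexed and language-based knowledge pool"; a toy plain-Python version of such a structure, with assumed entry formats, might look like this:

```python
import bisect
from dataclasses import dataclass, field

@dataclass
class KnowledgePool:
    """Toy time-indexed pool of language-form V2X facts (illustrative only).

    Entries are (timestamp, text) pairs, e.g. descriptions distilled from
    roadside and onboard sensors; a window query retrieves everything in a
    time span so a reasoner can condition on temporally consistent context.
    """
    times: list = field(default_factory=list)   # sorted timestamps (s)
    texts: list = field(default_factory=list)   # parallel text entries

    def add(self, t: float, text: str) -> None:
        i = bisect.bisect(self.times, t)        # keep timestamps sorted
        self.times.insert(i, t)
        self.texts.insert(i, text)

    def window(self, t0: float, t1: float) -> list:
        lo = bisect.bisect_left(self.times, t0)
        hi = bisect.bisect_right(self.times, t1)
        return self.texts[lo:hi]

pool = KnowledgePool()
pool.add(12.0, "RSU-3: pedestrian crossing at intersection A")
pool.add(12.4, "ego: occluded view on the right approach")
print(pool.window(11.5, 12.5))  # context for the current planning step
```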
- SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving
SOLVE is an innovative framework that synergizes vision-language models with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between the VLM and E2E components; a stand-in sketch of this shared-encoder layout follows this entry.
arXiv Detail & Related papers (2025-05-22T15:44:30Z)
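SOLVE's feature-level sharing can be sketched as one visual backbone feeding both a language head and a trajectory head; the modules and shapes below are stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SharedEncoderPlanner(nn.Module):
    """Illustrative reading of feature-level sharing: one visual encoder
    feeds both a language head (VLM branch) and an E2E trajectory head.
    All module shapes here are assumptions."""

    def __init__(self, feat_dim=256, vocab=32000, horizon=6):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.Sequential(                  # stand-in backbone
            nn.Conv2d(3, 64, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.lm_head = nn.Linear(feat_dim, vocab)          # VLM branch
        self.traj_head = nn.Linear(feat_dim, horizon * 2)  # planning branch

    def forward(self, img):
        z = self.encoder(img)                          # shared features
        return self.lm_head(z), self.traj_head(z).view(-1, self.horizon, 2)

tokens, traj = SharedEncoderPlanner()(torch.randn(2, 3, 224, 224))
```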
- Towards Intelligent Transportation with Pedestrians and Vehicles In-the-Loop: A Surveillance Video-Assisted Federated Digital Twin Framework
We propose a surveillance video-assisted federated digital twin (SV-FDT) framework to empower ITSs with pedestrians and vehicles in-the-loop.
The architecture consists of three layers: (i) the end layer, which collects traffic surveillance videos from multiple sources; (ii) the edge layer, responsible for semantic segmentation-based visual understanding, twin-agent-based interaction modeling, and local digital twin system (LDTS) creation in local regions; and (iii) the cloud layer, which integrates LDTSs across different regions to construct a global DT model in real time.
arXiv Detail & Related papers (2025-03-06T07:36:06Z)
- VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion
We propose a novel framework that uses vision-language models to enhance training by providing attentional cues. Our method integrates textual representations into Bird's-Eye-View (BEV) features for semantic supervision; one possible form of this fusion is sketched after this entry. We evaluate VLM-E2E on the nuScenes dataset and demonstrate its superiority over state-of-the-art approaches.
arXiv Detail & Related papers (2025-02-25T10:02:12Z)
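One plausible reading of injecting textual representations into BEV features is to broadcast a projected text embedding over the BEV grid and mix it in with a 1x1 convolution; the sketch below assumes that design, which is not necessarily VLM-E2E's.

```python
import torch
import torch.nn as nn

class TextBEVFusion(nn.Module):
    """Minimal sketch of fusing a text embedding into BEV features:
    project, broadcast over the grid, concatenate, and mix (assumed)."""

    def __init__(self, bev_ch=64, txt_dim=512):
        super().__init__()
        self.proj = nn.Linear(txt_dim, bev_ch)
        self.mix = nn.Conv2d(2 * bev_ch, bev_ch, kernel_size=1)

    def forward(self, bev, txt):                # bev: (B,C,H,W), txt: (B,T)
        t = self.proj(txt)[:, :, None, None]    # (B,C,1,1)
        t = t.expand_as(bev)                    # broadcast over the grid
        return self.mix(torch.cat([bev, t], dim=1))

fused = TextBEVFusion()(torch.randn(2, 64, 200, 200), torch.randn(2, 512))
```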
- LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving
LargeAD is a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets.
Our framework leverages vision foundation models (VFMs) to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples; a simplified version of this pairing is sketched after this entry.
Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning tasks for both LiDAR-based segmentation and object detection.
arXiv Detail & Related papers (2025-01-07T18:59:59Z)
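The superpixel-to-point alignment can be illustrated by pooling 2D features per superpixel, pooling the LiDAR point features that project into the same superpixel, and contrasting the paired pools. The assignment tensors below are assumed to be precomputed from camera calibration; this is not LargeAD's released code.

```python
import torch
import torch.nn.functional as F

def superpixel_point_contrast(pix_feat, pt_feat, sp_id_pix, sp_id_pt,
                              num_sp, temperature=0.07):
    """Contrast pooled superpixel features (2D VFM side) against pooled
    LiDAR point features assigned to the same superpixel.

    pix_feat: (Np, D) pixel features; pt_feat: (Nq, D) point features;
    sp_id_pix / sp_id_pt: long tensors mapping each pixel / point to a
    superpixel id in [0, num_sp). Assumes every superpixel is non-empty.
    """
    def pool(feat, ids):
        out = torch.zeros(num_sp, feat.size(1))
        cnt = torch.zeros(num_sp, 1)
        out.index_add_(0, ids, feat)                      # sum per superpixel
        cnt.index_add_(0, ids, torch.ones(ids.size(0), 1))
        return F.normalize(out / cnt.clamp(min=1), dim=-1)

    img_sp = pool(pix_feat, sp_id_pix)          # (S, D) image-side anchors
    pts_sp = pool(pt_feat, sp_id_pt)            # (S, D) point-side positives
    logits = img_sp @ pts_sp.t() / temperature  # diagonal = matched pairs
    targets = torch.arange(num_sp)
    return F.cross_entropy(logits, targets)
```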
- Hybrid-Generative Diffusion Models for Attack-Oriented Twin Migration in Vehicular Metaverses
Vehicle Twins (VTs) are digital twins that provide immersive virtual services for Vehicular Metaverse Users (VMUs).
High mobility of vehicles, uneven deployment of edge servers, and potential security threats pose challenges to achieving efficient and reliable VT migrations.
We propose a secure and reliable VT migration framework in vehicular metaverses.
arXiv Detail & Related papers (2024-07-05T11:11:33Z)
- Probing Multimodal LLMs as World Models for Driving
We look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving.
Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored.
arXiv Detail & Related papers (2024-05-09T17:52:42Z)
- Unified End-to-End V2X Cooperative Autonomous Driving
UniE2EV2X is a V2X-integrated end-to-end autonomous driving system that consolidates key driving modules within a unified network.
The framework employs a deformable attention-based data fusion strategy, effectively facilitating cooperation between vehicles and infrastructure.
We implement the UniE2EV2X framework on DeepAccident, a challenging simulation dataset designed for V2X cooperative driving; a simplified deformable-fusion sketch follows this entry.
arXiv Detail & Related papers (2024-05-07T03:01:40Z)
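A single-head, single-scale caricature of deformable-attention fusion, where ego BEV queries sample a few learned offsets from the infrastructure feature map, is sketched below; it is simplified well beyond the paper's actual module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableFusion(nn.Module):
    """Toy deformable attention: each ego query predicts K sampling offsets
    around its reference point on the infrastructure BEV map, gathers the
    features by bilinear sampling, and mixes them with learned weights."""

    def __init__(self, dim=64, k=4):
        super().__init__()
        self.k = k
        self.offsets = nn.Linear(dim, 2 * k)    # per-query (dx, dy) offsets
        self.weights = nn.Linear(dim, k)

    def forward(self, q, ref, infra):
        # q: (B,N,C) ego queries; ref: (B,N,2) reference points in [-1,1];
        # infra: (B,C,H,W) infrastructure BEV features.
        B, N, C = q.shape
        off = self.offsets(q).view(B, N, self.k, 2).tanh() * 0.1
        loc = (ref[:, :, None, :] + off).clamp(-1, 1)              # (B,N,K,2)
        sampled = F.grid_sample(infra, loc, align_corners=False)   # (B,C,N,K)
        w = self.weights(q).softmax(-1)                            # (B,N,K)
        return (sampled.permute(0, 2, 3, 1) * w[..., None]).sum(2) # (B,N,C)
```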
- DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
We introduce DriveVLM, an autonomous driving system leveraging vision-language models (VLMs).
DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning.
We propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline; a toy gating sketch follows this entry.
arXiv Detail & Related papers (2024-02-19T17:04:04Z)
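The hybrid idea can be caricatured as confidence-gated routing between a fast classical planner and a slower VLM planner; the gating rule and threshold below are illustrative assumptions, not the paper's design.

```python
from typing import Callable, List, Tuple

Trajectory = List[Tuple[float, float]]  # (x, y) waypoints

def dual_plan(scene,
              fast_planner: Callable[[object], Tuple[Trajectory, float]],
              vlm_planner: Callable[[object], Trajectory],
              conf_threshold: float = 0.8) -> Trajectory:
    """Toy hybrid dispatcher: the classical pipeline plans every frame;
    the (slower) VLM is consulted only on low-confidence, long-tail scenes.
    Threshold and confidence signal are hypothetical."""
    traj, confidence = fast_planner(scene)
    if confidence >= conf_threshold:
        return traj                       # common case: classical output
    return vlm_planner(scene)             # hard case: VLM reasoning
```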
- VLP: Vision Language Planning for Autonomous Driving
This paper presents a novel Vision-Language-Planning framework that exploits language models to bridge the gap between linguistic understanding and autonomous driving.
It achieves state-of-the-art end-to-end planning performance on the NuScenes dataset, reducing average L2 error and collision rate by 35.9% and 60.5%, respectively.
arXiv Detail & Related papers (2024-01-10T23:00:40Z)
- V2X-Lead: LiDAR-based End-to-End Autonomous Driving with Vehicle-to-Everything Communication Integration
This paper presents a LiDAR-based end-to-end autonomous driving method with Vehicle-to-Everything (V2X) communication integration.
The proposed method aims to handle imperfect partial observations by fusing the onboard LiDAR sensor and V2X communication data.
arXiv Detail & Related papers (2023-09-26T20:26:03Z)
- Generative AI-empowered Simulation for Autonomous Driving in Vehicular Mixed Reality Metaverses
In the vehicular mixed reality (MR) Metaverse, the distance between physical and virtual entities can be overcome.
Large-scale traffic and driving simulation via realistic data collection and fusion from the physical world is difficult and costly.
We propose an autonomous driving architecture, where generative AI is leveraged to synthesize unlimited conditioned traffic and driving data in simulations.
arXiv Detail & Related papers (2023-02-16T16:54:10Z)
- COOPERNAUT: End-to-End Driving with Cooperative Perception for Networked Vehicles
We introduce COOPERNAUT, an end-to-end learning model that uses cross-vehicle perception for vision-based cooperative driving.
Our experiments on AutoCastSim suggest that our cooperative perception driving models lead to a 40% improvement in average success rate.
arXiv Detail & Related papers (2022-05-04T17:55:12Z)
- V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer
We build a holistic attention model, namely V2X-ViT, to fuse information across on-road agents.
V2X-ViT consists of alternating layers of heterogeneous multi-agent self-attention and multi-scale window self-attention; a schematic of this alternation follows this entry.
To validate our approach, we create a large-scale V2X perception dataset.
arXiv Detail & Related papers (2022-03-20T20:18:25Z)
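The alternating-layer structure can be sketched as attention across agents at each BEV location followed by windowed spatial attention per agent; heterogeneity encodings, multi-scale windows, residuals, and norms are omitted, so this is schematic only.

```python
import torch
import torch.nn as nn

class AlternatingAttentionPair(nn.Module):
    """Schematic pair of alternating layers in the spirit of V2X-ViT:
    (1) attention over agents per BEV token, (2) windowed attention over
    BEV tokens per agent. Stand-in modules, not the released architecture."""

    def __init__(self, dim=64, heads=4, window=10):
        super().__init__()
        self.agent_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.win_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window = window

    def forward(self, x):                        # x: (B, A, L, C) with
        B, A, L, C = x.shape                     # A agents, L BEV tokens
        assert L % self.window == 0              # sketch needs exact windows
        t = x.permute(0, 2, 1, 3).reshape(B * L, A, C)  # attend over agents
        t, _ = self.agent_attn(t, t, t)
        x = t.reshape(B, L, A, C).permute(0, 2, 1, 3)
        w = self.window
        t = x.reshape(B * A * (L // w), w, C)           # attend in windows
        t, _ = self.win_attn(t, t, t)
        return t.reshape(B, A, L, C)

out = AlternatingAttentionPair()(torch.randn(2, 3, 100, 64))
```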
- V2X-Sim: A Virtual Collaborative Perception Dataset for Autonomous Driving
Vehicle-to-everything (V2X) denotes the collaboration between a vehicle and any entity in its surroundings.
We present the V2X-Sim dataset, the first public large-scale collaborative perception dataset in autonomous driving.
arXiv Detail & Related papers (2022-02-17T05:14:02Z)
- VISTA 2.0: An Open, Data-driven Simulator for Multimodal Sensing and Policy Learning for Autonomous Vehicles
We present VISTA, an open source, data-driven simulator that integrates multiple types of sensors for autonomous vehicles.
Using high-fidelity, real-world datasets, VISTA represents and simulates RGB cameras, 3D LiDAR, and event-based cameras.
We demonstrate the ability to train and test perception-to-control policies across each of the sensor types and showcase the power of this approach via deployment on a full-scale autonomous vehicle.
arXiv Detail & Related papers (2021-11-23T18:58:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.