SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving
- URL: http://arxiv.org/abs/2505.16805v1
- Date: Thu, 22 May 2025 15:44:30 GMT
- Title: SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving
- Authors: Xuesong Chen, Linjiang Huang, Tao Ma, Rongyao Fang, Shaoshuai Shi, Hongsheng Li
- Abstract summary: SOLVE is an innovative framework that synergizes Vision-Language Models with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between VLM and E2E components.
- Score: 51.47621083057114
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The integration of Vision-Language Models (VLMs) into autonomous driving systems has shown promise in addressing key challenges such as learning complexity, interpretability, and common-sense reasoning. However, existing approaches often struggle with efficient integration and realtime decision-making due to computational demands. In this paper, we introduce SOLVE, an innovative framework that synergizes VLMs with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between VLM and E2E components. We propose a Trajectory Chain-of-Thought (T-CoT) paradigm, which progressively refines trajectory predictions, reducing uncertainty and improving accuracy. By employing a temporal decoupling strategy, SOLVE achieves efficient cooperation by aligning high-quality VLM outputs with E2E real-time performance. Evaluated on the nuScenes dataset, our method demonstrates significant improvements in trajectory prediction accuracy, paving the way for more robust and reliable autonomous driving systems.
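The abstract names three mechanisms: a shared visual encoder for feature-level knowledge sharing, Trajectory Chain-of-Thought (T-CoT) refinement, and temporal decoupling between the slow VLM and the real-time E2E planner. The sketch below is a minimal, hypothetical illustration of how such a pipeline could be wired together; it is not the paper's implementation, and every module name, tensor shape, and the fixed VLM update ratio are assumptions made for illustration only.

```python
# Hypothetical sketch of the SOLVE idea from the abstract: a shared visual
# encoder feeds both a (slow) VLM branch and a (fast) E2E planner, and the
# planner conditions on the most recently cached VLM proposal (temporal
# decoupling). All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class SharedVisualEncoder(nn.Module):
    """Stand-in backbone producing visual tokens shared by both branches."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # simple patchify

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        tokens = self.proj(images)                    # (B, dim, H/16, W/16)
        return tokens.flatten(2).transpose(1, 2)      # (B, N, dim)


class VLMBranch(nn.Module):
    """Placeholder for the VLM head emitting a coarse trajectory proposal;
    the T-CoT step described in the abstract would refine such proposals."""

    def __init__(self, dim: int = 256, horizon: int = 6):
        super().__init__()
        self.horizon = horizon
        self.head = nn.Linear(dim, horizon * 2)       # (x, y) per waypoint

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        pooled = vis_tokens.mean(dim=1)
        return self.head(pooled).view(-1, self.horizon, 2)


class E2EPlanner(nn.Module):
    """Real-time planner conditioned on the cached VLM proposal."""

    def __init__(self, dim: int = 256, horizon: int = 6):
        super().__init__()
        self.horizon = horizon
        self.fuse = nn.Linear(dim + horizon * 2, dim)
        self.head = nn.Linear(dim, horizon * 2)

    def forward(self, vis_tokens: torch.Tensor, vlm_traj: torch.Tensor) -> torch.Tensor:
        pooled = vis_tokens.mean(dim=1)
        fused = torch.relu(self.fuse(torch.cat([pooled, vlm_traj.flatten(1)], dim=-1)))
        return self.head(fused).view(-1, self.horizon, 2)


encoder, vlm, planner = SharedVisualEncoder(), VLMBranch(), E2EPlanner()
cached_vlm_traj = torch.zeros(1, 6, 2)                # latest (possibly stale) VLM proposal

for step in range(10):                                # simulated camera frames
    frame = torch.randn(1, 3, 224, 224)
    tokens = encoder(frame)                           # shared features for both branches
    if step % 5 == 0:                                 # VLM updates at a lower rate
        cached_vlm_traj = vlm(tokens).detach()
    trajectory = planner(tokens, cached_vlm_traj)     # runs every frame in real time
```

The design point mirrored here is the caching: the E2E planner never waits on the VLM; it simply conditions on the most recent VLM proposal, which is one plausible reading of the temporal decoupling strategy described above.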
Related papers
- VLMPlanner: Integrating Visual Language Models with Motion Planning [18.633637485218802]
VLMPlanner is a hybrid framework that combines a learning-based real-time planner with a vision-language model (VLM) capable of reasoning over raw images. We develop the Context-Adaptive Inference Gate mechanism that enables the VLM to mimic human driving behavior.
arXiv Detail & Related papers (2025-07-27T16:15:21Z)
- LeAD: The LLM Enhanced Planning System Converged with End-to-end Autonomous Driving [48.607991747956255]
We present LeAD, a dual-rate autonomous driving architecture integrating imitation learning-based end-to-end (E2E) frameworks with large language model (LLM) augmentation. Our experimental evaluation in the CARLA Simulator demonstrates LeAD's superior handling of unconventional scenarios, achieving 71 points on the Leaderboard V1 benchmark, with a route completion of 93%.
arXiv Detail & Related papers (2025-07-08T07:58:29Z)
- VLAD: A VLM-Augmented Autonomous Driving Framework with Hierarchical Planning and Interpretable Decision Process [40.3578745624081]
We propose a vision-language autonomous driving model that integrates a fine-tuned Visual Language Model (VLM) with a state-of-the-art end-to-end system. We implement a specialized fine-tuning approach using custom question-answer datasets designed specifically to improve the spatial reasoning capabilities of the model. Our system produces interpretable natural language explanations of driving decisions, thereby increasing the transparency and trustworthiness of the traditionally black-box end-to-end architecture.
arXiv Detail & Related papers (2025-07-02T01:52:40Z)
- Open-Source LLM-Driven Federated Transformer for Predictive IoV Management [1.8024397171920885]
The Federated Prompt-Optimized Traffic Transformer (FPoTT) is a novel framework that leverages open-source Large Language Models for predictive IoV management. FPoTT introduces a dynamic prompt optimization mechanism that iteratively refines textual prompts to enhance trajectory prediction. The architecture employs a dual-layer federated learning paradigm, combining lightweight edge models for real-time inference with cloud-based LLMs to retain global intelligence.
arXiv Detail & Related papers (2025-05-01T16:54:21Z)
- RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving [10.984203470464687]
Vision-language models (VLMs) often suffer from limitations such as inadequate spatial perception and hallucination. We propose a retrieval-augmented decision-making (RAD) framework to enhance VLMs' capabilities to reliably generate meta-actions in autonomous driving scenes. We fine-tune VLMs on a dataset derived from the nuScenes dataset to enhance their spatial perception and bird's-eye view image comprehension capabilities.
arXiv Detail & Related papers (2025-03-18T03:25:57Z)
- DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving [62.62464518137153]
DriveTransformer is a simplified E2E-AD framework designed for ease of scaling up. It is composed of three unified operations: task self-attention, sensor cross-attention, and temporal cross-attention. It achieves state-of-the-art performance on both the simulated closed-loop benchmark Bench2Drive and the real-world open-loop benchmark nuScenes with high FPS.
arXiv Detail & Related papers (2025-03-07T11:41:18Z)
- VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion [5.6565850326929485]
We propose a novel framework that uses Vision-Language Models to enhance training by providing attentional cues. Our method integrates textual representations into Bird's-Eye-View (BEV) features for semantic supervision. We evaluate VLM-E2E on the nuScenes dataset and demonstrate its superiority over state-of-the-art approaches.
arXiv Detail & Related papers (2025-02-25T10:02:12Z)
- Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving [5.456780031044544]
We propose a knowledge distillation method that transfers knowledge from large-scale vision-language foundation models to efficient vision networks. We apply it to pedestrian behavior prediction and scene understanding tasks, achieving promising results in generating more diverse and comprehensive semantic attributes.
arXiv Detail & Related papers (2025-01-12T01:31:07Z)
- VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision [20.43366384946928]
VLM-AD leverages vision-language models (VLMs) as teachers to enhance training. It achieves significant improvements in planning accuracy and reduced collision rates on the nuScenes dataset.
arXiv Detail & Related papers (2024-12-19T01:53:36Z)
- DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Self-Driving [55.53171248839489]
We propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving. Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction and iterative motion planner. Experiments conducted on nuScenes and Bench2Drive datasets demonstrate the superior planning performance and great efficiency of DiFSD.
arXiv Detail & Related papers (2024-09-15T15:55:24Z)
- Semantic Communication for Cooperative Perception using HARQ [51.148203799109304]
We leverage an importance map to distill critical semantic information, introducing a cooperative perception semantic communication framework.
To counter the challenges posed by time-varying multipath fading, our approach incorporates orthogonal frequency-division multiplexing (OFDM) along with channel estimation and equalization strategies.
We introduce a novel semantic error detection method that is integrated with our semantic communication framework in the spirit of hybrid automatic repeat request (HARQ).
arXiv Detail & Related papers (2024-08-29T08:53:26Z)
- V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models [13.716889927164383]
Vehicle-to-everything (V2X) cooperation has emerged as a promising paradigm to overcome the perception limitations of classical autonomous driving. This paper introduces V2X-VLM, a novel end-to-end (E2E) cooperative autonomous driving framework based on vision-language models (VLMs). V2X-VLM integrates multi-perspective camera views from vehicles and infrastructure with text-based scene descriptions to enable a more comprehensive understanding of driving environments.
arXiv Detail & Related papers (2024-08-17T16:42:13Z)
- Unified End-to-End V2X Cooperative Autonomous Driving [21.631099800753795]
UniE2EV2X is a V2X-integrated end-to-end autonomous driving system that consolidates key driving modules within a unified network.
The framework employs a deformable attention-based data fusion strategy, effectively facilitating cooperation between vehicles and infrastructure.
We implement the UniE2EV2X framework on DeepAccident, a challenging simulation dataset designed for V2X cooperative driving.
arXiv Detail & Related papers (2024-05-07T03:01:40Z)
- VLP: Vision Language Planning for Autonomous Driving [52.640371249017335]
This paper presents a novel Vision-Language-Planning framework that exploits language models to bridge the gap between linguistic understanding and autonomous driving.
It achieves state-of-the-art end-to-end planning performance on the nuScenes dataset, reducing average L2 error and collision rate by 35.9% and 60.5%, respectively.
arXiv Detail & Related papers (2024-01-10T23:00:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.