Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving
- URL: http://arxiv.org/abs/2506.05442v1
- Date: Thu, 05 Jun 2025 12:59:35 GMT
- Title: Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving
- Authors: Hao Jiang, Chuan Hu, Yukang Shi, Yuan He, Ke Wang, Xi Zhang, Zhipeng Zhang
- Abstract summary: Vision-Language Models (VLMs) offer a promising approach to end-to-end autonomous driving due to their human-like reasoning capabilities. Existing datasets with loosely formatted language descriptions are not machine-friendly and may introduce redundancy. This paper introduces a structured and concise benchmark dataset, NuScenes-S, which is derived from the NuScenes dataset and contains machine-friendly structured representations.
- Score: 29.019907345552475
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) offer a promising approach to end-to-end autonomous driving due to their human-like reasoning capabilities. However, troublesome gaps remain between current VLMs and real-world autonomous driving applications. One major limitation is that existing datasets with loosely formatted language descriptions are not machine-friendly and may introduce redundancy. Additionally, the high computational cost and massive scale of VLMs hinder inference speed and real-world deployment. To bridge the gap, this paper introduces a structured and concise benchmark dataset, NuScenes-S, which is derived from the NuScenes dataset and contains machine-friendly structured representations. Moreover, we present FastDrive, a compact VLM baseline with 0.9B parameters. In contrast to existing VLMs with over 7B parameters and unstructured language processing (e.g., LLaVA-1.5), FastDrive understands structured and concise descriptions and generates machine-friendly driving decisions with high efficiency. Extensive experiments show that FastDrive achieves competitive performance on the structured dataset, with approximately 20% accuracy improvement on decision-making tasks, while surpassing massive-parameter baselines in inference speed with over 10x speedup. Additionally, ablation studies examine the impact of scene annotations (e.g., weather, time of day) on decision-making, demonstrating their importance for decision-making in autonomous driving.
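The abstract's central idea is replacing loosely formatted captions with machine-friendly structured scene descriptions. The actual NuScenes-S schema is not reproduced here, so every field name in the sketch below (weather, time_of_day, objects, decision) is an assumption made purely for illustration:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class SceneLabel:
    """Hypothetical structured scene label in the spirit of NuScenes-S.
    Field names are illustrative assumptions, not the actual schema."""
    weather: str                                  # e.g. "rain", "clear"
    time_of_day: str                              # e.g. "night", "day"
    objects: list = field(default_factory=list)   # salient agents / hazards
    decision: str = "keep_lane"                   # machine-readable action

label = SceneLabel(
    weather="rain",
    time_of_day="night",
    objects=["pedestrian_crossing_ahead", "lead_vehicle_braking"],
    decision="decelerate",
)

# Unlike a free-form caption, a structured label serializes compactly and can
# be parsed by downstream planning code without NLP post-processing.
print(json.dumps(asdict(label)))
```

A closed vocabulary like this is also what makes the paper's ablations over weather and time-of-day annotations straightforward to set up.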
Related papers
- SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving [51.47621083057114]
SOLVE is an innovative framework that synergizes Vision-Language Models with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between VLM and E2E components.
arXiv Detail & Related papers (2025-05-22T15:44:30Z)
- VERDI: VLM-Embedded Reasoning for Autonomous Driving [33.66777025242027]
We propose VLM-Embedded Reasoning for autonomous Driving (VERDI). VERDI distills the reasoning process and commonsense knowledge of Vision-Language Models into an AD stack. We show that VERDI outperforms existing e2e methods that do not embed reasoning by 10% in $\ell_2$ distance.
arXiv Detail & Related papers (2025-05-21T18:24:36Z)
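For context on the metric in the VERDI entry above: open-loop planners on nuScenes are commonly scored by the average $\ell_2$ distance between predicted and ground-truth future waypoints. A minimal sketch with invented numbers (not VERDI's evaluation code):

```python
import numpy as np

# Invented (x, y) waypoints over a 3-step horizon; not VERDI's actual data.
pred = np.array([[0.0, 1.0], [0.1, 2.1], [0.3, 3.2]])  # predicted trajectory
gt = np.array([[0.0, 1.0], [0.0, 2.0], [0.2, 3.0]])    # ground truth

# Per-waypoint Euclidean distance, averaged over the horizon.
l2 = np.linalg.norm(pred - gt, axis=1).mean()
print(f"average l2 distance: {l2:.3f} m")
```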
- OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning [68.45848423501927]
We propose a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. Our approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions.
arXiv Detail & Related papers (2025-04-06T03:54:21Z)
- DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding [76.3876070043663]
We propose DriveLMM-o1, a dataset and benchmark designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model.
arXiv Detail & Related papers (2025-03-13T17:59:01Z)
- DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving [62.62464518137153]
DriveTransformer is a simplified E2E-AD framework designed for ease of scaling up. It is composed of three unified operations: task self-attention, sensor cross-attention, and temporal cross-attention. It achieves state-of-the-art performance on both the simulated closed-loop benchmark Bench2Drive and the real-world open-loop benchmark nuScenes with high FPS.
arXiv Detail & Related papers (2025-03-07T11:41:18Z)
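A rough sketch of how the three unified operations in the DriveTransformer entry could compose inside one block. The ordering, dimensions, and residual/norm layout are assumptions; the summary above does not specify them:

```python
import torch
import torch.nn as nn

class DriveTransformerBlock(nn.Module):
    """Hypothetical block combining the three unified operations named
    above; layout and dimensions are assumptions, not the paper's code."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.task_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sensor_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))

    def forward(self, task_queries, sensor_tokens, history_tokens):
        # Task queries (detection, mapping, planning, ...) attend to each other,
        x, _ = self.task_self_attn(task_queries, task_queries, task_queries)
        x = self.norms[0](task_queries + x)
        # ... then gather evidence from current sensor features,
        y, _ = self.sensor_cross_attn(x, sensor_tokens, sensor_tokens)
        x = self.norms[1](x + y)
        # ... and finally from cached features of past frames.
        z, _ = self.temporal_cross_attn(x, history_tokens, history_tokens)
        return self.norms[2](x + z)

# Illustrative shapes: 16 task queries, 1024 sensor tokens, 64 history tokens.
block = DriveTransformerBlock()
out = block(torch.randn(2, 16, 256), torch.randn(2, 1024, 256), torch.randn(2, 64, 256))
```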
- VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision [20.43366384946928]
VLM-AD leverages vision-language models (VLMs) as teachers to enhance training. VLM-AD achieves significant improvements in planning accuracy and reduced collision rates on the nuScenes dataset.
arXiv Detail & Related papers (2024-12-19T01:53:36Z)
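The VLM-AD entry only states that VLMs act as teachers during training. One generic way to realize teacher supervision, assumed here for illustration and not claimed to be VLM-AD's actual objective, is an auxiliary loss pulling planner features toward frozen VLM features:

```python
import torch
import torch.nn.functional as F

def alignment_loss(planner_feat: torch.Tensor, vlm_feat: torch.Tensor) -> torch.Tensor:
    # Cosine-distance alignment between student (planner) and teacher (VLM)
    # features; a generic distillation term, not VLM-AD's published loss.
    return 1.0 - F.cosine_similarity(planner_feat, vlm_feat, dim=-1).mean()

planner_feat = torch.randn(4, 512)   # student features (invented shapes)
vlm_feat = torch.randn(4, 512)       # frozen VLM teacher features
task_loss = torch.tensor(0.7)        # e.g. a trajectory regression loss
total = task_loss + 0.1 * alignment_loss(planner_feat, vlm_feat)  # 0.1: assumed weight
```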
- LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement [4.534832757549232]
We introduce LaVida Drive, a novel and efficient VQA framework for autonomous driving. LaVida Drive seamlessly integrates temporal data while maintaining high-resolution inputs for detailed visual perception. It optimizes spatial processing by retaining high-resolution data for intricate details and using lower-resolution inputs for temporal analysis.
arXiv Detail & Related papers (2024-11-20T02:14:07Z)
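The dual-resolution design in the LaVida Drive entry can be pictured as query-driven token selection: keep only the high-resolution tokens most relevant to the question, and pool the rest for temporal analysis. The scoring rule and pooling below are assumptions, not the paper's mechanism:

```python
import torch

def select_tokens(hi_res_tokens: torch.Tensor, query_embed: torch.Tensor, k: int = 64):
    # Relevance as a dot product between each visual token and the query;
    # an assumed stand-in for whatever scoring LaVida Drive actually uses.
    scores = hi_res_tokens @ query_embed        # (num_tokens,)
    return hi_res_tokens[scores.topk(k).indices]

hi_res = torch.randn(1024, 256)                 # tokens of one high-res frame
query = torch.randn(256)                        # pooled text-query embedding
detail = select_tokens(hi_res, query, k=64)     # kept for fine spatial detail
temporal = hi_res.mean(dim=0, keepdim=True)     # crude low-res summary per frame
```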
- Efficient Driving Behavior Narration and Reasoning on Edge Device Using Large Language Models [16.532357621144342]
Large language models (LLMs) can describe driving scenes and behaviors with a level of accuracy similar to human perception.
We propose a driving behavior narration and reasoning framework that applies LLMs to edge devices.
Our experiments show that LLMs deployed on edge devices can achieve satisfactory response speeds.
arXiv Detail & Related papers (2024-09-30T15:03:55Z)
- DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Self-Driving [55.53171248839489]
We propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving. Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction, and an iterative motion planner. Experiments conducted on the nuScenes and Bench2Drive datasets demonstrate the superior planning performance and efficiency of DiFSD.
arXiv Detail & Related papers (2024-09-15T15:55:24Z)
- Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving [38.28159034562901]
Reason2Drive is a benchmark dataset with over 600K video-text pairs.
We characterize the autonomous driving process as a sequential combination of perception, prediction, and reasoning steps.
We introduce a novel aggregated evaluation metric to assess chain-based reasoning performance in autonomous systems.
arXiv Detail & Related papers (2023-12-06T18:32:33Z)
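The Reason2Drive entry mentions an aggregated metric for chain-based reasoning without defining it. As a loudly hypothetical toy, one could score each step of the perception → prediction → reasoning chain and average the per-step scores:

```python
# Toy aggregated chain score: grade each step of the chain, then average.
# The token-overlap scorer is invented for illustration; it is not
# Reason2Drive's actual metric.

def step_score(pred: str, ref: str) -> float:
    p, r = set(pred.split()), set(ref.split())
    return len(p & r) / max(len(r), 1)

def chain_score(pred_steps: list, ref_steps: list) -> float:
    return sum(step_score(p, r) for p, r in zip(pred_steps, ref_steps)) / len(ref_steps)

pred = ["a pedestrian is ahead", "they will likely cross", "the car should slow down"]
ref = ["a pedestrian is ahead", "the pedestrian may cross", "the car should slow down"]
print(f"chain score: {chain_score(pred, ref):.2f}")
```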
This list is automatically generated from the titles and abstracts of the papers on this site.