LLaViDA: A Large Language Vision Driving Assistant for Explicit Reasoning and Enhanced Trajectory Planning
- URL: http://arxiv.org/abs/2512.18211v1
- Date: Sat, 20 Dec 2025 04:38:35 GMT
- Title: LLaViDA: A Large Language Vision Driving Assistant for Explicit Reasoning and Enhanced Trajectory Planning
- Authors: Yudong Liu, Spencer Hallyburton, Jiwoo Kim, Yueqian Lin, Yiming Li, Qinsi Wang, Hui Ye, Jingwei Sun, Miroslav Pajic, Yiran Chen, Hai Li
- Abstract summary: Trajectory planning is a fundamental yet challenging component of autonomous driving. We propose LLaViDA, a Large Language Vision Driving Assistant that leverages a Vision-Language Model (VLM) for object motion prediction. On the NuScenes benchmark, LLaViDA surpasses state-of-the-art end-to-end and other recent VLM/LLM-based baselines in the open-loop trajectory planning task.
- Score: 28.59507336524504
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Trajectory planning is a fundamental yet challenging component of autonomous driving. End-to-end planners frequently falter under adverse weather, unpredictable human behavior, or complex road layouts, primarily because they lack strong generalization or few-shot capabilities beyond their training data. We propose LLaViDA, a Large Language Vision Driving Assistant that leverages a Vision-Language Model (VLM) for object motion prediction, semantic grounding, and chain-of-thought reasoning for trajectory planning in autonomous driving. A two-stage training pipeline--supervised fine-tuning followed by Trajectory Preference Optimization (TPO)--enhances scene understanding and trajectory planning by injecting regression-based supervision, producing a powerful "VLM Trajectory Planner for Autonomous Driving." On the NuScenes benchmark, LLaViDA surpasses state-of-the-art end-to-end and other recent VLM/LLM-based baselines in the open-loop trajectory planning task, achieving an average L2 trajectory error of 0.31 m and a collision rate of 0.10% on the NuScenes test set. The code for this paper is available on GitHub.
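The reported 0.31 m figure refers to the standard nuScenes open-loop planning metric. As a rough illustration only (not taken from the LLaViDA paper), the sketch below averages the L2 distance between predicted and ground-truth BEV waypoints at the 1 s, 2 s, and 3 s horizons; the 2 Hz sampling, horizon choices, and array shapes are assumptions based on common UniAD-style evaluation practice.

```python
# Minimal sketch of nuScenes-style open-loop L2 evaluation (assumed protocol,
# not confirmed by the LLaViDA paper): trajectories are 2 Hz BEV waypoints
# over a 3 s horizon, i.e. arrays of shape (N, 6, 2) in meters.
import numpy as np

def average_l2_error(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Average L2 error at 1 s / 2 s / 3 s horizons and their mean.

    pred, gt: (N, 6, 2) arrays of (x, y) waypoints sampled every 0.5 s.
    """
    assert pred.shape == gt.shape and pred.shape[1:] == (6, 2)
    # Per-waypoint Euclidean distance, shape (N, 6).
    dists = np.linalg.norm(pred - gt, axis=-1)
    # Horizon t seconds corresponds to waypoint index 2*t - 1 (0-based).
    out = {f"L2@{t}s": dists[:, 2 * t - 1].mean() for t in (1, 2, 3)}
    out["L2_avg"] = float(np.mean(list(out.values())))
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(8, 6, 2))
    pred = gt + 0.1 * rng.normal(size=gt.shape)
    print(average_l2_error(pred, gt))
```

Collision rate is computed separately by checking each predicted waypoint against the occupancy of other agents, which requires scene annotations and is omitted here.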
Related papers
- Align2Act: Instruction-Tuned Models for Human-Aligned Autonomous Driving [0.0]
We propose Align2Act, a motion planning framework that transforms instruction-tuned language models into interpretable planners aligned with human behavior. By fine-tuning LLaMA-2-7B with LoRA on one million scenarios from the nuPlan dataset, our method achieves an open-loop score of 85.17 and closed-loop scores of 70.31 and 66.96 on Test14-random (a hedged LoRA configuration sketch appears after this list). Unlike prior work focused on synthetic or open-loop settings, we demonstrate improved planning quality and human-likeness on the real-world nuPlan closed-loop benchmark.
arXiv Detail & Related papers (2025-10-12T08:50:34Z) - VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning [77.34267241692706]
Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLM) to directly translate egocentric video streams into continuous navigation actions.
arXiv Detail & Related papers (2025-06-20T17:59:59Z) - Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving [19.48508500497233]
We present Poutine, a method that uses an off-the-shelf vision-language model (VLM) to achieve robust end-to-end autonomous driving. To learn strong base driving capabilities, we first train Poutine-Base using self-supervised next-token prediction over vision, language, and trajectory (VLT) tokens. The final Poutine model achieves an RFS of 7.99 on the test set, placing 1st in the 2025 Vision-Based End-to-End Driving Challenge.
arXiv Detail & Related papers (2025-06-12T19:14:00Z) - ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving [49.07731497951963]
ReCogDrive is a novel Reinforced Cognitive framework for end-to-end autonomous driving. We introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers. We then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner.
arXiv Detail & Related papers (2025-06-09T03:14:04Z) - LiloDriver: A Lifelong Learning Framework for Closed-loop Motion Planning in Long-tail Autonomous Driving Scenarios [23.913788819453796]
LiloDriver is a lifelong learning framework for closed-loop motion planning in long-tail autonomous driving scenarios. It features a four-stage architecture including perception, scene encoding, memory-based strategy refinement, and LLM-guided reasoning. Our results highlight the effectiveness of combining structured memory and LLM reasoning to enable scalable, human-like motion planning in real-world autonomous driving.
arXiv Detail & Related papers (2025-05-22T18:33:08Z) - Distilling Multi-modal Large Language Models for Autonomous Driving [64.63127269187814]
Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events. We propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM. Training with DiMA results in a 37% reduction in the L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory error reduction in long-tail scenarios.
arXiv Detail & Related papers (2025-01-16T18:59:53Z) - Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving [20.33096710167997]
A generative planning model with 3D-vision language pre-training, named GPVL, is proposed for end-to-end autonomous driving. A cross-modal language model is introduced to generate holistic driving decisions and fine-grained trajectories. It is believed that the effective, robust and efficient performance of GPVL is crucial for the practical application of future autonomous driving systems.
arXiv Detail & Related papers (2025-01-15T15:20:46Z) - DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Self-Driving [55.53171248839489]
We propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving. Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction and an iterative motion planner. Experiments conducted on the nuScenes and Bench2Drive datasets demonstrate the superior planning performance and great efficiency of DiFSD.
arXiv Detail & Related papers (2024-09-15T15:55:24Z) - Enhancing End-to-End Autonomous Driving with Latent World Model [78.22157677787239]
We propose a novel self-supervised learning approach using the LAtent World model (LAW) for end-to-end driving. LAW predicts future scene features based on current features and ego trajectories. This self-supervised task can be seamlessly integrated into perception-free and perception-based frameworks.
arXiv Detail & Related papers (2024-06-12T17:59:21Z) - VLP: Vision Language Planning for Autonomous Driving [52.640371249017335]
This paper presents a novel Vision-Language-Planning framework that exploits language models to bridge the gap between linguistic understanding and autonomous driving.
It achieves state-of-the-art end-to-end planning performance on the NuScenes dataset, reducing average L2 error and collision rate by 35.9% and 60.5%, respectively.
arXiv Detail & Related papers (2024-01-10T23:00:40Z) - LLM-Assist: Enhancing Closed-Loop Planning with Language-Based Reasoning [65.86754998249224]
We develop a novel hybrid planner that leverages a conventional rule-based planner in conjunction with an LLM-based planner.
Our approach navigates complex scenarios with which existing planners struggle, and produces well-reasoned outputs while remaining grounded by working alongside the rule-based approach.
arXiv Detail & Related papers (2023-12-30T02:53:45Z)
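As referenced in the Align2Act entry above, the following is a minimal, hypothetical sketch of attaching LoRA adapters to LLaMA-2-7B with the Hugging Face peft library. The rank, alpha, dropout, and target modules are placeholder values, not the settings reported by Align2Act, and the dataset and training loop are omitted.

```python
# Hypothetical LoRA setup in the spirit of the Align2Act entry; illustrative
# only, not the paper's actual configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; assumes access
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Low-rank adapters on the attention projections; hyperparameters are
# placeholders chosen for illustration.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```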