Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion
- URL: http://arxiv.org/abs/2602.20577v1
- Date: Tue, 24 Feb 2026 05:59:10 GMT
- Title: Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion
- Authors: Jiaru Zhang, Manav Gagvani, Can Cui, Juntong Peng, Ruqi Zhang, Ziran Wang
- Abstract summary: Masked Vision-Language-Action Diffusion for Autonomous Driving (MVLAD-AD) is a novel framework designed to bridge the gap between efficient planning and semantic explainability. We introduce a discrete action tokenization strategy that constructs a compact codebook of kinematically feasible waypoints from real-world driving distributions. Experiments on nuScenes and derived benchmarks demonstrate that MVLAD-AD achieves superior efficiency and outperforms state-of-the-art autoregressive and diffusion baselines in planning precision.
- Score: 23.834662472392694
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) have emerged as promising candidates for end-to-end autonomous driving. However, these models typically face challenges in inference latency, action precision, and explainability. Existing autoregressive approaches struggle with slow token-by-token generation, while prior diffusion-based planners often rely on verbose, general-purpose language tokens that lack explicit geometric structure. In this work, we propose Masked Vision-Language-Action Diffusion for Autonomous Driving (MVLAD-AD), a novel framework designed to bridge the gap between efficient planning and semantic explainability via a masked vision-language-action diffusion model. Unlike methods that force actions into the language space, we introduce a discrete action tokenization strategy that constructs a compact codebook of kinematically feasible waypoints from real-world driving distributions. Moreover, we propose geometry-aware embedding learning to ensure that embeddings in the latent space approximate physical geometric metrics. Finally, an action-priority decoding strategy is introduced to prioritize trajectory generation. Extensive experiments on nuScenes and derived benchmarks demonstrate that MVLAD-AD achieves superior efficiency and outperforms state-of-the-art autoregressive and diffusion baselines in planning precision, while providing high-fidelity and explainable reasoning.
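The abstract's discrete action tokenization (a compact codebook of kinematically feasible waypoints learned from real driving distributions) can be illustrated with a minimal sketch. The paper's actual construction is not specified here; k-means over flattened waypoint trajectories is only one plausible instantiation, and the toy "driving log" below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "driving log": 500 trajectories, each 6 future (x, y) waypoints,
# generated as cumulative forward motion with lateral noise.
steps = rng.normal(loc=(1.0, 0.0), scale=0.3, size=(500, 6, 2))
trajectories = np.cumsum(steps, axis=1)
flat = trajectories.reshape(len(trajectories), -1)  # shape (500, 12)

def kmeans(x, k, iters=20):
    """Plain k-means; the returned centers form the action codebook."""
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(x[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

codebook = kmeans(flat, k=32)  # 32 discrete action tokens

# Tokenization: each trajectory maps to the index of its nearest code,
# so planning becomes prediction over a small discrete vocabulary.
tokens = np.linalg.norm(flat[:, None] - codebook[None], axis=-1).argmin(axis=1)
print(codebook.shape)  # (32, 12)
print(tokens.shape)    # (500,)
```

Because every code is the mean of observed trajectory segments, the resulting vocabulary stays close to kinematically plausible motion, which is the property the paper's codebook is designed to enforce.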
Related papers
- FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models [20.47311573790516]
We propose FRISM, a fine-grained reasoning injection framework based on subspace-level model merging. Experiments demonstrate that FRISM effectively improves reasoning capabilities without compromising the model's original visual capabilities.
arXiv Detail & Related papers (2026-01-29T02:36:19Z)
- SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving [52.02379432801349]
We propose SGDrive, a novel framework that structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition.
arXiv Detail & Related papers (2026-01-09T08:55:42Z)
- LatentVLA: Efficient Vision-Language Models for Autonomous Driving via Latent Action Prediction [19.57998167905048]
End-to-end autonomous driving models trained on large-scale datasets perform well in common scenarios but struggle with rare, long-tail situations. Recent Vision-Language-Action (VLA) models leverage broad knowledge from pre-trained vision models to address this limitation. We propose LatentVLA, a novel framework that employs self-supervised latent action prediction to train VLA models without language annotations.
arXiv Detail & Related papers (2026-01-09T08:06:44Z)
- Stable Language Guidance for Vision-Language-Action Models [62.80963701282789]
Residual Semantic Steering (RSS) is a probabilistic framework that disentangles physical affordance from semantic execution. RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
arXiv Detail & Related papers (2026-01-07T16:16:10Z)
- WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving [9.719456684859606]
WAM-Diff is a framework that employs masked diffusion to refine a discrete sequence representing future ego-trajectories. Our model achieves 91.0 PDMS on NAVSIM-v1 and 89.7 on NAVSIM-v2, demonstrating the effectiveness of masked diffusion for autonomous driving.
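The masked-diffusion decoding idea shared by WAM-Diff and MVLAD-AD can be sketched in a few lines: start from a fully masked token sequence and iteratively commit the most confident positions instead of generating token by token. The `score` function below is a placeholder for a learned network; WAM-Diff's actual model, MoE routing, and reinforcement-learning stage are not reproduced here:

```python
MASK = -1
VOCAB = list(range(8))   # toy trajectory-token vocabulary
SEQ_LEN = 6

def score(seq, pos):
    """Placeholder confidence model: prefers token = position index."""
    best = pos % len(VOCAB)
    conf = 1.0 - 0.1 * seq.count(MASK)  # more context, more confidence
    return best, conf

def decode(steps=3):
    """Unmask the sequence over a fixed number of refinement rounds."""
    seq = [MASK] * SEQ_LEN
    per_step = SEQ_LEN // steps
    for _ in range(steps):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        scored = [(score(seq, i), i) for i in masked]
        scored.sort(key=lambda s: -s[0][1])   # most confident first
        for (tok, _), i in scored[:per_step]:  # commit this round's picks
            seq[i] = tok
    return seq

print(decode())  # → [0, 1, 2, 3, 4, 5]
```

The practical appeal is latency: the whole trajectory is produced in `steps` parallel rounds rather than `SEQ_LEN` sequential autoregressive steps.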
arXiv Detail & Related papers (2025-12-06T10:51:53Z)
- Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving [55.13109926181247]
We introduce ReflectDrive, a learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion. Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradients. Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors.
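A gradient-free reflection loop of this kind can be sketched as: propose a trajectory, check a safety predicate, and re-sample only the offending waypoints rather than backpropagating. ReflectDrive's actual safety checker and discrete-diffusion sampler are assumptions here; the obstacle and lane positions below are toy values:

```python
import random

random.seed(0)
OBSTACLE = 3           # hypothetical unsafe lane position

def propose(n=5):
    """Stand-in for the diffusion sampler: random lane positions."""
    return [random.randint(0, 4) for _ in range(n)]

def unsafe(traj):
    """Stand-in safety checker: flag waypoints on the obstacle."""
    return [i for i, p in enumerate(traj) if p == OBSTACLE]

def reflect(max_rounds=10):
    """Iterative self-correction: resample flagged waypoints only."""
    traj = propose()
    for _ in range(max_rounds):
        bad = unsafe(traj)
        if not bad:
            break
        for i in bad:
            traj[i] = random.choice([0, 1, 2, 4])  # safe alternatives
    return traj

print(reflect())  # a 5-waypoint trajectory avoiding the obstacle
```

Because correction happens by re-sampling under a constraint check, no gradient computation through the planner is needed at inference time, which matches the "self-correction without gradients" framing.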
arXiv Detail & Related papers (2025-09-24T13:35:15Z)
- ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving [64.12414815634847]
Vision-Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of this challenge. We propose ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer.
arXiv Detail & Related papers (2025-08-15T12:06:55Z)
- ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving [49.07731497951963]
ReCogDrive is a novel Reinforced Cognitive framework for end-to-end autonomous driving. We introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers. We then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner.
arXiv Detail & Related papers (2025-06-09T03:14:04Z)
- Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space [74.12387631212609]
We introduce SLED, an alternative approach to speech language modeling by encoding speech waveforms into sequences of continuous latent representations. SLED avoids discretization errors and eliminates the need for the complicated hierarchical architectures common in existing speech language models. Empirical results demonstrate that SLED achieves strong performance in both zero-shot and streaming speech synthesis.
arXiv Detail & Related papers (2025-05-19T14:38:59Z)
- Latent Diffusion Planning for Imitation Learning [78.56207566743154]
Latent Diffusion Planning (LDP) is a modular approach consisting of a planner and inverse dynamics model. By separating planning from action prediction, LDP can benefit from the denser supervision signals of suboptimal and action-free data. On simulated visual robotic manipulation tasks, LDP outperforms state-of-the-art imitation learning approaches.
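The planner / inverse-dynamics split can be illustrated with a minimal sketch: one model predicts a sequence of future latent states, and a separate model recovers the action between consecutive states. Both "models" below are linear stand-ins, not LDP's diffusion networks, and the latent dimension is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4                                  # toy latent state dimension

def planner(z0, horizon=5):
    """Toy planner: drift the latent state by a fixed direction per step."""
    drift = np.full(D, 0.1)
    return np.stack([z0 + drift * t for t in range(1, horizon + 1)])

def inverse_dynamics(z_t, z_next):
    """Toy inverse dynamics: recover the action as the latent difference."""
    return z_next - z_t

z0 = rng.normal(size=D)
plan = planner(z0)                     # (5, D) future latent states
actions = [inverse_dynamics(a, b)      # one action per transition
           for a, b in zip([z0, *plan[:-1]], plan)]
print(len(actions), actions[0])
```

The separation is what lets action-free data train the planner alone: state sequences supervise `planner`, while only action-labeled transitions are needed for `inverse_dynamics`.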
arXiv Detail & Related papers (2025-04-23T17:53:34Z)
- DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Self-Driving [55.53171248839489]
We propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving. Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction, and an iterative motion planner. Experiments conducted on the nuScenes and Bench2Drive datasets demonstrate the superior planning performance and great efficiency of DiFSD.
arXiv Detail & Related papers (2024-09-15T15:55:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.