DriveFine: Refining-Augmented Masked Diffusion VLA for Precise and Robust Driving
- URL: http://arxiv.org/abs/2602.14577v1
- Date: Mon, 16 Feb 2026 09:13:52 GMT
- Title: DriveFine: Refining-Augmented Masked Diffusion VLA for Precise and Robust Driving
- Authors: Chenxu Dang, Sining Ang, Yongkang Li, Haochen Tian, Jie Wang, Guang Li, Hangjun Ye, Jie Ma, Long Chen, Yan Wang
- Abstract summary: Diffusion-based planners suffer from modality alignment difficulties, low training efficiency, and limited generalization. Token-based planners are plagued by cumulative causal errors and irreversible decoding. We propose DriveFine, a masked diffusion VLA model that combines flexible decoding with self-correction capabilities.
- Score: 14.800134964871875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language-Action (VLA) models for autonomous driving increasingly adopt generative planners trained with imitation learning followed by reinforcement learning. Diffusion-based planners suffer from modality alignment difficulties, low training efficiency, and limited generalization. Token-based planners are plagued by cumulative causal errors and irreversible decoding. In summary, the two dominant paradigms exhibit complementary strengths and weaknesses. In this paper, we propose DriveFine, a masked diffusion VLA model that combines flexible decoding with self-correction capabilities. In particular, we design a novel plug-and-play block-MoE, which seamlessly injects a refinement expert on top of the generation expert. By enabling explicit expert selection during inference and gradient blocking during training, the two experts are fully decoupled, preserving the foundational capabilities and generic patterns of the pretrained weights, which highlights the flexibility and extensibility of the block-MoE design. Furthermore, we design a hybrid reinforcement learning strategy that encourages effective exploration of the refinement expert while maintaining training stability. Extensive experiments on NAVSIM v1, v2, and Navhard benchmarks demonstrate that DriveFine exhibits strong efficacy and robustness. The code will be released at https://github.com/MSunDYY/DriveFine.
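The abstract's two decoupling mechanisms, explicit expert selection at inference and gradient blocking during training, can be illustrated with a toy example. The sketch below is not the paper's block-MoE layer, which the abstract does not specify. It assumes two scalar "experts", and every name in it (`ToyBlockMoE`, `train_refinement_step`, `demo`) is hypothetical. It only shows that, when the generation expert's output is treated as a constant, training the refinement expert leaves the pretrained generation weight untouched.

```python
from dataclasses import dataclass


@dataclass
class ToyBlockMoE:
    """Two scalar 'experts' in one block (hypothetical toy, not the paper's layer)."""
    w_gen: float = 0.8  # stands in for pretrained generation-expert weights
    w_ref: float = 0.0  # refinement expert starts as an identity residual

    def generate(self, x: float) -> float:
        return self.w_gen * x

    def refine(self, draft: float) -> float:
        # Residual correction applied on top of the generation expert's draft.
        return draft + self.w_ref * draft

    def forward(self, x: float, expert: str = "generation") -> float:
        # Explicit expert selection: the caller names the expert, no learned router.
        draft = self.generate(x)
        if expert == "generation":
            return draft
        if expert == "refinement":
            return self.refine(draft)
        raise ValueError(f"unknown expert: {expert}")


def train_refinement_step(model: ToyBlockMoE, x: float, target: float,
                          lr: float = 0.1) -> float:
    """One SGD step on the refinement expert only; returns the pre-update loss."""
    # Gradient blocking: the draft is held constant, so no gradient reaches
    # w_gen and the generation expert keeps its pretrained value.
    draft = model.generate(x)
    error = model.refine(draft) - target
    loss = error ** 2
    grad_w_ref = 2.0 * error * draft  # d(loss)/d(w_ref) with draft held fixed
    model.w_ref -= lr * grad_w_ref
    return loss


def demo():
    model = ToyBlockMoE()
    w_gen_before = model.w_gen
    losses = [train_refinement_step(model, x=1.0, target=1.0) for _ in range(50)]
    return model, w_gen_before, losses
```

Calling `model.forward(x, expert="generation")` after training returns the same draft as before training, while `expert="refinement"` returns the corrected output. In an autodiff framework the same effect is usually obtained with a stop-gradient or detach operation on the draft instead of a hand-written gradient.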
Related papers
- MindDrive: A Vision-Language-Action Model for Autonomous Driving via Online Reinforcement Learning [51.20229133553804]
Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL). Online Reinforcement Learning offers a promising pathway to address these issues through trial-and-error learning. We propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions.
arXiv Detail & Related papers (2025-12-15T18:31:32Z)
- DiffusionDriveV2: Reinforcement Learning-Constrained Truncated Diffusion Modeling in End-to-End Autonomous Driving [65.7087560656003]
Generative diffusion models for end-to-end autonomous driving often suffer from mode collapse. We propose DiffusionDriveV2, which leverages reinforcement learning to constrain low-quality modes and explore for superior trajectories. This significantly enhances the overall output quality while preserving the inherent multimodality of its core Gaussian Mixture Model.
arXiv Detail & Related papers (2025-12-08T17:29:52Z)
- Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving [55.13109926181247]
We introduce ReflectDrive, a learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion. Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient computation. Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors.
arXiv Detail & Related papers (2025-09-24T13:35:15Z)
- Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories [58.988535279557546]
We introduce SMART (Sycophancy Mitigation through Adaptive Reasoning Trajectories). We show that SMART significantly reduces sycophantic behavior while preserving strong performance on out-of-distribution inputs.
arXiv Detail & Related papers (2025-09-20T17:09:14Z)
- Breaking Imitation Bottlenecks: Reinforced Diffusion Powers Diverse Trajectory Generation [20.106116218594266]
DIVER is an end-to-end autonomous driving framework that integrates reinforcement learning and diffusion-based generation. We show that DIVER significantly improves trajectory diversity, effectively addressing the mode collapse problem inherent in imitation learning.
arXiv Detail & Related papers (2025-07-05T14:19:19Z)
- ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving [49.07731497951963]
ReCogDrive is a novel Reinforced Cognitive framework for end-to-end autonomous driving. We introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers. We then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner.
arXiv Detail & Related papers (2025-06-09T03:14:04Z)
- Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better [58.559985503802054]
Vision-language-action (VLA) models combine end-to-end learning with transfer of semantic knowledge from web-scale vision-language model (VLM) training. The most powerful VLMs have tens or hundreds of billions of parameters, presenting an obstacle to real-time inference. Recent VLA models have used specialized modules for efficient continuous control, such as action experts or continuous output heads. We show that naively including such experts significantly harms both training speed and knowledge transfer.
arXiv Detail & Related papers (2025-05-29T17:40:09Z)
- Learning Soft Driving Constraints from Vectorized Scene Embeddings while Imitating Expert Trajectories [16.666811573117613]
The primary goal of motion planning is to generate safe and efficient trajectories for vehicles. Traditionally, motion planning models are trained using imitation learning to mimic the behavior of human experts. We propose a method that integrates constraint learning into imitation learning by extracting driving constraints from expert trajectories.
arXiv Detail & Related papers (2024-12-07T18:29:28Z)
- Boosting Offline Reinforcement Learning for Autonomous Driving with Hierarchical Latent Skills [37.31853034449015]
We present a skill-based framework that enhances offline RL to overcome the long-horizon vehicle planning challenge. Specifically, we design a variational autoencoder (VAE) to learn skills from offline demonstrations. To mitigate posterior collapse of common VAEs, we introduce a two-branch sequence encoder to capture both discrete options and continuous variations of the complex driving skills.
arXiv Detail & Related papers (2023-09-24T11:51:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.