Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving
- URL: http://arxiv.org/abs/2601.21288v1
- Date: Thu, 29 Jan 2026 05:41:24 GMT
- Title: Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving
- Authors: Weitong Lian, Zecong Tang, Haoran Li, Tianjian Gao, Yifei Wang, Zixu Wang, Lingyi Meng, Tengju Ru, Zhejun Cui, Yichen Zhu, Hangshuo Cao, Qi Kang, Tianxing Chen, Yusen Qin, Kaixuan Wang, Yu Zhang,
- Abstract summary: We propose Drive-KD, a framework that decomposes autonomous driving into a "perception-reasoning-planning" triad. We identify layer-specific attention as the distillation signal to construct capability-specific single-teacher models that outperform baselines. Experiments show that our distilled InternVL3-1B model, with 42 times less GPU memory and 11.4 times higher throughput, achieves better overall performance than the pretrained 78B model.
- Score: 26.97190983537793
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autonomous driving is an important and safety-critical task, and recent advances in LLMs/VLMs have opened new possibilities for reasoning and planning in this domain. However, large models demand substantial GPU memory and exhibit high inference latency, while conventional supervised fine-tuning (SFT) often struggles to bridge the capability gaps of small models. To address these limitations, we propose Drive-KD, a framework that decomposes autonomous driving into a "perception-reasoning-planning" triad and transfers these capabilities via knowledge distillation. We identify layer-specific attention as the distillation signal to construct capability-specific single-teacher models that outperform baselines. Moreover, we unify these single-teacher settings into a multi-teacher distillation framework and introduce asymmetric gradient projection to mitigate cross-capability gradient conflicts. Extensive evaluations validate the generalization of our method across diverse model families and scales. Experiments show that our distilled InternVL3-1B model, with ~42 times less GPU memory and ~11.4 times higher throughput, achieves better overall performance than the pretrained 78B model from the same family on DriveBench, and surpasses GPT-5.1 on the planning dimension, providing insights toward efficient autonomous driving VLMs.
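The abstract names two mechanisms without spelling them out: attention maps as the distillation signal, and a gradient projection that resolves conflicts between capabilities. The sketch below is only an illustration of those general ideas, not the authors' implementation. It assumes the attention signal is compared with a KL divergence between one teacher attention row and one student attention row, and it reads "asymmetric" as "only the lower-priority gradient is modified when two gradients conflict" (a one-sided variant of PCGrad-style projection). The function names `attention_kl` and `project_asymmetric`, and the choice of which capability is primary, are hypothetical.

```python
import math


def attention_kl(teacher_attn, student_attn, eps=1e-12):
    """KL(teacher || student) for one attention row.

    Both inputs are assumed to be probability distributions over the
    same tokens (non-negative, summing to 1).
    """
    return sum(
        t * (math.log(t + eps) - math.log(s + eps))
        for t, s in zip(teacher_attn, student_attn)
    )


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_asymmetric(g_primary, g_secondary):
    """Resolve a conflict by changing only the secondary gradient.

    If the two gradients point in conflicting directions (negative dot
    product), the component of g_secondary along g_primary is removed.
    g_primary is always returned unchanged (the assumed asymmetry).
    """
    d = dot(g_primary, g_secondary)
    norm_sq = dot(g_primary, g_primary)
    if d >= 0 or norm_sq == 0:
        return list(g_primary), list(g_secondary)
    scale = d / norm_sq
    adjusted = [s - scale * p for s, p in zip(g_secondary, g_primary)]
    return list(g_primary), adjusted


# Toy example: a "planning" gradient treated as primary (an assumption)
# and a conflicting "perception" gradient treated as secondary.
g_plan = [1.0, 0.0]
g_perc = [-1.0, 1.0]
_, g_perc_adj = project_asymmetric(g_plan, g_perc)
combined = [a + b for a, b in zip(g_plan, g_perc_adj)]

# Attention-distillation loss for a matching and a mismatching student row.
loss_match = attention_kl([0.5, 0.5], [0.5, 0.5])
loss_mismatch = attention_kl([0.9, 0.1], [0.5, 0.5])
```

In this toy case the adjusted perception gradient is orthogonal to the planning gradient, so the combined update no longer works against the primary capability. In a real training loop the vectors would be flattened parameter gradients from each teacher's loss, and the paper should be consulted for the actual projection rule and layer selection.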
Related papers
- DriveFine: Refining-Augmented Masked Diffusion VLA for Precise and Robust Driving [14.800134964871875]
Diffusion-based planners suffer from modality alignment difficulties, low training efficiency, and limited generalization. Token-based planners are plagued by cumulative causal errors and irreversible decoding. We propose DriveFine, a masked diffusion VLA model that combines flexible decoding with self-correction capabilities.
arXiv Detail & Related papers (2026-02-16T09:13:52Z)
- PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models [51.43746425777865]
Large Language Models (LLMs) often lack the capacity to formulate global strategies, leading to error propagation in long-horizon tasks. We propose PILOT, a framework designed to internalize the strategic oversight of large models into intrinsic Latent Guidance.
arXiv Detail & Related papers (2026-01-07T12:38:56Z)
- Cross-Modal Representational Knowledge Distillation for Enhanced Spike-Informed LFP Modeling [0.0]
Local field potentials (LFPs) can be routinely recorded alongside spiking activity in neural experiments. LFPs pose inherent modeling challenges due to their aggregate, population-level nature. We introduce a cross-modal knowledge distillation framework that transfers high-fidelity representational knowledge from pretrained multi-session spike transformer models to LFP transformer models.
arXiv Detail & Related papers (2025-12-13T21:20:13Z)
- dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning [69.36145467833498]
We introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems.
arXiv Detail & Related papers (2025-12-04T05:05:41Z)
- Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail [85.47497935739936]
Alpamayo-R1 (AR1) is a vision-language-action model that integrates Chain of Causation reasoning with trajectory planning. We show AR1 achieves 12% improvement in planning accuracy on challenging cases compared to a trajectory-only baseline. We plan to release AR1 models and a subset of the CoC in a future update.
arXiv Detail & Related papers (2025-10-30T01:25:34Z)
- Less is More: Lean yet Powerful Vision-Language Model for Autonomous Driving [7.921556303360947]
We introduce Max-V1, a novel framework for one-stage end-to-end autonomous driving. Our framework presents a single-pass generation paradigm that aligns with the inherent sequentiality of driving. Empirically, our method achieves state-of-the-art performance on the nuScenes dataset.
arXiv Detail & Related papers (2025-09-29T05:14:18Z)
- DistillDrive: End-to-End Multi-Mode Autonomous Driving Distillation by Isomorphic Hetero-Source Planning Model [23.573720107353868]
We introduce DistillDrive, an end-to-end knowledge distillation-based autonomous driving model. We employ a planning model based on structured scene representations as the teacher model. We validate our model on the nuScenes and NAVSIM datasets, achieving a 50% reduction in collision rate.
arXiv Detail & Related papers (2025-08-07T13:54:35Z)
- ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving [49.07731497951963]
ReCogDrive is a novel Reinforced Cognitive framework for end-to-end autonomous driving. We introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers. We then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner.
arXiv Detail & Related papers (2025-06-09T03:14:04Z)
- FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection [33.225938984092274]
We propose a Foreground Self-Distillation (FSD) scheme that effectively avoids the issue of distribution discrepancies.
We also design two Point Cloud Intensification (PCI) strategies to compensate for the sparsity of point clouds.
We develop a Multi-Scale Foreground Enhancement (MSFE) module to extract and fuse multi-scale foreground features.
arXiv Detail & Related papers (2024-07-14T09:39:44Z)
- FullLoRA: Efficiently Boosting the Robustness of Pretrained Vision Transformers [72.83770102062141]
The Vision Transformer (ViT) model has gradually become mainstream in various computer vision tasks. Existing large models tend to prioritize performance during training, potentially neglecting robustness. We develop a novel LNLoRA module, incorporating a learnable layer normalization before the conventional LoRA module. We propose the FullLoRA framework by integrating the learnable LNLoRA modules into all key components of ViT-based models.
arXiv Detail & Related papers (2024-01-03T14:08:39Z)
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models [125.91897197446379]
We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
arXiv Detail & Related papers (2023-05-24T04:22:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.