DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
- URL: http://arxiv.org/abs/2512.12799v1
- Date: Sun, 14 Dec 2025 18:45:54 GMT
- Title: DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
- Authors: Zhe Liu, Runhui Huang, Rui Yang, Siming Yan, Zining Wang, Lu Hou, Di Lin, Xiang Bai, Hengshuang Zhao
- Abstract summary: We propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action framework. Our method jointly performs spatial understanding, 3D perception, prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel. With only a 0.5B Qwen2.5 model as MLLM backbone, DrivePI as a single unified model matches or exceeds both existing VLA models and specialized VA models.
- Score: 94.62097655403683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although multi-modal large language models (MLLMs) have shown strong capabilities across diverse domains, their application in generating fine-grained 3D perception and prediction outputs in autonomous driving remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework that is also compatible with vision-action (VA) models. Our method jointly performs spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through end-to-end optimization. To obtain both precise geometric information and rich visual appearance, our approach integrates point clouds, multi-view images, and language instructions within a unified MLLM architecture. We further develop a data engine to generate text-occupancy and text-flow QA pairs for 4D spatial understanding. Remarkably, with only a 0.5B Qwen2.5 model as MLLM backbone, DrivePI as a single unified model matches or exceeds both existing VLA models and specialized VA models. Specifically, compared to VLA models, DrivePI outperforms OpenDriveVLA-7B by 2.5% mean accuracy on nuScenes-QA and reduces collision rate by 70% over ORION (from 0.37% to 0.11%) on nuScenes. Against specialized VA models, DrivePI surpasses FB-OCC by 10.3 RayIoU for 3D occupancy on OpenOcc, reduces the mAVE from 0.591 to 0.509 for occupancy flow on OpenOcc, and achieves 32% lower L2 error than VAD (from 0.72m to 0.49m) for planning on nuScenes. Code will be available at https://github.com/happinesslz/DrivePI
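As a rough illustration of the unified design described in the abstract, the sketch below shows how LiDAR, multi-view image, and text tokens could be fused by a small LLM-style backbone and decoded in parallel into occupancy, occupancy flow, and planning outputs. All module names, shapes, and head designs are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a unified VLA model that fuses LiDAR,
# multi-view image, and text tokens, then decodes occupancy, flow, and actions
# in parallel. All module names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class UnifiedVLASketch(nn.Module):
    def __init__(self, d_model=896, n_occ_classes=17, plan_horizon=6):
        super().__init__()
        # Stand-ins for the LiDAR / image encoders and the small LLM backbone.
        self.lidar_proj = nn.Linear(256, d_model)     # point-cloud features -> tokens
        self.image_proj = nn.Linear(1024, d_model)    # multi-view image features -> tokens
        self.text_embed = nn.Embedding(32000, d_model)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # Parallel task heads: 3D occupancy, occupancy flow, and planned trajectory.
        self.occ_head = nn.Linear(d_model, n_occ_classes)
        self.flow_head = nn.Linear(d_model, 3)                 # per-voxel 3D flow
        self.plan_head = nn.Linear(d_model, plan_horizon * 2)  # future (x, y) waypoints

    def forward(self, lidar_feats, image_feats, text_ids):
        tokens = torch.cat(
            [self.lidar_proj(lidar_feats),
             self.image_proj(image_feats),
             self.text_embed(text_ids)], dim=1)
        h = self.backbone(tokens)
        n_voxel = lidar_feats.shape[1]            # treat LiDAR tokens as voxel queries
        occ = self.occ_head(h[:, :n_voxel])
        flow = self.flow_head(h[:, :n_voxel])
        plan = self.plan_head(h[:, -1])           # decode actions from the last token
        return occ, flow, plan
```

All three heads are trained jointly end to end in the sketch, mirroring the parallel perception/prediction/planning outputs described in the abstract.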
Related papers
- D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation [66.7166217399105]
Embodied agents face a critical dilemma: end-to-end models lack interpretability and explicit 3D reasoning. Our model introduces two key innovations: 1) a Dynamic 3D Chain-of-Thought (3D CoT) that unifies planning, grounding, navigation, and question answering within a single 3D-VLM and CoT pipeline; and 2) a Synergistic Learning from Fragmented Supervision (SLFS) strategy, which uses a masked autoregressive loss to learn from massive, partially-annotated hybrid data.
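As a loose illustration of the masked-loss idea mentioned above, the sketch below drops supervision on unannotated tokens by routing them to an ignore index; it is a generic recipe under assumed tensor shapes, not the D3D-VLP implementation.

```python
# Illustrative sketch of learning from partially-annotated data with a masked
# autoregressive loss: tokens whose ground truth is missing are mapped to an
# ignore index so they contribute no gradient. Not the D3D-VLP code.
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100

def masked_autoregressive_loss(logits, targets, annotated_mask):
    """logits: (B, T, V); targets: (B, T) long; annotated_mask: (B, T) bool."""
    targets = targets.masked_fill(~annotated_mask, IGNORE_INDEX)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```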
arXiv Detail & Related papers (2025-12-14T09:53:15Z) - AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model [40.488271586857884]
AndesVL is a suite of mobile-side MLLMs with 0.6B to 4B parameters based on Qwen3's LLM and various visual encoders. We introduce a 1+N LoRA architecture alongside a Quantization-Aware LoRA Fine-Tuning framework to facilitate efficient task adaptation and model compression. We achieve a 6.7x peak decoding speedup ratio, up to 30.9% memory reduction, and 1.8 bits-per-weight when deploying AndesVL-4B on MediaTek Dimensity 9500 chips.
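The abstract does not detail the 1+N arrangement, but the LoRA building block it rests on can be sketched as follows; this is a generic, hypothetical adapter layer for illustration, not AndesVL's code.

```python
# Generic LoRA layer: a frozen pretrained linear plus a low-rank trainable update.
# Hypothetical sketch; the "1+N" adapter arrangement is not specified in the abstract.
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=16, alpha=32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the pretrained weight frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```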
arXiv Detail & Related papers (2025-10-13T15:04:38Z) - InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency [255.9417257812203]
InternVL 3.5 is a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency. A key innovation is the Cascade Reinforcement Learning framework, which enhances reasoning through a two-stage process. Our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks.
arXiv Detail & Related papers (2025-08-25T17:58:17Z) - Tracking Meets Large Multimodal Models for Driving Scenario Understanding [76.71815464110153]
Large Multimodal Models (LMMs) have recently gained prominence in autonomous driving research. We propose to integrate tracking information as an additional input to recover 3D spatial and temporal details. We introduce a novel approach for embedding this tracking information into LMMs to enhance their understanding of driving scenarios.
arXiv Detail & Related papers (2025-03-18T17:59:12Z) - PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models [20.256394783857676]
PiSA-Engine is a framework for generating instruction point-language datasets enriched with 3D spatial semantics. We introduce PiSA-Bench, a comprehensive 3D benchmark covering six key aspects with detailed and diverse labels. Experimental results demonstrate PointLLM-PiSA's state-of-the-art performance in zero-shot 3D object captioning and generative classification.
arXiv Detail & Related papers (2025-03-13T16:37:26Z) - DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving [67.46481099962088]
Current vision-centric pre-training typically relies on either 2D or 3D pre-text tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task.
We introduce DriveWorld, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion.
DriveWorld delivers promising results on various autonomous driving tasks.
arXiv Detail & Related papers (2024-05-07T15:14:20Z) - UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving [47.590099762244535]
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks.
This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving.
To intricately marry the semantics inherent in images with the geometric intricacies of LiDAR point clouds, we propose UniM$^2$AE.
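For context, the masked-autoencoder recipe referenced above typically masks a large fraction of tokens and reconstructs them from the visible remainder; the snippet below is a minimal, generic sketch of the random masking step, not the UniM$^2$AE implementation.

```python
# Generic MAE-style random masking: keep a small subset of tokens, reconstruct the rest.
# Shapes and the mask ratio are illustrative assumptions.
import torch

def random_masking(tokens, mask_ratio=0.75):
    """tokens: (B, N, D). Returns visible tokens and the per-sample shuffle order."""
    B, N, _ = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)
    ids_shuffle = noise.argsort(dim=1)            # random permutation per sample
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(
        tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
    return visible, ids_shuffle
```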
arXiv Detail & Related papers (2023-08-21T02:13:40Z) - EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning [19.354515754130592]
We introduce a distilling then pruning framework to compress large vision-language models into smaller, faster, and more accurate ones.
We apply our framework to train EfficientVLM, a fast and accurate vision-language model consisting of 6 vision layers, 3 text layers, and 3 cross-modal fusion layers.
EfficientVLM retains 98.4% performance of the teacher model and accelerates its inference speed by 2.2x.
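The distill-then-prune pipeline is not spelled out in the abstract; as a hedged sketch, a standard distillation objective combining softened teacher logits with the task loss might look like the following (temperature and weighting are illustrative, not EfficientVLM's exact recipe).

```python
# Standard knowledge-distillation objective: KL divergence on temperature-softened
# logits plus the usual cross-entropy task loss. Illustrative values only.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard
```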
arXiv Detail & Related papers (2022-10-14T13:26:41Z) - SA-Det3D: Self-Attention Based Context-Aware 3D Object Detection [9.924083358178239]
We propose two variants of self-attention for contextual modeling in 3D object detection.
We first incorporate the pairwise self-attention mechanism into the current state-of-the-art BEV, voxel and point-based detectors.
Next, we propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations.
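As a rough illustration of the pairwise self-attention variant described above, the snippet below lets every 3D feature token attend to every other token and fuses the result residually; shapes and modules are assumptions, not the SA-Det3D code.

```python
# Pairwise (full) self-attention over a set of BEV/voxel/point feature tokens,
# fused back into the original features with a residual connection.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
feats = torch.randn(2, 512, 128)          # (batch, number of 3D feature tokens, channels)
context, _ = attn(feats, feats, feats)    # every token attends to every other token
enriched = feats + context                # residual fusion of contextual information
```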
arXiv Detail & Related papers (2021-01-07T18:30:32Z) - PerMO: Perceiving More at Once from a Single Image for Autonomous Driving [76.35684439949094]
We present a novel approach to detect, segment, and reconstruct complete textured 3D models of vehicles from a single image.
Our approach combines the strengths of deep learning and the elegance of traditional techniques.
We have integrated these algorithms with an autonomous driving system.
arXiv Detail & Related papers (2020-07-16T05:02:45Z)