DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
- URL: http://arxiv.org/abs/2512.12799v1
- Date: Sun, 14 Dec 2025 18:45:54 GMT
- Title: DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
- Authors: Zhe Liu, Runhui Huang, Rui Yang, Siming Yan, Zining Wang, Lu Hou, Di Lin, Xiang Bai, Hengshuang Zhao
- Abstract summary: We propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action framework. Our method jointly performs spatial understanding, 3D perception, prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel. With only a 0.5B Qwen2.5 model as MLLM backbone, DrivePI as a single unified model matches or exceeds both existing VLA models and specialized VA models.
- Score: 94.62097655403683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although multi-modal large language models (MLLMs) have shown strong capabilities across diverse domains, their application in generating fine-grained 3D perception and prediction outputs in autonomous driving remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework that is also compatible with vision-action (VA) models. Our method jointly performs spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through end-to-end optimization. To obtain both precise geometric information and rich visual appearance, our approach integrates point clouds, multi-view images, and language instructions within a unified MLLM architecture. We further develop a data engine to generate text-occupancy and text-flow QA pairs for 4D spatial understanding. Remarkably, with only a 0.5B Qwen2.5 model as MLLM backbone, DrivePI as a single unified model matches or exceeds both existing VLA models and specialized VA models. Specifically, compared to VLA models, DrivePI outperforms OpenDriveVLA-7B by 2.5% mean accuracy on nuScenes-QA and reduces collision rate by 70% over ORION (from 0.37% to 0.11%) on nuScenes. Against specialized VA models, DrivePI surpasses FB-OCC by 10.3 RayIoU for 3D occupancy on OpenOcc, reduces the mAVE from 0.591 to 0.509 for occupancy flow on OpenOcc, and achieves 32% lower L2 error than VAD (from 0.72m to 0.49m) for planning on nuScenes. Code will be available at https://github.com/happinesslz/DrivePI
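As a rough illustration of the unified design described in the abstract, the sketch below shows how LiDAR, multi-view image, and text tokens could be fused by a small LLM-style backbone and decoded in parallel into occupancy, occupancy flow, and planning outputs. All module names, shapes, and head designs are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a unified VLA model that fuses LiDAR,
# multi-view image, and text tokens, then decodes occupancy, flow, and actions
# in parallel. All module names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class UnifiedVLASketch(nn.Module):
    def __init__(self, d_model=896, n_occ_classes=17, plan_horizon=6):
        super().__init__()
        # Stand-ins for the LiDAR / image encoders and the small LLM backbone.
        self.lidar_proj = nn.Linear(256, d_model)     # point-cloud features -> tokens
        self.image_proj = nn.Linear(1024, d_model)    # multi-view image features -> tokens
        self.text_embed = nn.Embedding(32000, d_model)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # Parallel task heads: 3D occupancy, occupancy flow, and planned trajectory.
        self.occ_head = nn.Linear(d_model, n_occ_classes)
        self.flow_head = nn.Linear(d_model, 3)                 # per-voxel 3D flow
        self.plan_head = nn.Linear(d_model, plan_horizon * 2)  # future (x, y) waypoints

    def forward(self, lidar_feats, image_feats, text_ids):
        tokens = torch.cat(
            [self.lidar_proj(lidar_feats),
             self.image_proj(image_feats),
             self.text_embed(text_ids)], dim=1)
        h = self.backbone(tokens)
        n_voxel = lidar_feats.shape[1]            # treat LiDAR tokens as voxel queries
        occ = self.occ_head(h[:, :n_voxel])
        flow = self.flow_head(h[:, :n_voxel])
        plan = self.plan_head(h[:, -1])           # decode actions from the last token
        return occ, flow, plan
```

All three heads are trained jointly end to end in the sketch, mirroring the parallel perception/prediction/planning outputs described in the abstract.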
Related papers
- D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation [66.7166217399105]
Embodied agents face a critical dilemma: end-to-end models lack interpretability and explicit 3D reasoning. Our model introduces two key innovations: 1) a Dynamic 3D Chain-of-Thought (3D CoT) that unifies planning, grounding, navigation, and question answering within a single 3D-VLM and CoT pipeline; and 2) a Synergistic Learning from Fragmented Supervision (SLFS) strategy, which uses a masked autoregressive loss to learn from massive, partially-annotated hybrid data.
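As a loose illustration of the masked-loss idea mentioned above, the sketch below drops supervision on unannotated tokens by routing them to an ignore index; it is a generic recipe under assumed tensor shapes, not the D3D-VLP implementation.

```python
# Illustrative sketch of learning from partially-annotated data with a masked
# autoregressive loss: tokens whose ground truth is missing are mapped to an
# ignore index so they contribute no gradient. Not the D3D-VLP code.
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100

def masked_autoregressive_loss(logits, targets, annotated_mask):
    """logits: (B, T, V); targets: (B, T) long; annotated_mask: (B, T) bool."""
    targets = targets.masked_fill(~annotated_mask, IGNORE_INDEX)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```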
arXiv Detail & Related papers (2025-12-14T09:53:15Z) - AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model [40.488271586857884]
AndesVL is a suite of mobile-side MLLMs with 0.6B to 4B parameters based on Qwen3's LLM and various visual encoders. We introduce a 1+N LoRA architecture alongside a Quantization-Aware LoRA Fine-Tuning framework to facilitate efficient task adaptation and model compression. We achieve a 6.7x peak decoding speedup ratio, up to 30.9% memory reduction, and 1.8 bits-per-weight when deploying AndesVL-4B on MediaTek Dimensity 9500 chips.
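The abstract does not detail the 1+N arrangement, but the LoRA building block it rests on can be sketched as follows; this is a generic, hypothetical adapter layer for illustration, not AndesVL's code.

```python
# Generic LoRA layer: a frozen pretrained linear plus a low-rank trainable update.
# Hypothetical sketch; the "1+N" adapter arrangement is not specified in the abstract.
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=16, alpha=32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the pretrained weight frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```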
arXiv Detail & Related papers (2025-10-13T15:04:38Z) - InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency [255.9417257812203]
InternVL 3.5 is a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency. A key innovation is the Cascade Reinforcement Learning framework, which enhances reasoning through a two-stage process. Our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks.
arXiv Detail & Related papers (2025-08-25T17:58:17Z) - Tracking Meets Large Multimodal Models for Driving Scenario Understanding [76.71815464110153]
Large Multimodal Models (LMMs) have recently gained prominence in autonomous driving research. We propose to integrate tracking information as an additional input to recover 3D spatial and temporal details. We introduce a novel approach for embedding this tracking information into LMMs to enhance their understanding of driving scenarios.
arXiv Detail & Related papers (2025-03-18T17:59:12Z) - PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models [20.256394783857676]
PiSA-Engine is a framework for generating instruction point-language datasets enriched with 3D spatial semantics. We introduce PiSA-Bench, a comprehensive 3D benchmark covering six key aspects with detailed and diverse labels. Experimental results demonstrate PointLLM-PiSA's state-of-the-art performance in zero-shot 3D object captioning and generative classification.
arXiv Detail & Related papers (2025-03-13T16:37:26Z) - DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving [67.46481099962088]
Current vision-centric pre-training typically relies on either 2D or 3D pre-text tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task.
We introduce DriveWorld, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion.
DriveWorld delivers promising results on various autonomous driving tasks.
arXiv Detail & Related papers (2024-05-07T15:14:20Z) - UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving [47.590099762244535]
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks.
This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving.
To intricately marry the semantics inherent in images with the geometric intricacies of LiDAR point clouds, we propose UniM$^2$AE.
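For context, the masked-autoencoder recipe referenced above typically masks a large fraction of tokens and reconstructs them from the visible remainder; the snippet below is a minimal, generic sketch of the random masking step, not the UniM$^2$AE implementation.

```python
# Generic MAE-style random masking: keep a small subset of tokens, reconstruct the rest.
# Shapes and the mask ratio are illustrative assumptions.
import torch

def random_masking(tokens, mask_ratio=0.75):
    """tokens: (B, N, D). Returns visible tokens and the per-sample shuffle order."""
    B, N, _ = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)
    ids_shuffle = noise.argsort(dim=1)            # random permutation per sample
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(
        tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
    return visible, ids_shuffle
```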
arXiv Detail & Related papers (2023-08-21T02:13:40Z) - EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning [19.354515754130592]
We introduce a distilling then pruning framework to compress large vision-language models into smaller, faster, and more accurate ones.
We apply our framework to train EfficientVLM, a fast and accurate vision-language model consisting of 6 vision layers, 3 text layers, and 3 cross-modal fusion layers.
EfficientVLM retains 98.4% performance of the teacher model and accelerates its inference speed by 2.2x.
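The distill-then-prune pipeline is not spelled out in the abstract; as a hedged sketch, a standard distillation objective combining softened teacher logits with the task loss might look like the following (temperature and weighting are illustrative, not EfficientVLM's exact recipe).

```python
# Standard knowledge-distillation objective: KL divergence on temperature-softened
# logits plus the usual cross-entropy task loss. Illustrative values only.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard
```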
arXiv Detail & Related papers (2022-10-14T13:26:41Z) - SA-Det3D: Self-Attention Based Context-Aware 3D Object Detection [9.924083358178239]
We propose two variants of self-attention for contextual modeling in 3D object detection.
We first incorporate the pairwise self-attention mechanism into the current state-of-the-art BEV, voxel and point-based detectors.
Next, we propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations.
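As a rough illustration of the pairwise self-attention variant described above, the snippet below lets every 3D feature token attend to every other token and fuses the result residually; shapes and modules are assumptions, not the SA-Det3D code.

```python
# Pairwise (full) self-attention over a set of BEV/voxel/point feature tokens,
# fused back into the original features with a residual connection.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
feats = torch.randn(2, 512, 128)          # (batch, number of 3D feature tokens, channels)
context, _ = attn(feats, feats, feats)    # every token attends to every other token
enriched = feats + context                # residual fusion of contextual information
```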
arXiv Detail & Related papers (2021-01-07T18:30:32Z) - PerMO: Perceiving More at Once from a Single Image for Autonomous Driving [76.35684439949094]
We present a novel approach to detect, segment, and reconstruct complete textured 3D models of vehicles from a single image.
Our approach combines the strengths of deep learning and the elegance of traditional techniques.
We have integrated these algorithms with an autonomous driving system.
arXiv Detail & Related papers (2020-07-16T05:02:45Z)