UniLION: Towards Unified Autonomous Driving Model with Linear Group RNNs
- URL: http://arxiv.org/abs/2511.01768v1
- Date: Mon, 03 Nov 2025 17:24:19 GMT
- Title: UniLION: Towards Unified Autonomous Driving Model with Linear Group RNNs
- Authors: Zhe Liu, Jinghua Hou, Xiaoqing Ye, Jingdong Wang, Hengshuang Zhao, Xiang Bai
- Abstract summary: UniLION efficiently handles large-scale LiDAR point clouds, high-resolution multi-view images, and even temporal sequences. UniLION consistently delivers competitive and even state-of-the-art performance across a wide range of core tasks.
- Score: 115.8554707376344
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although transformers have demonstrated remarkable capabilities across various domains, their quadratic attention mechanisms introduce significant computational overhead when processing long-sequence data. In this paper, we present a unified autonomous driving model, UniLION, which efficiently handles large-scale LiDAR point clouds, high-resolution multi-view images, and even temporal sequences based on the linear group RNN operator (i.e., performs linear RNN for grouped features). Remarkably, UniLION serves as a single versatile architecture that can seamlessly support multiple specialized variants (i.e., LiDAR-only, temporal LiDAR, multi-modal, and multi-modal temporal fusion configurations) without requiring explicit temporal or multi-modal fusion modules. Moreover, UniLION consistently delivers competitive and even state-of-the-art performance across a wide range of core tasks, including 3D perception (e.g., 3D object detection, 3D object tracking, 3D occupancy prediction, BEV map segmentation), prediction (e.g., motion prediction), and planning (e.g., end-to-end planning). This unified paradigm naturally simplifies the design of multi-modal and multi-task autonomous driving systems while maintaining superior performance. Ultimately, we hope UniLION offers a fresh perspective on the development of 3D foundation models in autonomous driving. Code is available at https://github.com/happinesslz/UniLION
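To make the core operator concrete, below is a minimal sketch of a linear group RNN: a gated linear recurrence scanned independently over each group of features. The gating form, projections, and shapes here are illustrative assumptions, not the exact operator UniLION uses (see the released code for that); the point is that each group is processed in O(length) rather than the O(length^2) of softmax attention.

```python
# Minimal sketch of a "linear group RNN": features are partitioned into
# groups (e.g. spatial windows of voxels or pixels) and a gated linear
# recurrence is scanned over each group independently. The recurrence
# form below is an illustrative assumption, not UniLION's exact operator.
import torch
import torch.nn as nn


class LinearGroupRNN(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        # Input-dependent decay gate: a_t in (0, 1) per channel.
        self.decay_gate = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_groups, group_len, dim) -- grouped features.
        u = self.in_proj(x)
        a = torch.sigmoid(self.decay_gate(x))  # per-step decay
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.shape[1]):
            # Linear recurrence: no softmax attention, O(len) per group.
            h = a[:, t] * h + (1.0 - a[:, t]) * u[:, t]
            outs.append(h)
        return self.out_proj(torch.stack(outs, dim=1))


if __name__ == "__main__":
    feats = torch.randn(8, 256, 64)  # 8 groups of 256 tokens, 64 channels
    print(LinearGroupRNN(64)(feats).shape)  # torch.Size([8, 256, 64])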
Related papers
- UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving [34.278528623978204]
UniDriveDreamer is a single-stage unified multimodal world model for autonomous driving. It generates multimodal future observations without relying on intermediate representations or cascaded modules. It outperforms previous state-of-the-art methods in both video and LiDAR generation.
arXiv Detail & Related papers (2026-02-02T12:02:27Z)
- HENet++: Hybrid Encoding and Multi-task Learning for 3D Perception and End-to-end Autonomous Driving [47.368036613468455]
We present the HENet and HENet++ framework for multi-task 3D perception and end-to-end autonomous driving. Specifically, we propose a hybrid image encoding network that uses a large image encoder for short-term frames and a small one for long-term frames. Our framework simultaneously extracts both dense and sparse features, providing more suitable representations for different tasks, reducing cumulative errors, and delivering more comprehensive information to the planning module.
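The hybrid-encoding idea can be made concrete with a small sketch: recent frames pass through a heavier backbone while older frames take a cheaper path. The tiny convolutional stacks and the frame split below are illustrative placeholders, not HENet++'s actual encoders.

```python
# Sketch of hybrid temporal encoding: recent frames go through a "large"
# backbone, older frames through a "small" one. Both backbones are kept
# deliberately tiny here and stand in for HENet++'s real networks.
import torch
import torch.nn as nn


class HybridTemporalEncoder(nn.Module):
    def __init__(self, dim: int = 64, num_recent: int = 2):
        super().__init__()
        self.num_recent = num_recent
        self.large = nn.Sequential(nn.Conv2d(3, dim, 7, 2, 3), nn.ReLU(),
                                   nn.Conv2d(dim, dim, 3, 1, 1), nn.ReLU())
        self.small = nn.Sequential(nn.Conv2d(3, dim, 7, 2, 3), nn.ReLU())

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, 3, H, W), ordered oldest -> newest.
        recent = frames[-self.num_recent:]
        history = frames[:-self.num_recent]
        feats = [self.small(history)] if len(history) else []
        feats.append(self.large(recent))
        return torch.cat(feats, dim=0)  # (T, dim, H/2, W/2)


if __name__ == "__main__":
    out = HybridTemporalEncoder()(torch.randn(6, 3, 64, 64))
    print(out.shape)  # torch.Size([6, 64, 32, 32])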
arXiv Detail & Related papers (2025-11-10T13:49:59Z)
- RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving [63.882827922267666]
Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD). We propose RoboTron-Drive, a general large multimodal model designed to process diverse data inputs while performing a broad spectrum of AD tasks, including perception, prediction, and planning. We conduct evaluations on six public benchmarks and undertake zero-shot transfer on three unseen datasets, where RoboTron-Drive achieves state-of-the-art performance across all tasks.
arXiv Detail & Related papers (2024-12-10T17:27:32Z)
- LION: Linear Group RNN for 3D Object Detection in Point Clouds [85.97541374148508]
We propose a window-based framework built on LInear grOup RNN for accurate 3D object detection, called LION.
We introduce a 3D spatial feature descriptor and integrate it into the linear group RNN operators to enhance their spatial features.
To further address the challenge in highly sparse point clouds, we propose a 3D voxel generation strategy to densify foreground features.
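A minimal sketch of what such a densification step could look like, assuming a per-voxel foreground score and a fixed spawn offset; both are stand-ins for whatever LION actually learns, chosen only to make the idea executable.

```python
# Illustrative foreground densification: voxels scored as likely foreground
# spawn duplicate voxels at a neighboring coordinate. Threshold, offset,
# and scoring are assumptions for illustration, not LION's actual strategy.
import torch


def densify_foreground(coords, feats, fg_score, thresh=0.5):
    # coords: (N, 3) integer voxel coords; feats: (N, C); fg_score: (N,)
    keep = fg_score > thresh
    # One duplicate per foreground voxel, shifted by +1 in z as a stand-in
    # for a learned diffusion of foreground features into empty space.
    new_coords = coords[keep] + torch.tensor([0, 0, 1])
    new_feats = feats[keep]
    return torch.cat([coords, new_coords]), torch.cat([feats, new_feats])


if __name__ == "__main__":
    coords = torch.randint(0, 100, (10, 3))
    feats = torch.randn(10, 16)
    scores = torch.rand(10)
    c, f = densify_foreground(coords, feats, scores)
    print(c.shape, f.shape)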
arXiv Detail & Related papers (2024-07-25T17:50:32Z)
- Learning Shared RGB-D Fields: Unified Self-supervised Pre-training for Label-efficient LiDAR-Camera 3D Perception [17.11366229887873]
We introduce a unified pretraining strategy, NeRF-Supervised Masked AutoEncoder (NS-MAE).
NS-MAE exploits NeRF's ability to encode both appearance and geometry, enabling efficient masked reconstruction of multi-modal data.
Results: NS-MAE outperforms prior SOTA pre-training methods that employ separate strategies for each modality.
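To make the masked-reconstruction idea concrete, here is a generic masked-autoencoding skeleton: a random subset of tokens is hidden, visible tokens are encoded, and the full token set is reconstructed. NS-MAE's NeRF-based rendering targets are beyond this sketch; the encoder, decoder, and mask ratio below are illustrative.

```python
# Generic masked-autoencoding skeleton: hide a fraction of input tokens
# (image patches or voxel features), encode the visible ones, and
# reconstruct everything. Placeholder networks, not NS-MAE's actual model.
import torch
import torch.nn as nn


class TinyMAE(nn.Module):
    def __init__(self, dim: int, mask_ratio: float = 0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), 2)
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.decoder = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim). Randomly keep a subset, encode it.
        B, N, D = tokens.shape
        num_keep = int(N * (1 - self.mask_ratio))
        keep = torch.rand(B, N).argsort(dim=1)[:, :num_keep]
        idx = keep.unsqueeze(-1).expand(-1, -1, D)
        encoded = self.encoder(torch.gather(tokens, 1, idx))
        # Scatter encoded tokens back; masked slots get the mask token.
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, idx, encoded)
        pred = self.decoder(full)
        return ((pred - tokens) ** 2).mean()  # reconstruction loss


if __name__ == "__main__":
    print(TinyMAE(64)(torch.randn(2, 100, 64)).item())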
arXiv Detail & Related papers (2024-05-28T08:13:49Z)
- Towards Transferable Multi-modal Perception Representation Learning for Autonomy: NeRF-Supervised Masked AutoEncoder [1.90365714903665]
This work proposes a unified self-supervised pre-training framework for transferable multi-modal perception representation learning.
We show that the representation learned via NeRF-Supervised Masked AutoEncoder (NS-MAE) shows promising transferability for diverse multi-modal and single-modal (camera-only and LiDAR-only) perception models.
We hope this study can inspire exploration of more general multi-modal representation learning for autonomous agents.
arXiv Detail & Related papers (2023-11-23T00:53:11Z)
- UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving [47.590099762244535]
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks.
This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving.
To intricately marry the semantics inherent in images with the geometric intricacies of LiDAR point clouds, we propose UniM$^2$AE.
arXiv Detail & Related papers (2023-08-21T02:13:40Z)
- MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving [15.36416000750147]
We propose a multi-modal 3D semantic segmentation model (MSeg3D) with joint intra-modal feature extraction and inter-modal feature fusion.
MSeg3D shows robustness and improves over the LiDAR-only baseline.
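One common realization of inter-modal fusion, sketched below under assumed inputs: image features are sampled at each point's projected pixel location and concatenated with that point's LiDAR features before an MLP mixes them. This is a generic pattern, not MSeg3D's exact fusion module.

```python
# Generic per-point LiDAR-camera fusion: sample image features at each
# point's projected pixel, concatenate with point features, mix with an
# MLP. Projection and both encoders are assumed given.
import torch
import torch.nn as nn
import torch.nn.functional as F


def fuse_point_features(point_feats, image_feats, pix_uv, mlp):
    # point_feats: (N, Cp); image_feats: (1, Ci, H, W);
    # pix_uv: (N, 2) projected pixel coords normalized to [-1, 1].
    grid = pix_uv.view(1, -1, 1, 2)
    sampled = F.grid_sample(image_feats, grid, align_corners=False)
    img_per_point = sampled.squeeze(0).squeeze(-1).t()  # (N, Ci)
    return mlp(torch.cat([point_feats, img_per_point], dim=-1))


if __name__ == "__main__":
    mlp = nn.Sequential(nn.Linear(16 + 32, 64), nn.ReLU(), nn.Linear(64, 64))
    fused = fuse_point_features(torch.randn(100, 16),
                                torch.randn(1, 32, 40, 60),
                                torch.rand(100, 2) * 2 - 1, mlp)
    print(fused.shape)  # torch.Size([100, 64])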
arXiv Detail & Related papers (2023-03-15T13:13:03Z)
- HUM3DIL: Semi-supervised Multi-modal 3D Human Pose Estimation for Autonomous Driving [95.42203932627102]
3D human pose estimation is an emerging technology, which can enable the autonomous vehicle to perceive and understand the subtle and complex behaviors of pedestrians.
Specifically, we embed LiDAR points into pixel-aligned multi-modal features, which we pass through a sequence of Transformer refinement stages. Our method efficiently makes use of these complementary signals in a semi-supervised fashion and outperforms existing methods by a large margin.
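A schematic of that pipeline, with all dimensions, the stage count, and the regression head chosen purely for illustration: pixel-aligned per-point embeddings pass through a stack of transformer refinement stages before joint regression.

```python
# Sketch of stacked transformer refinement over pixel-aligned point
# embeddings, ending in a 3D joint regression head. Sizes and the pooling
# choice are illustrative assumptions, not HUM3DIL's actual architecture.
import torch
import torch.nn as nn


class PointPoseRefiner(nn.Module):
    def __init__(self, dim: int = 64, stages: int = 3, num_joints: int = 17):
        super().__init__()
        self.num_joints = num_joints
        self.stages = nn.ModuleList(
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
                num_layers=1)
            for _ in range(stages))
        self.head = nn.Linear(dim, num_joints * 3)

    def forward(self, point_embed: torch.Tensor) -> torch.Tensor:
        # point_embed: (B, N, dim) pixel-aligned per-point embeddings.
        x = point_embed
        for stage in self.stages:
            x = stage(x)  # successive refinement over the point set
        pooled = x.mean(dim=1)  # aggregate points into one descriptor
        return self.head(pooled).view(-1, self.num_joints, 3)


if __name__ == "__main__":
    joints = PointPoseRefiner()(torch.randn(2, 512, 64))
    print(joints.shape)  # torch.Size([2, 17, 3])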
arXiv Detail & Related papers (2022-12-15T11:15:14Z)