Temporal Triplane Transformers as Occupancy World Models
- URL: http://arxiv.org/abs/2503.07338v2
- Date: Thu, 15 May 2025 08:04:19 GMT
- Title: Temporal Triplane Transformers as Occupancy World Models
- Authors: Haoran Xu, Peixi Peng, Guang Tan, Yiqian Chang, Yisen Zhao, Yonghong Tian
- Abstract summary: T$^3$Former is a novel 4D occupancy world model for autonomous driving. It achieves 1.44$\times$ speedup, improves mean IoU to 36.09, and reduces mean absolute planning error to 1.0 meters.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: World models aim to learn or construct representations of the environment that enable the prediction of future scenes, thereby supporting intelligent motion planning. However, existing models often struggle to produce fine-grained predictions and to operate in real time. In this work, we propose T$^3$Former, a novel 4D occupancy world model for autonomous driving. T$^3$Former begins by pre-training a compact {\em triplane} representation that efficiently encodes 3D occupancy. It then extracts multi-scale temporal motion features from historical triplanes and employs an autoregressive approach to iteratively predict future triplane changes. Finally, these triplane changes are combined with previous states to decode future occupancy and ego-motion trajectories. Experimental results show that T$^3$Former achieves 1.44$\times$ speedup (26 FPS), improves mean IoU to 36.09, and reduces mean absolute planning error to 1.0 meters. Demos are available in the supplementary material.
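The pipeline described in the abstract (a pre-trained triplane encoding of 3D occupancy, temporal motion features from historical triplanes, autoregressive prediction of triplane *changes*, and decoding by combining each change with the previous state) can be sketched as a toy loop. All names, shapes, and function bodies below are illustrative stand-ins, not the paper's actual architecture:

```python
import numpy as np

def encode_occupancy_to_triplane(occ):
    """Toy stand-in for the pre-trained triplane encoder: project a dense
    (X, Y, Z) occupancy grid onto three axis-aligned feature planes."""
    return np.stack([occ.mean(axis=2), occ.mean(axis=1), occ.mean(axis=0)])

def predict_triplane_delta(history):
    """Toy stand-in for the temporal transformer: predict the *change* to
    the current triplane from a history of triplanes. Here we use simple
    linear extrapolation of the last two states for illustration."""
    return history[-1] - history[-2]

def rollout(history, steps):
    """Autoregressive rollout: each predicted change is added to the
    previous state, and the result is fed back in as new history."""
    history = list(history)
    for _ in range(steps):
        delta = predict_triplane_delta(history)
        history.append(history[-1] + delta)
    return history
```

Predicting residual changes rather than full future states is what makes the iterative rollout cheap; a real model would replace the extrapolation with learned multi-scale motion features.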
Related papers
- World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model [18.56171397212777]
We present World4Drive, an end-to-end autonomous driving framework that employs vision foundation models to build latent world models. World4Drive achieves state-of-the-art performance without manual perception annotations on open-loop nuScenes and closed-loop NavSim benchmarks.
arXiv Detail & Related papers (2025-07-01T09:36:38Z) - LMPOcc: 3D Semantic Occupancy Prediction Utilizing Long-Term Memory Prior from Historical Traversals [4.970345700893879]
Long-term Memory Prior Occupancy (LMPOcc) is the first 3D occupancy prediction methodology that exploits long-term memory priors derived from historical perceptual outputs.
We introduce a plug-and-play architecture that integrates long-term memory priors to enhance local perception while simultaneously constructing global occupancy representations.
arXiv Detail & Related papers (2025-04-18T09:58:48Z) - EMoTive: Event-guided Trajectory Modeling for 3D Motion Estimation [59.33052312107478]
Event cameras offer possibilities for 3D motion estimation through continuous adaptive pixel-level responses to scene changes.
This paper presents EMoTive, a novel event-based framework that models non-uniform trajectories via event-guided parametric curves.
For motion representation, we introduce a density-aware adaptation mechanism to fuse spatial and temporal features under event guidance.
The final 3D motion estimation is achieved through multi-temporal sampling of parametric trajectories, flows and depth motion fields.
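The multi-temporal sampling of parametric trajectories mentioned above can be illustrated with a toy example; the choice of a cubic Bézier curve and all shapes here are assumptions for illustration, not EMoTive's actual formulation:

```python
import numpy as np

def bezier_trajectory(ctrl, ts):
    """Sample a parametric 3D trajectory at multiple times.
    ctrl: (4, 3) control points of a cubic Bezier curve
    ts:   times in [0, 1] at which to evaluate the curve
    Returns an array of shape (len(ts), 3) of sampled positions."""
    ts = np.asarray(ts, dtype=float)[:, None]  # column vector for broadcasting
    return ((1 - ts) ** 3 * ctrl[0]
            + 3 * (1 - ts) ** 2 * ts * ctrl[1]
            + 3 * (1 - ts) * ts ** 2 * ctrl[2]
            + ts ** 3 * ctrl[3])
```

A continuous curve lets the model query positions at arbitrary timestamps, which suits the asynchronous, continuous nature of event-camera data.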
arXiv Detail & Related papers (2025-03-14T13:15:54Z) - Semi-Supervised Vision-Centric 3D Occupancy World Model for Autonomous Driving [22.832008530490167]
We propose a semi-supervised vision-centric 3D occupancy world model, PreWorld, to leverage the potential of 2D labels. PreWorld achieves competitive performance across 3D occupancy prediction, 4D occupancy forecasting and motion planning tasks.
arXiv Detail & Related papers (2025-02-11T07:12:26Z) - Drive-1-to-3: Enriching Diffusion Priors for Novel View Synthesis of Real Vehicles [81.29018359825872]
This paper consolidates a set of good practices to finetune large pretrained models for a real-world task. Specifically, we develop several strategies to account for discrepancies between the synthetic data and real driving data. Our insights lead to effective finetuning that results in a 68.8% reduction in FID for novel view synthesis over prior arts.
arXiv Detail & Related papers (2024-12-19T03:39:13Z) - An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-assisted Training [50.71892161377806]
DFIT-OccWorld is an efficient 3D occupancy world model that leverages decoupled dynamic flow and an image-assisted training strategy. Our model forecasts future dynamic voxels by warping existing observations using voxel flow, whereas static voxels are easily obtained through pose transformation.
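The decoupled scheme described above (dynamic voxels warped by a predicted voxel flow, static voxels handled separately via pose transformation) can be illustrated with a toy forward-warp; the integer flow field and grid shapes are hypothetical simplifications:

```python
import numpy as np

def warp_dynamic_voxels(occ, flow):
    """Forward-warp occupied dynamic voxels by an integer voxel-flow field.
    occ:  (X, Y, Z) boolean occupancy grid of dynamic voxels
    flow: (X, Y, Z, 3) integer per-voxel displacement"""
    out = np.zeros_like(occ)
    for idx in np.argwhere(occ):       # each occupied source voxel
        tgt = idx + flow[tuple(idx)]   # displaced target location
        if np.all(tgt >= 0) and np.all(tgt < occ.shape):
            out[tuple(tgt)] = True     # voxels warped out of bounds are dropped
    return out
```

Warping observed voxels sidesteps regenerating the whole scene each step: only the flow must be predicted, while the static background follows directly from ego-pose.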
arXiv Detail & Related papers (2024-12-18T12:10:33Z) - GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction [67.81475355852997]
3D occupancy prediction is important for autonomous driving due to its comprehensive perception of the surroundings. We propose a world-model-based framework to exploit the scene evolution for perception. Our framework improves the performance of the single-frame counterpart by over 2% in mIoU without introducing additional computations.
arXiv Detail & Related papers (2024-12-13T18:59:54Z) - OPUS: Occupancy Prediction Using a Sparse Set [64.60854562502523]
We present a framework to simultaneously predict occupied locations and classes using a set of learnable queries.
OPUS incorporates a suite of non-trivial strategies to enhance model performance.
Our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at near 2x FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.
arXiv Detail & Related papers (2024-09-14T07:44:22Z) - Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving [15.100104512786107]
Drive-OccWorld adapts a vision-centric 4D forecasting world model to end-to-end planning for autonomous driving. We propose injecting flexible action conditions, such as velocity, steering angle, trajectory, and commands, into the world model to enable controllable generation. Our method can generate plausible and controllable 4D occupancy, paving the way for advancements in driving world generation and end-to-end planning.
arXiv Detail & Related papers (2024-08-26T11:53:09Z) - OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving [62.54220021308464]
We propose a diffusion-based 4D occupancy generation model, OccSora, to simulate the development of the 3D world for autonomous driving.
OccSora can generate 16-second videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes.
arXiv Detail & Related papers (2024-05-30T17:59:42Z) - DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving [67.46481099962088]
Current vision-centric pre-training typically relies on either 2D or 3D pre-text tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task.
We introduce DriveWorld, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion.
DriveWorld delivers promising results on various autonomous driving tasks.
arXiv Detail & Related papers (2024-05-07T15:14:20Z) - Improving Trajectory Prediction in Dynamic Multi-Agent Environment by Dropping Waypoints [9.385936248154987]
Motion prediction systems must learn spatial and temporal information from the past to forecast the future trajectories of the agent.
We propose Temporal Waypoint Dropping (TWD) that explicitly incorporates temporal dependencies during the training of a trajectory prediction model.
We evaluate our proposed approach on three datasets: NBA SportVU, ETH-UCY, and TrajNet++.
arXiv Detail & Related papers (2023-09-29T15:48:35Z) - Pedestrian 3D Bounding Box Prediction [83.7135926821794]
We focus on 3D bounding boxes, which are reasonable estimates of humans without modeling complex motion details for autonomous vehicles.
We suggest this new problem and present a simple yet effective model for pedestrians' 3D bounding box prediction.
This method follows an encoder-decoder architecture based on recurrent neural networks.
arXiv Detail & Related papers (2022-06-28T17:59:45Z) - Graph-based Spatial Transformer with Memory Replay for Multi-future Pedestrian Trajectory Prediction [13.466380808630188]
We propose a model to forecast multiple paths based on a historical trajectory.
Our method can exploit the spatial information as well as correct the temporally inconsistent trajectories.
Our experiments show that the proposed model achieves state-of-the-art performance on multi-future prediction and competitive results for single-future prediction.
arXiv Detail & Related papers (2022-06-12T10:25:12Z) - A Spatio-temporal Transformer for 3D Human Motion Prediction [39.31212055504893]
We propose a Transformer-based architecture for the task of generative modelling of 3D human motion.
We empirically show that this effectively learns the underlying motion dynamics and reduces error accumulation over time observed in auto-regressive models.
arXiv Detail & Related papers (2020-04-18T19:49:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.