Related papers: Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion

URL: http://arxiv.org/abs/2311.01017v4
Date: Mon, 1 Apr 2024 15:41:50 GMT
Title: Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion
Authors: Lunjun Zhang, Yuwen Xiong, Ze Yang, Sergio Casas, Rui Hu, Raquel Urtasun
Abstract summary: Copilot4D is a novel world modeling approach that first tokenizes sensor observations with VQVAE, then predicts the future via discrete diffusion. Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotics.
Score: 36.321494200830244
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Learning world models can teach an agent how the world works in an unsupervised manner. Even though it can be viewed as a special case of sequence modeling, progress for scaling world models on robotic applications such as autonomous driving has been somewhat less rapid than scaling language models with Generative Pre-trained Transformers (GPT). We identify two reasons as major bottlenecks: dealing with complex and unstructured observation space, and having a scalable generative model. Consequently, we propose Copilot4D, a novel world modeling approach that first tokenizes sensor observations with VQVAE, then predicts the future via discrete diffusion. To efficiently decode and denoise tokens in parallel, we recast Masked Generative Image Transformer as discrete diffusion and enhance it with a few simple changes, resulting in notable improvement. When applied to learning world models on point cloud observations, Copilot4D reduces prior SOTA Chamfer distance by more than 65% for 1s prediction, and more than 50% for 3s prediction, across NuScenes, KITTI Odometry, and Argoverse2 datasets. Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotics.
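The abstract reports results as Chamfer distance between predicted and ground-truth point clouds. As a rough illustration of what that metric measures, here is a minimal pure-Python sketch of one common symmetric form (the average of the two directional mean nearest-neighbour distances). The function names are hypothetical, and the normalization is an assumption: the paper's exact definition (for example squared distances, or a sum instead of an average) may differ.

```python
import math
from typing import Sequence

Point = Sequence[float]


def _nearest_distance(p: Point, cloud: Sequence[Point]) -> float:
    """Euclidean distance from p to its nearest neighbour in cloud."""
    return min(math.dist(p, q) for q in cloud)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between two non-empty point clouds.

    Assumed form: average of the mean nearest-neighbour distance from
    a to b and from b to a. This brute-force version is O(len(a) * len(b))
    and is meant only to show the definition, not to scale to LiDAR sweeps.
    """
    if not a or not b:
        raise ValueError("both point clouds must be non-empty")
    a_to_b = sum(_nearest_distance(p, b) for p in a) / len(a)
    b_to_a = sum(_nearest_distance(q, a) for q in b) / len(b)
    return 0.5 * (a_to_b + b_to_a)


predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
observed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(chamfer_distance(predicted, observed))  # identical clouds give 0.0
```

A lower value means the predicted cloud lies closer to the observed one, which is why the abstract describes the improvement as a reduction.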

Related papers

MinD: Unified Visual Imagination and Control via Hierarchical World Models [32.08769443927576]
Video generation models (VGMs) offer a promising pathway for unified world modeling in robotics. Manipulate in Dream (MinD) is a hierarchical diffusion-based world model framework that employs a dual-system design for vision-language manipulation. MinD executes VGM at low frequencies to extract video prediction features, while leveraging a high-frequency diffusion policy for real-time interaction.
arXiv Detail & Related papers (2025-06-23T17:59:06Z)
AMPLIFY: Actionless Motion Priors for Robot Learning from Videos [29.799207502031496]
We introduce AMPLIFY, a novel framework that leverages large-scale video data. We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples. In downstream policy learning, our dynamics predictions enable a 1.2-2.2x improvement in low-data regimes, a 1.4x average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks from zero in-distribution action data.
arXiv Detail & Related papers (2025-06-17T05:31:42Z)
Galileo: Learning Global and Local Features in Pretrained Remote Sensing Models [34.71460539414284]
We introduce a novel and highly effective self-supervised learning approach to learn both large- and small-scale features. Our Galileo models obtain state-of-the-art results across diverse remote sensing tasks.
arXiv Detail & Related papers (2025-02-13T14:21:03Z)
HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation [54.03004125910057]
We show that hierarchical vision-language-action models can be more effective in utilizing off-domain data than standard monolithic VLA models. We show that, with the hierarchical design, the high-level VLM can transfer across significant domain gaps between the off-domain finetuning data and real-robot testing scenarios.
arXiv Detail & Related papers (2025-02-08T07:50:22Z)
VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation [79.00294932026266]
VidMan is a novel framework that employs a two-stage training mechanism to enhance stability and improve data utilization efficiency. Our framework outperforms the state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving an 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset.
arXiv Detail & Related papers (2024-11-14T03:13:26Z)
Mitigating Covariate Shift in Imitation Learning for Autonomous Vehicles Using Latent Space Generative World Models [60.87795376541144]
A world model is a neural network capable of predicting an agent's next state given past states and actions. During end-to-end training, our policy learns how to recover from errors by aligning with states observed in human demonstrations. We present qualitative and quantitative results, demonstrating significant improvements upon prior state of the art in closed-loop testing.
arXiv Detail & Related papers (2024-09-25T06:48:25Z)
Generalizable Implicit Neural Representation As a Universal Spatiotemporal Traffic Data Learner [46.866240648471894]
Spatiotemporal Traffic Data (STTD) measures the complex dynamical behaviors of the multiscale transportation system. We present a novel paradigm to address the STTD learning problem by parameterizing STTD as an implicit neural representation. We validate its effectiveness through extensive experiments in real-world scenarios, showcasing applications from corridor to network scales.
arXiv Detail & Related papers (2024-06-13T02:03:22Z)
DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving [67.46481099962088]
Current vision-centric pre-training typically relies on either 2D or 3D pre-text tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task. We introduce DriveWorld, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion. DriveWorld delivers promising results on various autonomous driving tasks.
arXiv Detail & Related papers (2024-05-07T15:14:20Z)
Spatiotemporal Implicit Neural Representation as a Generalized Traffic Data Learner [46.866240648471894]
Spatiotemporal Traffic Data (STTD) measures the complex dynamical behaviors of the multiscale transportation system. We present a novel paradigm to address the STTD learning problem by parameterizing STTD as an implicit neural representation. We validate its effectiveness through extensive experiments in real-world scenarios, showcasing applications from corridor to network scales.
arXiv Detail & Related papers (2024-05-06T06:23:06Z)
Humanoid Locomotion as Next Token Prediction [84.21335675130021]
Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. We show that our model enables a full-sized humanoid to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training, such as walking backward.
arXiv Detail & Related papers (2024-02-29T18:57:37Z)
Predictive World Models from Real-World Partial Observations [66.80340484148931]
We present a framework for learning a probabilistic predictive world model for real-world road environments. While prior methods require complete states as ground truth for learning, we present a novel sequential training method that allows hierarchical VAEs (HVAEs) to learn to predict complete states from partially observed states only.
arXiv Detail & Related papers (2023-01-12T02:07:26Z)
Harnessing expressive capacity of Machine Learning modeling to represent complex coupling of Earth's auroral space weather regimes [0.0]
We develop multiple Deep Learning (DL) models that advance predictions of the global auroral particle precipitation. We use observations of electron energy flux from low-Earth-orbiting spacecraft to develop a model that improves global nowcasts. Notably, the ML models improve prediction of extreme events, which have historically resisted accurate specification, and indicate that the increased capacity provided by ML innovation can address grand challenges in the science of space weather.
arXiv Detail & Related papers (2021-11-29T22:35:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.