VaViM and VaVAM: Autonomous Driving through Video Generative Modeling
- URL: http://arxiv.org/abs/2502.15672v1
- Date: Fri, 21 Feb 2025 18:56:02 GMT
- Title: VaViM and VaVAM: Autonomous Driving through Video Generative Modeling
- Authors: Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, Renaud Marlet, Alexandre Boulch, Mickael Chen, Éloi Zablocki, Andrei Bursuc, Eduardo Valle, Matthieu Cord
- Abstract summary: We introduce an open-source auto-regressive video model (VaViM) and its companion video-action model (VaVAM) to investigate how video pre-training transfers to real-world driving. We evaluate our models in open- and closed-loop driving scenarios, revealing that video-based pre-training holds promise for autonomous driving.
- Score: 88.33638585518226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore the potential of large-scale generative video models for autonomous driving, introducing an open-source auto-regressive video model (VaViM) and its companion video-action model (VaVAM) to investigate how video pre-training transfers to real-world driving. VaViM is a simple auto-regressive video model that predicts frames using spatio-temporal token sequences. We show that it captures the semantics and dynamics of driving scenes. VaVAM, the video-action model, leverages the learned representations of VaViM to generate driving trajectories through imitation learning. Together, the models form a complete perception-to-action pipeline. We evaluate our models in open- and closed-loop driving scenarios, revealing that video-based pre-training holds promise for autonomous driving. Key insights include the semantic richness of the learned representations, the benefits of scaling for video synthesis, and the complex relationship between model size, data, and safety metrics in closed-loop evaluations. We release code and model weights at https://github.com/valeoai/VideoActionModel
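To make the described perception-to-action pipeline concrete, the sketch below shows, in PyTorch style, how representations from a pretrained video backbone could feed an imitation-learned trajectory head. The module names `FrozenVideoBackbone` and `TrajectoryHead`, the temporal pooling, and the L1 waypoint loss are illustrative assumptions for a minimal sketch, not the released VaViM/VaVAM implementation.

```python
# Illustrative sketch only -- NOT the released VaViM/VaVAM code.
# Assumption: a pretrained video backbone maps past frames to clip features;
# an action head then regresses future ego waypoints by imitation (L1 loss).
import torch
import torch.nn as nn

class FrozenVideoBackbone(nn.Module):
    """Stand-in for a pretrained autoregressive video model's encoder."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(3 * 64 * 64, feat_dim)  # toy frame embedding

    @torch.no_grad()
    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, 64, 64) -> per-clip feature (B, feat_dim)
        x = self.proj(frames.flatten(2))        # (B, T, feat_dim)
        return x.mean(dim=1)                    # temporal average pooling

class TrajectoryHead(nn.Module):
    """Maps clip features to K future waypoints (x, y) in the ego frame."""
    def __init__(self, feat_dim: int = 512, horizon: int = 6):
        super().__init__()
        self.horizon = horizon
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * 2),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(feats).view(-1, self.horizon, 2)

backbone, head = FrozenVideoBackbone(), TrajectoryHead()
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)

frames = torch.randn(4, 8, 3, 64, 64)       # dummy past video clip
expert = torch.randn(4, 6, 2)               # dummy expert future waypoints

opt.zero_grad()
pred = head(backbone(frames))
loss = nn.functional.l1_loss(pred, expert)  # imitation objective
loss.backward()
opt.step()
```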
Related papers
- An Empirical Study of Autoregressive Pre-training from Videos [67.15356613065542]
We treat videos as visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance.
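A minimal sketch of the generic objective summarized here: a video is treated as a 1D sequence of discrete visual tokens and a small causal transformer is trained with a shifted cross-entropy loss. The vocabulary size, sequence layout, and tiny model are placeholders rather than the paper's actual configuration, and the VQ tokenizer is stubbed out with random token IDs.

```python
# Generic next-token pretraining over discrete visual tokens (illustrative).
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, DIM = 1024, 256, 256   # placeholder sizes

class TinyCausalLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, DIM)
        self.pos = nn.Embedding(SEQ_LEN, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(DIM, VOCAB)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        t = ids.shape[1]
        x = self.tok(ids) + self.pos(torch.arange(t, device=ids.device))
        causal = nn.Transformer.generate_square_subsequent_mask(t).to(ids.device)
        x = self.blocks(x, mask=causal)      # causal self-attention
        return self.lm_head(x)

# Stand-in for a VQ tokenizer: each frame becomes a grid of token IDs,
# flattened in raster + time order into one 1D sequence per clip.
video_tokens = torch.randint(0, VOCAB, (8, SEQ_LEN))    # (batch, seq)

model = TinyCausalLM()
logits = model(video_tokens[:, :-1])                    # predict the next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), video_tokens[:, 1:].reshape(-1)
)
loss.backward()
```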
arXiv Detail & Related papers (2025-01-09T18:59:58Z)
- DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving [12.004604110512421]
Vision language models (VLMs) are emerging as revolutionary tools with significant potential to influence autonomous driving.
We propose the DriveGenVLM framework to generate driving videos and use VLMs to understand them.
arXiv Detail & Related papers (2024-08-29T15:52:56Z)
- WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z)
- Video In-context Learning: Autoregressive Transformers are Zero-Shot Video Imitators [46.40277880351059]
We explore utilizing visual signals as a new interface for models to interact with the environment.
We find that the model exhibits an emergent zero-shot capability to infer the semantics of a demonstration video and imitate them in an unseen scenario.
Results show that our models can generate high-quality video clips that accurately align with the semantic guidance provided by the demonstration videos.
arXiv Detail & Related papers (2024-07-10T04:27:06Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Model-Based Imitation Learning for Urban Driving [26.782783239210087]
We present MILE: a Model-based Imitation LEarning approach to jointly learn a model of the world and a policy for autonomous driving.
Our model is trained on an offline corpus of urban driving data, without any online interaction with the environment.
Our approach is the first camera-only method that models static scene, dynamic scene, and ego-behaviour in an urban driving environment.
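As a rough illustration of the recipe summarized above (jointly learning a world model and a policy from offline driving data), the sketch below couples an encoder, a latent transition model, a pixel decoder, and an imitation policy head under one combined loss. The architecture, shapes, and losses are simplified assumptions for illustration, not MILE's actual design.

```python
# Simplified world-model + imitation-policy sketch (not MILE's architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT, ACT = 64, 2   # placeholder latent size; action = (steer, accel)

class WorldModelPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(1), nn.Linear(3 * 32 * 32, LATENT))
        self.transition = nn.Linear(LATENT + ACT, LATENT)   # predicts next latent
        self.decoder = nn.Linear(LATENT, 3 * 32 * 32)       # reconstructs pixels
        self.policy = nn.Linear(LATENT, ACT)                # imitation head

    def forward(self, obs, action, next_obs):
        z = self.encoder(obs)
        z_next_pred = self.transition(torch.cat([z, action], dim=-1))
        z_next = self.encoder(next_obs)

        recon_loss = F.mse_loss(self.decoder(z), obs.flatten(1))
        dynamics_loss = F.mse_loss(z_next_pred, z_next.detach())
        imitation_loss = F.l1_loss(self.policy(z), action)   # mimic expert action
        return recon_loss + dynamics_loss + imitation_loss

model = WorldModelPolicy()
obs = torch.randn(4, 3, 32, 32)       # dummy camera observation
next_obs = torch.randn(4, 3, 32, 32)
expert_act = torch.randn(4, ACT)      # dummy expert action from offline data

loss = model(obs, expert_act, next_obs)
loss.backward()
```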
arXiv Detail & Related papers (2022-10-14T11:59:46Z)
- Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning.
We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z)
- VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
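A compact sketch of masked visual-token modeling in the generic sense: a fraction of discrete video tokens is replaced by a [MASK] id and a bidirectional transformer is trained to recover them with cross-entropy on the masked positions only. The mask ratio, vocabulary, and model below are placeholders, not VIOLET's full training setup (which is trained end-to-end on video-language pairs).

```python
# Generic masked visual-token modeling objective (illustrative only).
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, DIM, MASK_ID = 1024, 196, 256, 1024   # extra ID for [MASK]

class MaskedTokenModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB + 1, DIM)           # +1 for [MASK]
        self.pos = nn.Embedding(SEQ_LEN, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, ids):
        x = self.tok(ids) + self.pos(torch.arange(ids.shape[1], device=ids.device))
        return self.head(self.encoder(x))                 # bidirectional attention

tokens = torch.randint(0, VOCAB, (8, SEQ_LEN))            # discrete video tokens
mask = torch.rand(tokens.shape, dtype=torch.float) < 0.4  # mask ~40% of tokens
corrupted = tokens.masked_fill(mask, MASK_ID)

model = MaskedTokenModel()
logits = model(corrupted)
loss = nn.functional.cross_entropy(                       # loss on masked positions only
    logits[mask], tokens[mask]
)
loss.backward()
```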
arXiv Detail & Related papers (2021-11-24T18:31:20Z)