Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos
- URL: http://arxiv.org/abs/2602.22091v2
- Date: Wed, 04 Mar 2026 19:25:25 GMT
- Title: Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos
- Authors: Matthew Strong, Wei-Jer Chang, Quentin Herau, Jiezhi Yang, Yihan Hu, Chensheng Peng, Wei Zhan,
- Abstract summary: Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving. We propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos.
- Score: 20.73513310337503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego-motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self-supervised approaches that focus primarily on frame-to-frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi-modal supervisory signals that guide the model to jointly predict current and future point maps, camera poses, semantic segmentation, and motion masks. Multi-modal teachers provide sequence-level pseudo-supervision, enabling LFG to learn a unified pseudo-4D representation from raw YouTube videos without poses, labels, or LiDAR. The resulting encoder not only transfers effectively to downstream autonomous driving planning on the NAVSIM benchmark, surpassing multi-camera and LiDAR baselines with only a single monocular camera, but also yields strong performance when evaluated on a range of semantic, geometric, and qualitative motion prediction tasks. These geometry and motion-aware features position LFG as a compelling video-centric foundation model for autonomous driving.
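The abstract describes a shared feedforward encoder with a lightweight autoregressive module, trained against multi-modal teacher pseudo-labels for point maps, camera poses, semantic segmentation, and motion masks. The sketch below illustrates what such a teacher-guided multi-task setup could look like; the module choices (a GRU as the autoregressive part), head designs, and unit loss weights are assumptions for exposition, not the authors' LFG implementation.

```python
# Illustrative sketch of teacher-guided, label-free multi-task pretraining.
# Module names, head designs, and loss weights are assumptions, not the
# authors' actual LFG architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LFGSketch(nn.Module):
    def __init__(self, dim=256, num_classes=19):
        super().__init__()
        # Shared per-frame visual encoder (stand-in for the feedforward backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Lightweight autoregressive module over time (here: a GRU on pooled tokens).
        self.temporal = nn.GRU(dim, dim, batch_first=True)
        # Dense heads: point maps (xyz), semantic segmentation, motion masks.
        self.point_head = nn.Conv2d(dim, 3, 1)
        self.seg_head = nn.Conv2d(dim, num_classes, 1)
        self.motion_head = nn.Conv2d(dim, 1, 1)
        # Global head: relative camera pose (3 translation + 3 rotation parameters).
        self.pose_head = nn.Linear(dim, 6)

    def forward(self, video):                      # video: (B, T, 3, H, W)
        B, T, _, H, W = video.shape
        feats = self.encoder(video.flatten(0, 1))  # (B*T, dim, h, w)
        pooled = feats.mean(dim=(2, 3)).view(B, T, -1)
        ctx, _ = self.temporal(pooled)             # temporal context per frame
        # Broadcast temporal context back onto the spatial feature maps.
        feats = feats + ctx.flatten(0, 1)[:, :, None, None]
        return {
            "points": self.point_head(feats),
            "seg": self.seg_head(feats),
            "motion": self.motion_head(feats),
            "pose": self.pose_head(ctx.flatten(0, 1)),
        }

def teacher_guided_loss(pred, teacher):
    # Teachers (e.g. feedforward geometry, segmentation, and flow models)
    # supply sequence-level pseudo-labels; the equal weights are placeholders.
    return (F.l1_loss(pred["points"], teacher["points"])
            + F.cross_entropy(pred["seg"], teacher["seg"])
            + F.binary_cross_entropy_with_logits(pred["motion"], teacher["motion"])
            + F.l1_loss(pred["pose"], teacher["pose"]))
```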
Related papers
- VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving [26.557803260279258]
The importance of cross-view 3D geometric modeling for autonomous driving is self-evident, yet existing Vision-Language Models inherently lack this capability. We propose a novel architecture, VGGDrive, which empowers Vision-Language Models with cross-view Geometric Grounding for autonomous driving.
arXiv Detail & Related papers (2026-02-24T11:33:44Z) - InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation [53.47253633654885]
InstaDrive is a novel framework that enhances driving video realism through two key advancements. By incorporating these instance-aware mechanisms, InstaDrive achieves state-of-the-art video generation quality. Our project page is https://shanpoyang654.io/InstaDrive/page.html.
arXiv Detail & Related papers (2026-02-03T08:22:13Z) - SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving [52.02379432801349]
We propose SGDrive, a novel framework that structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition.
arXiv Detail & Related papers (2026-01-09T08:55:42Z) - Spatial Retrieval Augmented Autonomous Driving [81.39665750557526]
Existing autonomous driving systems rely on onboard sensors for environmental perception. We propose the spatial retrieval paradigm, introducing offline retrieved geographic images as an additional input. We will open-source dataset curation code, data, and benchmarks for further study of this new autonomous driving paradigm.
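A minimal sketch of how retrieved geographic imagery could be fused with the onboard camera as an extra input. The retrieval key (e.g. a GPS-tiled lookup) and the fusion-by-concatenation design are assumptions, not the paper's method.

```python
# Sketch: fuse onboard camera features with features from offline-retrieved
# geographic imagery. Encoder sizes and the concatenation fusion are assumptions.
import torch
import torch.nn as nn

class RetrievalAugmentedPerception(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.onboard_enc = nn.Conv2d(3, dim, 3, stride=2, padding=1)
        self.geo_enc = nn.Conv2d(3, dim, 3, stride=2, padding=1)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, camera_img, retrieved_geo_img):
        cam = self.onboard_enc(camera_img)       # features from the live sensor
        geo = self.geo_enc(retrieved_geo_img)    # features from offline map imagery
        return self.fuse(torch.cat([cam, geo], dim=1))

# Usage: look up a geo-referenced image tile for the current location
# (hypothetical retrieval step), then pass both views through the fused head.
model = RetrievalAugmentedPerception()
fused = model(torch.randn(1, 3, 128, 256), torch.randn(1, 3, 128, 256))
```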
arXiv Detail & Related papers (2025-12-07T14:40:49Z) - DriveVGGT: Visual Geometry Transformer for Autonomous Driving [50.5036123750788]
DriveVGGT is a scale-aware 4D reconstruction framework specifically designed for autonomous driving data. We propose a Temporal Video Attention (TVA) module to process multi-camera videos independently. Then, we propose a Multi-camera Consistency Attention (MCA) module to conduct window attention with normalized relative pose embeddings.
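A rough sketch of cross-camera attention conditioned on normalized relative pose embeddings, in the spirit of the MCA idea above. Where the pose signal is injected and how it is embedded are illustrative assumptions.

```python
# Sketch: multi-camera attention with a learned bias from normalized relative poses.
import torch
import torch.nn as nn

class PoseConditionedCameraAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.pose_embed = nn.Linear(6, dim)   # 6-DoF relative pose -> token bias
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, rel_poses):
        # tokens:    (B, num_cams * tokens_per_cam, dim)
        # rel_poses: (B, num_cams, 6), pose of each camera relative to a reference,
        #            normalized (e.g. translations divided by a scene scale).
        B, N, _ = tokens.shape
        per_cam = N // rel_poses.shape[1]
        pose_bias = self.pose_embed(rel_poses)                   # (B, num_cams, dim)
        pose_bias = pose_bias.repeat_interleave(per_cam, dim=1)  # (B, N, dim)
        x = tokens + pose_bias
        out, _ = self.attn(x, x, x)
        return out
```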
arXiv Detail & Related papers (2025-11-27T09:40:43Z) - Less is More: Lean yet Powerful Vision-Language Model for Autonomous Driving [7.921556303360947]
We introduce Max-V1, a novel framework for one-stage end-to-end autonomous driving. Our framework presents a single-pass generation paradigm that aligns with the inherent sequentiality of driving. Empirically, our method achieves state-of-the-art performance on the nuScenes dataset.
arXiv Detail & Related papers (2025-09-29T05:14:18Z) - LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving [58.535516533697425]
Large vision-language models (VLMs) have shown promising capabilities in scene understanding. We propose a novel vision-language framework tailored for autonomous driving, called LMAD. Our framework emulates modern end-to-end driving paradigms by incorporating comprehensive scene understanding and a task-specialized structure with VLMs.
arXiv Detail & Related papers (2025-08-17T15:42:54Z) - Policy Pre-training for End-to-end Autonomous Driving via Self-supervised Geometric Modeling [96.31941517446859]
We propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for policy pretraining in visuomotor driving.
We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos.
In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input.
In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only.
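The photometric error that drives this kind of self-supervision is standard: warp the source frame into the target view using predicted depth and relative pose, then penalize the reconstruction error. Below is a minimal sketch of that warping loss; the intrinsics K and a 4x4 relative pose T are assumed given, and the SSIM and masking terms typically used in practice are omitted.

```python
# Sketch of a photometric reconstruction loss for depth/pose self-supervision.
import torch
import torch.nn.functional as F

def photometric_loss(target, source, depth, T, K):
    # target, source: (B, 3, H, W); depth: (B, 1, H, W)
    # T: (B, 4, 4) relative pose target->source; K: (B, 3, 3) camera intrinsics.
    B, _, H, W = target.shape
    device = target.device
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(1, 3, -1).expand(B, -1, -1)
    # Back-project pixels to 3D, transform into the source frame, and re-project.
    cam_pts = torch.linalg.inv(K) @ pix * depth.view(B, 1, -1)
    cam_pts_h = torch.cat([cam_pts, torch.ones(B, 1, H * W, device=device)], dim=1)
    src_pts = (T @ cam_pts_h)[:, :3]
    src_pix = K @ src_pts
    src_pix = src_pix[:, :2] / src_pix[:, 2:].clamp(min=1e-6)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid_x = 2 * src_pix[:, 0] / (W - 1) - 1
    grid_y = 2 * src_pix[:, 1] / (H - 1) - 1
    grid = torch.stack([grid_x, grid_y], dim=-1).view(B, H, W, 2)
    recon = F.grid_sample(source, grid, align_corners=True)
    return F.l1_loss(recon, target)
```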
arXiv Detail & Related papers (2023-01-03T08:52:49Z) - CARNet: A Dynamic Autoencoder for Learning Latent Dynamics in Autonomous Driving Tasks [11.489187712465325]
An autonomous driving system should effectively use the information collected from the various sensors in order to form an abstract description of the world.
Deep learning models, such as autoencoders, can be used for that purpose, as they can learn compact latent representations from a stream of incoming data.
This work proposes CARNet, a Combined dynAmic autoencodeR NETwork architecture that utilizes an autoencoder combined with a recurrent neural network to learn the current latent representation.
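A minimal sketch of pairing an autoencoder with a recurrent network to model latent dynamics, in the spirit of the description above. The layer sizes, input resolution, and the use of a GRU rather than the authors' exact recurrent cell are assumptions.

```python
# Sketch: autoencoder latents propagated through time by a recurrent module.
import torch
import torch.nn as nn

class LatentDynamicsAE(nn.Module):
    def __init__(self, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )
        # Recurrent module that carries the latent state across frames.
        self.dynamics = nn.GRU(latent_dim, latent_dim, batch_first=True)

    def forward(self, frames):                 # frames: (B, T, 3, 64, 64)
        B, T = frames.shape[:2]
        z = self.encoder(frames.flatten(0, 1)).view(B, T, -1)
        z_dyn, _ = self.dynamics(z)            # temporally-informed latents
        recon = self.decoder(z_dyn.flatten(0, 1)).view(B, T, 3, 64, 64)
        return recon, z_dyn
```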
arXiv Detail & Related papers (2022-05-18T04:15:42Z) - Self-Supervised Pillar Motion Learning for Autonomous Driving [10.921208239968827]
We propose a learning framework that leverages free supervisory signals from point clouds and paired camera images to estimate motion purely via self-supervision.
Our model involves a point cloud based structural consistency augmented with probabilistic motion masking as well as a cross-sensor motion regularization to realize the desired self-supervision.
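A simplified sketch of the supervisory signals named above: warp the current point cloud by the predicted motion, score consistency with the next sweep, gate by a motion probability, and regularize against a camera-derived motion estimate. The one-sided Chamfer term and the way the camera signal is projected onto points are assumptions, not the paper's exact losses.

```python
# Sketch of structural consistency + motion masking + cross-sensor regularization.
import torch

def pillar_motion_losses(points_t, points_t1, pred_motion, motion_prob, cam_motion):
    # points_t, points_t1: (N, 3) and (M, 3) LiDAR sweeps at t and t+1.
    # pred_motion: (N, 3) predicted displacement per point (from pillar features).
    # motion_prob: (N,) probability that a point is actually moving.
    # cam_motion:  (N, 3) motion pseudo-label projected from paired camera flow.
    warped = points_t + pred_motion
    # Structural consistency: each warped point should have a close neighbor in
    # the next sweep; likely-static points are pushed toward zero motion instead.
    dists = torch.cdist(warped, points_t1).min(dim=1).values        # (N,)
    consistency = (motion_prob * dists).mean()
    static_reg = ((1 - motion_prob) * pred_motion.norm(dim=1)).mean()
    # Cross-sensor regularization: agree with the camera-derived motion estimate.
    cross_sensor = (pred_motion - cam_motion).abs().mean()
    return consistency + static_reg + cross_sensor
```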
arXiv Detail & Related papers (2021-04-18T02:32:08Z)