DriveVGGT: Visual Geometry Transformer for Autonomous Driving
- URL: http://arxiv.org/abs/2511.22264v1
- Date: Thu, 27 Nov 2025 09:40:43 GMT
- Title: DriveVGGT: Visual Geometry Transformer for Autonomous Driving
- Authors: Xiaosong Jia, Yanhao Liu, Junqi You, Renqiu Xia, Yu Hong, Junchi Yan
- Abstract summary: DriveVGGT is a scale-aware 4D reconstruction framework specifically designed for autonomous driving data. We propose a Temporal Video Attention (TVA) module to process multi-camera videos independently. Then, we propose a Multi-camera Consistency Attention (MCA) module to conduct window attention with normalized relative pose embeddings.
- Score: 50.5036123750788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Feed-forward reconstruction has recently gained significant attention, with VGGT being a notable example. However, directly applying VGGT to autonomous driving (AD) systems leads to sub-optimal results due to the different priors of the two tasks. In AD systems, several important new priors need to be considered: (i) The overlap between camera views is minimal, as autonomous driving sensor setups are designed to achieve coverage at low cost. (ii) The camera intrinsics and extrinsics are known, which introduces more constraints on the output and also enables the estimation of absolute scale. (iii) The relative positions of all cameras remain fixed even though the ego vehicle is in motion. To fully integrate these priors into a feed-forward framework, we propose DriveVGGT, a scale-aware 4D reconstruction framework specifically designed for autonomous driving data. Specifically, we propose a Temporal Video Attention (TVA) module to process multi-camera videos independently, which better leverages the spatiotemporal continuity within each single-camera sequence. Then, we propose a Multi-camera Consistency Attention (MCA) module that conducts window attention with normalized relative pose embeddings, aiming to establish consistency relationships across different cameras while restricting each token to attend only to nearby frames. Finally, we extend the standard VGGT heads by adding an absolute scale head and an ego-vehicle pose head. Experiments show that DriveVGGT outperforms VGGT, StreamVGGT, and fastVGGT on autonomous driving datasets, while extensive ablation studies verify the effectiveness of the proposed designs.
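As a concrete illustration of the MCA idea described above, here is a minimal PyTorch sketch of window attention over multi-camera tokens with a relative-pose embedding mixed into the keys and values. All module names, tensor shapes, and the pose-embedding scheme are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the Multi-camera Consistency Attention (MCA) idea:
# tokens from all cameras attend within a temporal window, with an embedding
# of the normalized relative camera pose added to the keys and values.
import torch
import torch.nn as nn


class MultiCameraWindowAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, window: int = 2):
        super().__init__()
        self.window = window  # each token attends to frames within +/- window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Embeds a flattened 4x4 relative pose (assumed already normalized).
        self.pose_mlp = nn.Sequential(
            nn.Linear(16, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, tokens: torch.Tensor, rel_pose: torch.Tensor,
                frame_idx: torch.Tensor) -> torch.Tensor:
        # tokens:    (B, N, C)    one token per (camera, frame) pair
        # rel_pose:  (B, N, 4, 4) pose relative to a reference camera
        # frame_idx: (B, N)       temporal index of each token
        pose_emb = self.pose_mlp(rel_pose.flatten(2))          # (B, N, C)
        kv = tokens + pose_emb                                 # pose-aware keys/values
        # Block attention between frames farther apart than the window.
        dist = (frame_idx[:, :, None] - frame_idx[:, None, :]).abs()
        mask = dist > self.window                              # True = blocked
        # nn.MultiheadAttention expects a per-head mask: (B*heads, N, N).
        mask = mask.repeat_interleave(self.attn.num_heads, dim=0)
        out, _ = self.attn(tokens, kv, kv, attn_mask=mask)
        return tokens + out                                    # residual connection


# Usage sketch: 2 cameras x 4 frames, each token attends within +/- 1 frame.
B, cams, frames, C = 1, 2, 4, 64
tokens = torch.randn(B, cams * frames, C)
rel_pose = torch.eye(4).expand(B, cams * frames, 4, 4)   # identity poses
frame_idx = torch.arange(frames).repeat(cams).expand(B, -1)
out = MultiCameraWindowAttention(dim=C, window=1)(tokens, rel_pose, frame_idx)
print(out.shape)  # torch.Size([1, 8, 64])
```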
Related papers
- Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos [20.73513310337503]
Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving. We propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos.
arXiv Detail & Related papers (2026-02-25T16:38:53Z)
- Visual Implicit Geometry Transformer for Autonomous Driving [7.795200422563638]
We introduce the Visual Implicit Geometry Transformer (ViGT), an autonomous driving geometric model. ViGT estimates a continuous 3D occupancy field in a bird's-eye view (BEV), addressing domain-specific requirements. We validate the scalability and generalizability of our approach by training our model on a mixture of five large-scale autonomous driving datasets.
arXiv Detail & Related papers (2026-02-05T11:54:38Z)
- SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving [52.02379432801349]
We propose SGDrive, a novel framework that structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition.
arXiv Detail & Related papers (2026-01-09T08:55:42Z)
- DVGT: Driving Visual Geometry Transformer [63.38483879291505]
A driving-targeted dense geometry perception model can adapt to different scenarios and camera configurations. We propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations.
arXiv Detail & Related papers (2025-12-18T18:59:57Z)
- DriveCamSim: Generalizable Camera Simulation via Explicit Camera Modeling for Autonomous Driving [9.882070476776274]
We present DriveCamSim, a generalizable camera simulation framework. Our core innovation lies in the proposed Explicit Camera Modeling mechanism. For controllable generation, we identify the issue of information loss inherent in existing conditional encoding and injection pipelines.
arXiv Detail & Related papers (2025-05-26T08:50:15Z)
- Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention [61.3281618482513]
We present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos. CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions. CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos.
arXiv Detail & Related papers (2024-12-04T18:02:49Z)
- DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Self-Driving [55.53171248839489]
We propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving. Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction, and an iterative motion planner. Experiments conducted on the nuScenes and Bench2Drive datasets demonstrate the superior planning performance and great efficiency of DiFSD.
arXiv Detail & Related papers (2024-09-15T15:55:24Z)
- Multi-camera Bird's Eye View Perception for Autonomous Driving [17.834495597639805]
It is essential to produce perception outputs in 3D to enable spatial reasoning about other agents and structures.
The most basic approach to achieving the desired BEV representation from a camera image is inverse perspective mapping (IPM), which assumes a flat ground surface; a minimal sketch of this baseline follows below.
More recent approaches use deep neural networks to output directly in BEV space.
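As a concrete illustration of the IPM baseline, the sketch below maps an image pixel to metric ground coordinates via the ground-plane homography, assuming a flat ground (Z = 0) and known calibration; the intrinsics, pitch, and mounting height are placeholder values, not taken from the paper.

```python
# Minimal inverse perspective mapping (IPM) sketch under a flat-ground
# assumption. Calibration values are illustrative placeholders.
import numpy as np


def ipm_homography(K, R, t):
    # World points on the ground plane (Z = 0) project as
    #   s * [u, v, 1]^T = K (R [X, Y, 0]^T + t) = K [r1 | r2 | t] [X, Y, 1]^T,
    # so the image -> ground map is the inverse of H = K [r1 | r2 | t].
    H = K @ np.column_stack((R[:, 0], R[:, 1], t))
    return np.linalg.inv(H)


# Illustrative calibration: world frame X right, Y forward, Z up;
# camera mounted 1.5 m above the ground, pitched 10 degrees downward.
K = np.array([[800., 0., 640.],
              [0., 800., 360.],
              [0., 0., 1.]])
p = np.deg2rad(10.0)
R = np.array([[1., 0., 0.],                  # rows = camera axes in world coords
              [0., -np.sin(p), -np.cos(p)],  # camera y points down/forward
              [0., np.cos(p), -np.sin(p)]])  # camera z (optical axis) forward/down
C = np.array([0., 0., 1.5])                  # camera center (height above ground)
t = -R @ C                                   # world -> camera translation

H_inv = ipm_homography(K, R, t)
u, v = 640.0, 500.0                          # a pixel below the principal point
X, Y, w = H_inv @ np.array([u, v, 1.0])
print("ground point (m):", X / w, Y / w)     # metric BEV coordinates
```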
arXiv Detail & Related papers (2023-09-16T19:12:05Z)
- VIMI: Vehicle-Infrastructure Multi-view Intermediate Fusion for Camera-based 3D Object Detection [17.22491199725569]
Vehicle-Infrastructure Cooperative 3D Object Detection (VIC3D) makes use of multi-view cameras from both vehicles and traffic infrastructure.
We propose a novel 3D object detection framework, Vehicle-Infrastructure Multi-view Intermediate fusion (VIMI).
VIMI achieves 15.61% overall AP_3D and 21.44% AP_BEV on the new VIC3D dataset, DAIR-V2X-C.
arXiv Detail & Related papers (2023-03-20T09:56:17Z)
- SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation [101.55622133406446]
We propose SurroundDepth, a method that incorporates information from multiple surrounding views to predict depth maps across cameras.
Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views (see the sketch below).
In experiments, our method achieves state-of-the-art performance on challenging multi-camera depth estimation datasets.
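For illustration, here is a minimal PyTorch sketch of the cross-view attention idea, in which each camera's feature tokens attend to tokens from all surrounding views; the module name, shapes, and single-block design are assumptions rather than SurroundDepth's actual architecture.

```python
# Hypothetical cross-view fusion sketch: tokens from all surrounding cameras
# are pooled so every view can attend to every other view.
import torch
import torch.nn as nn


class CrossViewFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, V, N, C) = batch, views, tokens per view, channels
        B, V, N, C = feats.shape
        tokens = feats.reshape(B, V * N, C)     # pool tokens from all views
        fused, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + fused)      # residual + norm
        return tokens.reshape(B, V, N, C)


# Usage: fuse 6 surround-view feature maps of 100 tokens each.
feats = torch.randn(2, 6, 100, 64)
fused = CrossViewFusion(dim=64)(feats)
print(fused.shape)  # torch.Size([2, 6, 100, 64])
```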
arXiv Detail & Related papers (2022-04-07T17:58:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.