Visual Implicit Geometry Transformer for Autonomous Driving
- URL: http://arxiv.org/abs/2602.05573v1
- Date: Thu, 05 Feb 2026 11:54:38 GMT
- Title: Visual Implicit Geometry Transformer for Autonomous Driving
- Authors: Arsenii Shirokov, Mikhail Kuznetsov, Danila Stepochkin, Egor Evdokimov, Daniil Glazkov, Nikolay Patakin, Anton Konushin, Dmitry Senushkin
- Abstract summary: We introduce the Visual Implicit Geometry Transformer (ViGT), a geometric model for autonomous driving. ViGT estimates a continuous 3D occupancy field in a bird's-eye-view (BEV) frame, addressing domain-specific requirements. We validate the scalability and generalizability of our approach by training our model on a mixture of five large-scale autonomous driving datasets.
- Score: 7.795200422563638
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the Visual Implicit Geometry Transformer (ViGT), a geometric model for autonomous driving that estimates continuous 3D occupancy fields from surround-view camera rigs. ViGT represents a step towards foundational geometric models for autonomous driving, prioritizing scalability, architectural simplicity, and generalization across diverse sensor configurations. Our approach achieves this through a calibration-free architecture, enabling a single model to adapt to different sensor setups. Unlike general-purpose geometric foundation models that focus on pixel-aligned predictions, ViGT estimates a continuous 3D occupancy field in a bird's-eye-view (BEV) frame, addressing domain-specific requirements. ViGT naturally fuses geometry from multiple camera views into a single metric coordinate frame, providing a common representation for multiple geometric tasks. Unlike most existing occupancy models, we adopt a self-supervised training procedure that leverages synchronized image-LiDAR pairs, eliminating the need for costly manual annotations. We validate the scalability and generalizability of our approach by training our model on a mixture of five large-scale autonomous driving datasets (NuScenes, Waymo, NuPlan, ONCE, and Argoverse) and achieving state-of-the-art performance on the pointmap estimation task, with the best average rank across all evaluated baselines. We further evaluate ViGT on the Occ3D-nuScenes benchmark, where it achieves performance comparable to supervised methods. The source code is publicly available at https://github.com/whesense/ViGT.
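To make the abstract concrete, below is a minimal PyTorch sketch of the two ideas it describes: an implicit occupancy head queried at continuous 3D points in a shared metric BEV frame, and self-supervised occupancy targets derived from a synchronized LiDAR sweep. All module names, shapes, and the ray-based labeling scheme are our assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (our assumptions, not the authors' code) of
# (1) an implicit occupancy head queried at continuous 3D points in a
# metric BEV frame, and (2) self-supervised occupancy targets built
# from a synchronized LiDAR sweep instead of manual annotations.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImplicitBEVOccupancy(nn.Module):
    """Decode occupancy at arbitrary (x, y, z) query points.

    A transformer backbone (omitted here) is assumed to have fused the
    surround-view images into a BEV feature map of shape (B, C, H, W).
    """

    def __init__(self, feat_dim: int = 128, hidden: int = 256):
        super().__init__()
        # Small MLP maps (bilinearly sampled BEV feature, height z)
        # to a single occupancy logit, so queries need not lie on a grid.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, bev_feats: torch.Tensor, points: torch.Tensor,
                bev_range: float = 50.0) -> torch.Tensor:
        # points: (B, N, 3) metric coordinates in the shared ego frame.
        xy = points[..., :2] / bev_range           # normalize to [-1, 1]
        grid = xy.unsqueeze(2)                     # (B, N, 1, 2) sample grid
        feats = F.grid_sample(bev_feats, grid, align_corners=False)
        feats = feats.squeeze(-1).transpose(1, 2)  # (B, N, C)
        z = points[..., 2:3]                       # keep height as an input
        return self.mlp(torch.cat([feats, z], dim=-1)).squeeze(-1)


def lidar_occupancy_targets(lidar_points: torch.Tensor,
                            num_free_per_ray: int = 4):
    """Turn one LiDAR sweep into (query point, occupancy label) pairs.

    Returns at measured range are labeled occupied (1); points sampled
    along each ray before the return are labeled free (0). This is a
    common self-supervision recipe; the paper's exact scheme may differ.
    """
    occupied = lidar_points                        # (N, 3), label 1
    # Free-space samples at random fractions of each ray's length
    # (rays are assumed to originate at the sensor origin).
    t = torch.rand(lidar_points.shape[0], num_free_per_ray, 1) * 0.9
    free = (lidar_points.unsqueeze(1) * t).reshape(-1, 3)
    points = torch.cat([occupied, free], dim=0)
    labels = torch.cat([torch.ones(len(occupied)), torch.zeros(len(free))])
    return points, labels
```

Under this reading, the implicit head lets the model answer occupancy queries at arbitrary points rather than on a fixed voxel grid, while the LiDAR-derived point/label pairs stand in for manual voxel annotations.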
Related papers
- VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving [26.557803260279258]
The need for cross-view 3D geometric modeling in autonomous driving is self-evident, yet existing Vision-Language Models inherently lack this capability. We propose a novel architecture, VGGDrive, which empowers Vision-Language Models with cross-view Geometric Grounding for autonomous driving.
arXiv Detail & Related papers (2026-02-24T11:33:44Z) - GA-Drive: Geometry-Appearance Decoupled Modeling for Free-viewpoint Driving Scene Generation [62.07995406671134]
We present GA-Drive, a novel simulation framework capable of generating camera views along user-specified novel trajectories. GA-Drive synthesizes novel pseudo-views using geometry information. These pseudo-views are then transformed into photorealistic views using a trained video diffusion model.
arXiv Detail & Related papers (2026-02-24T08:22:42Z) - GPA-VGGT: Adapting VGGT to Large Scale Localization by Self-Supervised Learning with Geometry and Physics Aware Loss [15.633839321933385]
Recent advancements in Visual Geometry Grounded Transformer (VGGT) models have shown great promise in camera pose estimation and 3D reconstruction. These models typically rely on ground truth labels for training, posing challenges when adapting to unlabeled and unseen scenes. We propose a self-supervised framework to train VGGT with unlabeled data, thereby enhancing its localization capability in large-scale environments.
arXiv Detail & Related papers (2026-01-23T16:46:59Z) - DVGT: Driving Visual Geometry Transformer [63.38483879291505]
A driving-targeted dense geometry perception model can adapt to different scenarios and camera configurations. We propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations.
arXiv Detail & Related papers (2025-12-18T18:59:57Z) - TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction [57.46712611558817]
3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass. Recent strategies align consecutive predictions by solving for a global transformation, yet our analysis reveals their fundamental limitations in assumption validity, local alignment scope, and robustness under noisy geometry. We propose a higher-DOF, long-term alignment framework based on Thin Plate Splines, leveraging globally propagated control points to correct spatially varying inconsistencies (a sketch of this idea appears after this list).
arXiv Detail & Related papers (2025-12-02T02:22:20Z) - DriveVGGT: Visual Geometry Transformer for Autonomous Driving [50.5036123750788]
DriveVGGT is a scale-aware 4D reconstruction framework specifically designed for autonomous driving data. We propose a Temporal Video Attention (TVA) module to process multi-camera videos independently. We then propose a Multi-camera Consistency Attention (MCA) module that performs window attention with normalized relative pose embeddings.
arXiv Detail & Related papers (2025-11-27T09:40:43Z) - UNO: Unified Self-Supervised Monocular Odometry for Platform-Agnostic Deployment [22.92093036869778]
We present UNO, a unified visual odometry framework that enables robust pose estimation across diverse environments. Our approach generalizes effectively across a wide range of real-world scenarios, including autonomous vehicles, aerial drones, mobile robots, and handheld devices. We extensively evaluate our method on three major benchmark datasets.
arXiv Detail & Related papers (2025-06-08T06:30:37Z) - VGGT: Visual Geometry Grounded Transformer [61.37669770946458]
VGGT is a feed-forward neural network that directly infers all key 3D attributes of a scene. The network achieves state-of-the-art results in multiple 3D tasks.
arXiv Detail & Related papers (2025-03-14T17:59:47Z) - Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with gains of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z) - Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training [3.8073142980733]
We propose a novel framework for monocular 3D object detection using only RGB images, called KM3D-Net.
We design a fully convolutional model to predict object keypoints, dimension, and orientation, and then combine these estimates with perspective geometry constraints to compute the position attribute (a sketch of this constraint appears at the end of this list).
arXiv Detail & Related papers (2020-09-02T00:51:51Z) - PerMO: Perceiving More at Once from a Single Image for Autonomous Driving [76.35684439949094]
We present a novel approach to detect, segment, and reconstruct complete textured 3D models of vehicles from a single image.
Our approach combines the strengths of deep learning and the elegance of traditional techniques.
We have integrated these algorithms with an autonomous driving system.
arXiv Detail & Related papers (2020-07-16T05:02:45Z)
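The TALO entry above describes correcting spatially varying drift with a Thin Plate Spline fitted on globally propagated control points. Below is a hedged illustration of that general idea, not TALO's implementation; the control correspondences and drift model are invented for the example, and SciPy's RBF interpolator with a thin-plate kernel stands in for whatever solver the paper uses.

```python
# Illustrative sketch: fit a thin-plate-spline warp on sparse control
# correspondences, then apply it to correct a whole point map. A TPS
# allows smooth, higher-DOF corrections than one rigid transform.
import numpy as np
from scipy.interpolate import RBFInterpolator


def tps_align(src_ctrl: np.ndarray, dst_ctrl: np.ndarray,
              points: np.ndarray) -> np.ndarray:
    """Warp `points` (M, 3) with a TPS fitted on (K, 3) control pairs."""
    warp = RBFInterpolator(src_ctrl, dst_ctrl, kernel="thin_plate_spline")
    return warp(points)


# Example: 20 control correspondences with a small nonrigid drift.
rng = np.random.default_rng(0)
src = rng.uniform(-10, 10, size=(20, 3))
dst = src + 0.05 * np.sin(src)          # spatially varying offset
pts = rng.uniform(-10, 10, size=(500, 3))
aligned = tps_align(src, dst, pts)       # drift-corrected point map
```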
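The KM3D-Net entry mentions combining keypoint, dimension, and orientation estimates with perspective geometry constraints to recover position. A minimal sketch of the classic constraint involved (our illustration, with made-up intrinsics, not the paper's exact formulation): under a pinhole model, an object of physical height H spanning h pixels lies at depth z = f * H / h, after which the 2D keypoint back-projects to 3D.

```python
# Illustrative perspective constraint (not KM3D-Net's code): recover an
# object's 3D position from one keypoint, its estimated physical height,
# and its apparent pixel height.
import numpy as np


def position_from_keypoint(kp_uv, obj_height_m, box_height_px, K):
    """Back-project a 2D keypoint to 3D using the pinhole height prior.

    kp_uv: (u, v) pixel coordinates of the projected object center.
    obj_height_m: estimated physical object height in meters.
    box_height_px: apparent height of the object in pixels.
    K: 3x3 camera intrinsics matrix.
    """
    fy = K[1, 1]
    z = fy * obj_height_m / box_height_px      # depth from z = f * H / h
    uv1 = np.array([kp_uv[0], kp_uv[1], 1.0])
    return z * (np.linalg.inv(K) @ uv1)        # (x, y, z) in camera frame


# Example with hypothetical intrinsics: a 1.6 m tall object spanning
# 80 px lies at roughly 25.3 m depth.
K = np.array([[1266.0, 0.0, 816.0],
              [0.0, 1266.0, 491.0],
              [0.0, 0.0, 1.0]])
print(position_from_keypoint((900.0, 520.0), 1.6, 80.0, K))
```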