VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving
- URL: http://arxiv.org/abs/2602.20794v1
- Date: Tue, 24 Feb 2026 11:33:44 GMT
- Title: VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving
- Authors: Jie Wang, Guang Li, Zhijian Huang, Chenxu Dang, Hangjun Ye, Yahong Han, Long Chen
- Abstract summary: The significance of cross-view 3D geometric modeling capabilities for autonomous driving is self-evident, yet existing Vision-Language Models inherently lack this capability. We propose a novel architecture, VGGDrive, which empowers Vision-language models with cross-view Geometric Grounding for autonomous Driving.
- Score: 26.557803260279258
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The significance of cross-view 3D geometric modeling capabilities for autonomous driving is self-evident, yet existing Vision-Language Models (VLMs) inherently lack this capability, resulting in their mediocre performance. While some promising approaches attempt to mitigate this by constructing Q&A data for auxiliary training, they still fail to fundamentally equip VLMs with the ability to comprehensively handle diverse evaluation protocols. We thus chart a new course, advocating for the infusion of VLMs with the cross-view geometric grounding of mature 3D foundation models, closing this critical capability gap in autonomous driving. In this spirit, we propose a novel architecture, VGGDrive, which empowers Vision-language models with cross-view Geometric Grounding for autonomous Driving. Concretely, to bridge the cross-view 3D geometric features from the frozen visual 3D model with the VLM's 2D visual features, we introduce a plug-and-play Cross-View 3D Geometric Enabler (CVGE). The CVGE decouples the base VLM architecture and effectively empowers the VLM with 3D features through a hierarchical adaptive injection mechanism. Extensive experiments show that VGGDrive enhances base VLM performance across five autonomous driving benchmarks, including tasks like cross-view risk perception, motion prediction, and trajectory planning. It's our belief that mature 3D foundation models can empower autonomous driving tasks through effective integration, and we hope our initial exploration demonstrates the potential of this paradigm to the autonomous driving community.
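The abstract describes the CVGE as a plug-and-play adapter that injects cross-view 3D geometric features from a frozen 3D foundation model into the VLM's 2D visual stream through a hierarchical adaptive injection mechanism. The paper's own code is not reproduced here; the following is a minimal PyTorch sketch of one way such a gated, layer-wise injection could look. All module and parameter names (`GeometricInjectionBlock`, `CrossViewGeometricEnabler`, `num_layers`, the zero-initialized gate) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a CVGE-style adapter: cross-attention from the VLM's
# 2D visual tokens to frozen 3D geometric tokens, injected through a learnable,
# zero-initialized gate at several VLM layers ("hierarchical adaptive injection").
# Names and shapes are assumptions for illustration, not the paper's code.
import torch
import torch.nn as nn


class GeometricInjectionBlock(nn.Module):
    def __init__(self, vlm_dim: int, geo_dim: int, num_heads: int = 8):
        super().__init__()
        self.geo_proj = nn.Linear(geo_dim, vlm_dim)  # map 3D features to the VLM width
        self.cross_attn = nn.MultiheadAttention(vlm_dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))     # zero-init: no injection at step 0
        self.norm = nn.LayerNorm(vlm_dim)

    def forward(self, vis_tokens: torch.Tensor, geo_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N_vis, vlm_dim) 2D visual tokens inside the VLM
        # geo_tokens: (B, N_geo, geo_dim) cross-view tokens from a frozen 3D model
        geo = self.geo_proj(geo_tokens)
        attn_out, _ = self.cross_attn(self.norm(vis_tokens), geo, geo)
        return vis_tokens + torch.tanh(self.gate) * attn_out  # gated residual injection


class CrossViewGeometricEnabler(nn.Module):
    """One injection block per selected VLM layer (hierarchical injection)."""

    def __init__(self, vlm_dim: int, geo_dim: int, num_layers: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [GeometricInjectionBlock(vlm_dim, geo_dim) for _ in range(num_layers)]
        )

    def inject(self, layer_idx: int, vis_tokens, geo_tokens):
        return self.blocks[layer_idx](vis_tokens, geo_tokens)
```

Keeping the 3D encoder frozen and zero-initializing the gates would let such an adapter attach to a pretrained VLM without disturbing its initial behavior, which is one plausible reading of the "plug-and-play" and "decouples the base VLM architecture" claims; the actual CVGE design may differ.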
Related papers
- Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos [20.73513310337503]
Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving.
We propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos.
arXiv Detail & Related papers (2026-02-25T16:38:53Z)
- Visual Implicit Geometry Transformer for Autonomous Driving [7.795200422563638]
We introduce the Visual Implicit Geometry Transformer (ViGT), a geometric model for autonomous driving.
ViGT estimates a continuous 3D occupancy field in bird's-eye view (BEV), addressing domain-specific requirements.
We validate the scalability and generalizability of our approach by training our model on a mixture of five large-scale autonomous driving datasets.
arXiv Detail & Related papers (2026-02-05T11:54:38Z)
- Spatial-aware Vision Language Model for Autonomous Driving [16.149511148218497]
Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models.
Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies.
We propose LVLDrive, a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving.
arXiv Detail & Related papers (2025-12-30T16:35:00Z)
- D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation [66.7166217399105]
Embodied agents face a critical dilemma: end-to-end models lack interpretability and explicit 3D reasoning.
Our model introduces two key innovations: 1) a Dynamic 3D Chain-of-Thought (3D CoT) that unifies planning, grounding, navigation, and question answering within a single 3D-VLM and CoT pipeline; 2) a Synergistic Learning from Fragmented Supervision (SLFS) strategy, which uses a masked autoregressive loss to learn from massive, partially annotated hybrid data.
arXiv Detail & Related papers (2025-12-14T09:53:15Z)
- Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding.
We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs.
Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z)
- VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction [86.82819259860186]
We introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning.
VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding.
arXiv Detail & Related papers (2025-05-26T17:56:30Z)
- DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation [10.296670127024045]
DriveScape is an end-to-end framework for multi-view, 3D condition-guided video generation.
Our Bi-Directional Modulated Transformer (BiMot) ensures precise alignment of 3D structural information.
DriveScape excels in video generation performance, achieving state-of-the-art results on the nuScenes dataset with an FID score of 8.34 and an FVD score of 76.39.
arXiv Detail & Related papers (2024-09-09T09:43:17Z)
- VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving [25.03216574230919]
We propose VQA-Diff, a novel framework that leverages in-the-wild vehicle images to create 3D vehicle assets for autonomous driving.
VQA-Diff exploits the real-world knowledge inherited from the Large Language Model in the Visual Question Answering (VQA) model for robust zero-shot prediction.
We conduct experiments on various datasets, including Pascal 3D+, to demonstrate that VQA-Diff outperforms existing state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2024-07-09T03:09:55Z)
- Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving? [66.6886931183372]
We introduce DETR-style 3D perceptrons as 3D tokenizers, which connect to the LLM through a one-layer linear projector (see the sketch after this list).
Despite its simplicity, Atlas demonstrates superior performance in both 3D detection and ego planning tasks.
arXiv Detail & Related papers (2024-05-28T16:57:44Z)
- HUM3DIL: Semi-supervised Multi-modal 3D Human Pose Estimation for Autonomous Driving [95.42203932627102]
3D human pose estimation is an emerging technology that can enable autonomous vehicles to perceive and understand the subtle and complex behaviors of pedestrians.
Specifically, we embed LiDAR points into pixel-aligned multi-modal features, which we pass through a sequence of Transformer refinement stages.
Our method efficiently makes use of these complementary signals in a semi-supervised fashion and outperforms existing methods by a large margin.
arXiv Detail & Related papers (2022-12-15T11:15:14Z)
- Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)
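As flagged in the 3D-tokenized-LLM (Atlas) entry above, several of the works in this list share a common connector pattern: a pretrained 3D perception head produces a fixed set of query features, and a single linear layer projects them into the LLM's embedding space as extra tokens. The sketch below illustrates that pattern only; the class name, dimensions (`query_dim`, `llm_dim`, `num_queries`), and the prepend-to-text concatenation are assumptions, not the Atlas implementation.

```python
# Minimal sketch of a "3D tokenizer + one-layer linear projector" connector,
# as described in the 3D-tokenized-LLM (Atlas) entry above. A frozen DETR-style
# 3D detector yields per-query features; a single nn.Linear maps them to the
# LLM embedding width so they can be prepended to the text token embeddings.
# All names, shapes, and the concatenation scheme are illustrative assumptions.
import torch
import torch.nn as nn


class ThreeDTokenProjector(nn.Module):
    def __init__(self, query_dim: int = 256, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(query_dim, llm_dim)  # the "one-layer linear projector"

    def forward(self, det_queries: torch.Tensor) -> torch.Tensor:
        # det_queries: (B, num_queries, query_dim) output queries of a frozen
        # DETR-style 3D perception model (e.g. one query per candidate object)
        return self.proj(det_queries)              # (B, num_queries, llm_dim)


# Usage sketch: prepend projected 3D tokens to the text embeddings fed to the LLM.
if __name__ == "__main__":
    B, num_queries, query_dim, llm_dim = 2, 100, 256, 4096
    det_queries = torch.randn(B, num_queries, query_dim)      # stand-in detector output
    text_embeds = torch.randn(B, 32, llm_dim)                 # stand-in text embeddings
    tokens_3d = ThreeDTokenProjector(query_dim, llm_dim)(det_queries)
    llm_inputs = torch.cat([tokens_3d, text_embeds], dim=1)   # (B, 100 + 32, llm_dim)
```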