OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer
- URL: http://arxiv.org/abs/2511.10560v2
- Date: Fri, 14 Nov 2025 04:11:26 GMT
- Title: OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer
- Authors: Haosong Peng, Hao Li, Yalun Dai, Yushi Lan, Yihang Luo, Tianyu Qi, Zhengshen Zhang, Yufeng Zhan, Junfei Zhang, Wenchao Xu, Ziwei Liu
- Abstract summary: General 3D foundation models have started to lead the trend of unifying diverse vision tasks. We introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary modalities during both training and inference.
- Score: 48.16706802780516
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model's representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Comprehensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrate OmniVGGT into vision-language-action (VLA) models. The OmniVGGT-enhanced VLA model not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks.
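The abstract describes two concrete mechanisms: zero-initialized convolutions that inject depth and camera cues into a pretrained backbone, and per-instance random sampling of modality subsets during training. Below is a minimal PyTorch-style sketch of both ideas; it is not the authors' implementation, and the module names (GeoAdapterSketch, zero_conv, sample_modalities), the 6-channel ray-map encoding of camera intrinsics/extrinsics, and all shapes are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of (1) zero-initialized convolutional
# injection of geometric cues and (2) stochastic multimodal fusion.
import random
import torch.nn as nn


def zero_conv(in_ch: int, out_ch: int) -> nn.Conv2d:
    """1x1 convolution whose weight and bias start at zero, so the injected
    residual is exactly zero at initialization."""
    conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


class GeoAdapterSketch(nn.Module):
    """Encodes optional depth maps and camera parameters (here assumed to be
    rasterized into a 6-channel per-pixel ray map) and adds them to backbone
    features as zero-initialized residuals, leaving the foundation model's
    representation untouched at the start of training."""

    def __init__(self, feat_ch: int = 256):
        super().__init__()
        self.depth_enc = nn.Sequential(
            nn.Conv2d(1, feat_ch, 3, padding=1), nn.GELU(),
            zero_conv(feat_ch, feat_ch),
        )
        self.cam_enc = nn.Sequential(
            nn.Conv2d(6, feat_ch, 3, padding=1), nn.GELU(),
            zero_conv(feat_ch, feat_ch),
        )

    def forward(self, feats, depth=None, ray_map=None):
        # Each available cue contributes an additive residual; absent cues
        # contribute nothing, so any subset of modalities is valid input.
        if depth is not None:
            feats = feats + self.depth_enc(depth)
        if ray_map is not None:
            feats = feats + self.cam_enc(ray_map)
        return feats


def sample_modalities(available: dict, keep_prob: float = 0.5) -> dict:
    """Stochastic multimodal fusion: per training instance, keep a random
    subset of the available auxiliary cues (possibly none), so the model
    also masters the RGB-only case instead of overfitting to the cues."""
    return {k: v for k, v in available.items() if random.random() < keep_prob}
```

Because every residual starts at exactly zero, the RGB-only forward pass initially matches the unmodified backbone, which is consistent with the abstract's claims of stable optimization and negligible overhead; the random subset sampling is what allows an arbitrary number of cues, including none, at test time.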
Related papers
- Universal Pose Pretraining for Generalizable Vision-Language-Action Policies [83.39008378156647]
Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency. We propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment.
arXiv Detail & Related papers (2026-02-23T11:00:08Z) - Reloc-VGGT: Visual Re-localization with Geometry Grounded Transformer [40.778996326009185]
We present the first visual localization framework that performs multi-view spatial integration through an early-fusion mechanism. Our framework is built upon the VGGT backbone, which encodes multi-view 3D geometry. We propose a novel sparse mask attention strategy that reduces computational cost by avoiding the quadratic complexity of global attention (a generic sketch of such masking appears after this list).
arXiv Detail & Related papers (2025-12-26T06:12:17Z) - SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models [13.88629412035865]
Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning. We propose SpaceMind, a multimodal large language model explicitly designed for spatial reasoning solely from RGB inputs.
arXiv Detail & Related papers (2025-11-28T11:04:21Z) - Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding [11.222744122842023]
We introduce a plug-and-play module that implicitly incorporates 3D geometry features into Vision-Language-Action (VLA) models. Our method significantly improves the performance of state-of-the-art VLA models across diverse scenarios.
arXiv Detail & Related papers (2025-07-01T04:05:47Z) - AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection [49.81255045696323]
We present the Auxiliary Metadata Driven Infrared Small Target Detector (AuxDet). AuxDet integrates metadata semantics with visual features, guiding adaptive representation learning for each sample. Experiments on the challenging WideIRSTD-Full benchmark demonstrate that AuxDet consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2025-05-21T07:02:05Z) - Extrapolated Urban View Synthesis Benchmark [53.657271730352214]
Photorealistic simulators are essential for the training and evaluation of vision-centric autonomous vehicles (AVs). At their core is Novel View Synthesis (NVS), a capability that generates diverse unseen viewpoints to accommodate the broad and continuous pose distribution of AVs. Recent advances in radiance fields, such as 3D Gaussian Splatting, achieve photorealistic rendering at real-time speeds and have been widely used in modeling large-scale driving scenes. We will release the data to help advance self-driving and urban robotics simulation technology.
arXiv Detail & Related papers (2024-12-06T18:41:39Z) - Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z) - Unifying Voxel-based Representation with Transformer for 3D Object Detection [143.91910747605107]
We present a unified framework for multi-modality 3D object detection, named UVTR.
The proposed method aims to unify multi-modality representations in the voxel space for accurate and robust single- or cross-modality 3D detection.
UVTR achieves leading performance in the nuScenes test set with 69.7%, 55.1%, and 71.1% NDS for LiDAR, camera, and multi-modality inputs, respectively.
arXiv Detail & Related papers (2022-06-01T17:02:40Z)
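The Reloc-VGGT summary above mentions a sparse mask attention that avoids the quadratic cost of global attention. The blurb does not specify the masking pattern, so the sketch below (forward-referenced from that entry) uses a generic local-window mask purely as an illustration; it is not that paper's method.

```python
# Generic sketch of masked attention with a local-window sparsity pattern.
# Note: this dense version still materializes the full score matrix, so it
# illustrates the mask semantics, not the complexity saving; an efficient
# implementation would use block-sparse kernels (or PyTorch's
# F.scaled_dot_product_attention with an attn_mask) so that blocked pairs
# are never computed at all.
import torch
import torch.nn.functional as F


def local_window_mask(n_tokens: int, window: int) -> torch.Tensor:
    """Boolean mask; True marks pairs that are *blocked* from attending.
    Each token may only attend within `window` positions of itself."""
    idx = torch.arange(n_tokens)
    return (idx[None, :] - idx[:, None]).abs() > window


def masked_attention(q, k, v, blocked):
    # q, k, v: (batch, heads, n_tokens, head_dim); blocked broadcasts over
    # the leading dimensions. Blocked pairs get -inf before the softmax.
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(blocked, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```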