VIM-GS: Visual-Inertial Monocular Gaussian Splatting via Object-level Guidance in Large Scenes
- URL: http://arxiv.org/abs/2509.06685v3
- Date: Wed, 10 Sep 2025 01:06:59 GMT
- Title: VIM-GS: Visual-Inertial Monocular Gaussian Splatting via Object-level Guidance in Large Scenes
- Authors: Shengkai Zhang, Yuhe Liu, Guanjun Wu, Jianhua He, Xinggang Wang, Mozi Chen, Kezhong Liu,
- Abstract summary: VIM-GS is a framework using monocular images for novel-view synthesis (NVS) in large scenes.<n>This paper aims to generate dense, accurate depth images from monocular RGB inputs for high-definite GS rendering.
- Score: 42.946624621872274
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: VIM-GS is a Gaussian Splatting (GS) framework using monocular images for novel-view synthesis (NVS) in large scenes. GS typically requires accurate depth to initiate Gaussian ellipsoids using RGB-D/stereo cameras. Their limited depth sensing range makes it difficult for GS to work in large scenes. Monocular images, however, lack depth to guide the learning and lead to inferior NVS results. Although large foundation models (LFMs) for monocular depth estimation are available, they suffer from cross-frame inconsistency, inaccuracy for distant scenes, and ambiguity in deceptive texture cues. This paper aims to generate dense, accurate depth images from monocular RGB inputs for high-definite GS rendering. The key idea is to leverage the accurate but sparse depth from visual-inertial Structure-from-Motion (SfM) to refine the dense but coarse depth from LFMs. To bridge the sparse input and dense output, we propose an object-segmented depth propagation algorithm that renders the depth of pixels of structured objects. Then we develop a dynamic depth refinement module to handle the crippled SfM depth of dynamic objects and refine the coarse LFM depth. Experiments using public and customized datasets demonstrate the superior rendering quality of VIM-GS in large scenes.
Related papers
- Masked Depth Modeling for Spatial Perception [44.0326843862591]
LingBot-Depth is a depth completion model that refines depth maps through masked depth modeling.<n>It outperforms top-tier RGB-D cameras in terms of both depth precision and pixel coverage.<n>We release the code, checkpoint, and 3M RGB-depth pairs to the community of spatial perception.
arXiv Detail & Related papers (2026-01-25T16:13:49Z) - StarryGazer: Leveraging Monocular Depth Estimation Models for Domain-Agnostic Single Depth Image Completion [56.28564075246147]
StarryGazer is a framework that predicts dense depth images from a single sparse depth image and an RGB image.<n>We employ a pre-trained MDE model to produce relative depth images.<n>A refinement network is trained with the synthetic pairs, incorporating the relative depth maps and RGB images to improve the model's accuracy and robustness.
arXiv Detail & Related papers (2025-12-15T09:56:09Z) - Revisiting Monocular 3D Object Detection with Depth Thickness Field [44.4805861813093]
We present MonoDTF, a scene-to-instance depth-adapted network for monocular 3D object detection.<n>The framework mainly comprises a Scene-Level Depth Retargeting (SDR) module and an Instance-Level Spatial Refinement (ISR) module.<n>The latter refines the voxel space with the guidance of instances, enhancing the 3D instance-aware capability of the depth thickness field.
arXiv Detail & Related papers (2024-12-26T10:51:50Z) - Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion [51.69876947593144]
Existing methods for depth completion operate in tightly constrained settings.<n>Inspired by advances in monocular depth estimation, we reframe depth completion as an image-conditional depth map generation.<n>Marigold-DC builds on a pretrained latent diffusion model for monocular depth estimation and injects the depth observations as test-time guidance.
arXiv Detail & Related papers (2024-12-18T00:06:41Z) - Depth-Relative Self Attention for Monocular Depth Estimation [23.174459018407003]
deep neural networks rely on various visual hints such as size, shade, and texture extracted from RGB information.
We propose a novel depth estimation model named RElative Depth Transformer (RED-T) that uses relative depth as guidance in self-attention.
We show that the proposed model achieves competitive results in monocular depth estimation benchmarks and is less biased to RGB information.
arXiv Detail & Related papers (2023-04-25T14:20:31Z) - Crafting Monocular Cues and Velocity Guidance for Self-Supervised
Multi-Frame Depth Learning [22.828829870704006]
Self-supervised monocular methods can efficiently learn depth information of weakly textured surfaces or reflective objects.
In contrast, multi-frame depth estimation methods improve the depth accuracy thanks to the success of Multi-View Stereo.
We propose MOVEDepth, which exploits the MOnocular cues and VE guidance to improve multi-frame depth learning.
arXiv Detail & Related papers (2022-08-19T06:32:06Z) - Improving Monocular Visual Odometry Using Learned Depth [84.05081552443693]
We propose a framework to exploit monocular depth estimation for improving visual odometry (VO)
The core of our framework is a monocular depth estimation module with a strong generalization capability for diverse scenes.
Compared with current learning-based VO methods, our method demonstrates a stronger generalization ability to diverse scenes.
arXiv Detail & Related papers (2022-04-04T06:26:46Z) - MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection [57.969536140562674]
We introduce the first DETR framework for Monocular DEtection with a depth-guided TRansformer, named MonoDETR.<n>We formulate 3D object candidates as learnable queries and propose a depth-guided decoder to conduct object-scene depth interactions.<n>On KITTI benchmark with monocular images as input, MonoDETR achieves state-of-the-art performance and requires no extra dense depth annotations.
arXiv Detail & Related papers (2022-03-24T19:28:54Z) - Joint Learning of Salient Object Detection, Depth Estimation and Contour
Extraction [91.43066633305662]
We propose a novel multi-task and multi-modal filtered transformer (MMFT) network for RGB-D salient object detection (SOD)
Specifically, we unify three complementary tasks: depth estimation, salient object detection and contour estimation. The multi-task mechanism promotes the model to learn the task-aware features from the auxiliary tasks.
Experiments show that it not only significantly surpasses the depth-based RGB-D SOD methods on multiple datasets, but also precisely predicts a high-quality depth map and salient contour at the same time.
arXiv Detail & Related papers (2022-03-09T17:20:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.