FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM
- URL: http://arxiv.org/abs/2512.25008v2
- Date: Thu, 01 Jan 2026 17:02:51 GMT
- Title: FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM
- Authors: Yuchen Wu, Jiahe Li, Fabio Tosi, Matteo Poggi, Jin Zheng, Xiao Bai
- Abstract summary: FoundationSLAM is a learning-based monocular dense SLAM system for accurate and robust tracking and mapping. Our core idea is to bridge flow estimation with geometric reasoning by leveraging guidance from foundation depth models.
- Score: 50.9765003472032
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present FoundationSLAM, a learning-based monocular dense SLAM system that addresses the lack of geometric consistency in previous flow-based approaches to achieve accurate and robust tracking and mapping. Our core idea is to bridge flow estimation with geometric reasoning by leveraging guidance from foundation depth models. To this end, we first develop a Hybrid Flow Network that produces geometry-aware correspondences, enabling consistent depth and pose inference across diverse keyframes. To enforce global consistency, we propose a Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe poses and depths under multi-view constraints. Furthermore, we introduce a Reliability-Aware Refinement mechanism that dynamically adapts the flow update process by distinguishing between reliable and uncertain regions, forming a closed feedback loop between matching and optimization. Extensive experiments show that FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets while running in real time at 18 FPS, demonstrating strong generalization to diverse scenarios and the practical applicability of our method.
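As a rough, illustrative sketch of the closed loop the abstract describes (flow-based matching guided by a foundation depth prior, a bundle-adjustment step over keyframe poses and depths, and a reliability mask gating the next flow update), the control flow might look as follows. This is not the authors' implementation; every function name, shape, and update rule below is an assumption, with trivial placeholders standing in for the learned components.

```python
# Minimal sketch of the matching <-> optimization feedback loop; all names and
# update rules are illustrative assumptions, not FoundationSLAM's actual code.
import numpy as np

def foundation_depth_prior(image):
    """Stand-in for a depth foundation model; returns a per-pixel relative depth map."""
    h, w = image.shape[:2]
    return np.ones((h, w), dtype=np.float64)

def estimate_flow(src, dst, depth_prior, reliability):
    """Hybrid-flow stand-in: a correspondence field gated by the reliability mask."""
    h, w = src.shape[:2]
    flow = np.zeros((h, w, 2))                      # placeholder geometry-aware flow
    return flow * reliability[..., None]

def bundle_adjust(poses, depths, flows):
    """Toy multi-view consistency step: pull keyframe depths toward agreement."""
    mean_depth = np.mean(depths, axis=0)
    depths = [0.5 * d + 0.5 * mean_depth for d in depths]
    return poses, depths                            # poses left untouched in this toy

def reliability_mask(depths, tau=0.1):
    """Keep pixels whose cross-keyframe depth disagreement stays below tau."""
    residual = np.abs(depths[0] - depths[-1])
    return (residual < tau).astype(np.float64)

def run_slam(keyframes, n_iters=3):
    poses = [np.eye(4) for _ in keyframes]          # 4x4 camera-pose placeholders
    depths = [foundation_depth_prior(f) for f in keyframes]
    reliability = np.ones_like(depths[0])
    for _ in range(n_iters):                        # closed matching <-> optimization loop
        flows = [estimate_flow(keyframes[i], keyframes[i + 1], depths[i], reliability)
                 for i in range(len(keyframes) - 1)]
        poses, depths = bundle_adjust(poses, depths, flows)
        reliability = reliability_mask(depths)      # residuals gate the next flow update
    return poses, depths

if __name__ == "__main__":
    frames = [np.zeros((48, 64, 3)) for _ in range(4)]
    poses, depths = run_slam(frames)
    print(len(poses), depths[0].shape)              # -> 4 (48, 64)
```

The only point of the sketch is the loop structure: flow informs the joint pose/depth optimization, and the optimization residuals feed back into the next matching round.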
Related papers
- Keyframe-Based Feed-Forward Visual Odometry [13.646685343885556]
Current foundation-model-based methods typically process raw image sequences indiscriminately. We propose a novel feed-forward VO method that employs reinforcement learning to derive an adaptive visual policy in a data-driven manner. Experimental results demonstrate that the proposed method achieves consistent and substantial improvements over state-of-the-art feed-forward VO methods.
arXiv Detail & Related papers (2026-01-22T14:45:42Z)
- Depth-Consistent 3D Gaussian Splatting via Physical Defocus Modeling and Multi-View Geometric Supervision [12.972772139292957]
This paper proposes a novel computational framework that integrates depth-of-field supervision and multi-view consistency supervision. By unifying defocus physics with multi-view geometric constraints, our method achieves superior depth fidelity, demonstrating a 0.8 dB PSNR improvement over the state-of-the-art method.
arXiv Detail & Related papers (2025-11-13T13:51:16Z)
- Follow My Hold: Hand-Object Interaction Reconstruction through Geometric Guidance [61.41904916189093]
We propose a novel diffusion-based framework for reconstructing the 3D geometry of hand-held objects from monocular RGB images. We use hand-object interaction as geometric guidance to ensure plausible hand-object interactions.
arXiv Detail & Related papers (2025-08-25T17:11:53Z)
- JointSplat: Probabilistic Joint Flow-Depth Optimization for Sparse-View Gaussian Splatting [10.690965024885358]
Reconstructing 3D scenes from sparse viewpoints is a long-standing challenge with wide applications. Recent advances in feed-forward 3D Gaussian sparse-view reconstruction methods provide an efficient solution for real-time novel view synthesis. We propose JointSplat, a unified framework that leverages the complementarity between optical flow and depth.
arXiv Detail & Related papers (2025-06-04T12:04:40Z)
- Depth Anything with Any Prior [64.39991799606146]
Prior Depth Anything is a framework that combines incomplete but precise metric information in depth measurement with relative but complete geometric structures in depth prediction. We develop a conditioned monocular depth estimation (MDE) model to refine the inherent noise of depth priors. Our model showcases impressive zero-shot generalization across depth completion, super-resolution, and inpainting over 7 real-world datasets.
arXiv Detail & Related papers (2025-05-15T17:59:50Z)
- DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion [53.70278210626701]
We propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame. We empirically validate DiffusionSfM on both synthetic and real datasets, demonstrating that it outperforms classical and learning-based approaches.
arXiv Detail & Related papers (2025-05-08T17:59:47Z)
- DuCos: Duality Constrained Depth Super-Resolution via Foundation Model [56.88399488384106]
We introduce DuCos, a novel depth super-resolution framework grounded in Lagrangian duality theory. DuCos is the first to significantly improve generalization across diverse scenarios with foundation models as prompts.
arXiv Detail & Related papers (2025-03-06T07:36:45Z)
- Relative Pose Estimation through Affine Corrections of Monocular Depth Priors [69.59216331861437]
We develop three solvers for relative pose estimation that explicitly account for independent affine (scale and shift) ambiguities. We propose a hybrid estimation pipeline that combines our proposed solvers with classic point-based solvers and epipolar constraints.
arXiv Detail & Related papers (2025-01-09T18:58:30Z)
- Exploiting Correspondences with All-pairs Correlations for Multi-view Depth Estimation [19.647670347925754]
Multi-view depth estimation plays a critical role in reconstructing and understanding the 3D world.
We design a novel iterative multi-view depth estimation framework mimicking the optimization process.
We conduct extensive experiments on ScanNet, DeMoN, ETH3D, and 7Scenes to demonstrate the superiority of our method.
arXiv Detail & Related papers (2022-05-05T07:38:31Z)