SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos
- URL: http://arxiv.org/abs/2412.09401v3
- Date: Sun, 23 Mar 2025 17:01:39 GMT
- Title: SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos
- Authors: Yuzheng Liu, Siyan Dong, Shuzhe Wang, Yingda Yin, Yanchao Yang, Qingnan Fan, Baoquan Chen
- Abstract summary: SLAM3R is a novel system for real-time, high-quality, dense 3D reconstruction using RGB videos. It seamlessly integrates local 3D reconstruction and global coordinate registration through feed-forward neural networks. It achieves state-of-the-art reconstruction accuracy and completeness while maintaining real-time performance at 20+ FPS.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce SLAM3R, a novel and effective system for real-time, high-quality, dense 3D reconstruction using RGB videos. SLAM3R provides an end-to-end solution by seamlessly integrating local 3D reconstruction and global coordinate registration through feed-forward neural networks. Given an input video, the system first converts it into overlapping clips using a sliding window mechanism. Unlike traditional pose optimization-based methods, SLAM3R directly regresses 3D pointmaps from RGB images in each window and progressively aligns and deforms these local pointmaps to create a globally consistent scene reconstruction - all without explicitly solving any camera parameters. Experiments across datasets consistently show that SLAM3R achieves state-of-the-art reconstruction accuracy and completeness while maintaining real-time performance at 20+ FPS. Code available at: https://github.com/PKU-VCL-3DV/SLAM3R.
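The abstract describes a compact loop: window the video, regress pointmaps per window, register incrementally. Below is a minimal Python sketch of that control flow, assuming two placeholder callables (`local_net`, `registration_net`) that stand in for the system's local-reconstruction and global-registration networks; it is an illustration, not the actual SLAM3R API (see the linked repository for that).

```python
# Schematic sketch of the SLAM3R-style pipeline: split the video into
# overlapping clips, regress local pointmaps per clip, and register them
# incrementally into one global scene. `local_net` and `registration_net`
# are hypothetical placeholders, not SLAM3R's real modules.

def sliding_windows(frames, window_size=11, stride=2):
    """Yield overlapping clips from a list of video frames."""
    for start in range(0, max(len(frames) - window_size + 1, 1), stride):
        yield frames[start:start + window_size]

def reconstruct(frames, local_net, registration_net):
    scene = []  # growing set of globally registered pointmaps
    for clip in sliding_windows(frames):
        # Feed-forward regression of per-pixel 3D pointmaps in the clip's
        # local coordinate frame; no camera parameters are solved.
        local_pointmaps = local_net(clip)
        # Align and deform the local pointmaps into the global frame,
        # conditioned on the scene reconstructed so far.
        scene.append(registration_net(local_pointmaps, scene))
    return scene
```

Note the key design choice the abstract emphasizes: both steps are feed-forward network calls, so there is no per-frame pose optimization in the loop.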
Related papers
- St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World [106.91539872943864]
St4RTrack is a framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs.
We predict both pointmaps at the same moment and in the same world coordinate frame, capturing both static and dynamic scene geometry.
We establish a new extensive benchmark for world-frame reconstruction and tracking, demonstrating the effectiveness and efficiency of our unified, data-driven framework.
arXiv Detail & Related papers (2025-04-17T17:55:58Z)
- PanoSLAM: Panoptic 3D Scene Reconstruction via Gaussian SLAM [105.01907579424362]
PanoSLAM is the first SLAM system to integrate geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation within a unified framework.
For the first time, it achieves panoptic 3D reconstruction of open-world environments directly from RGB-D video.
arXiv Detail & Related papers (2024-12-31T08:58:10Z)
- HI-SLAM2: Geometry-Aware Gaussian SLAM for Fast Monocular Scene Reconstruction [38.47566815670662]
HI-SLAM2 is a geometry-aware Gaussian SLAM system that achieves fast and accurate monocular scene reconstruction using only RGB input. We demonstrate significant improvements over existing neural SLAM methods and even surpass RGB-D-based methods in both reconstruction and rendering quality.
arXiv Detail & Related papers (2024-11-27T01:39:21Z)
- Splat-SLAM: Globally Optimized RGB-only SLAM with 3D Gaussians [87.48403838439391]
3D Gaussian Splatting has emerged as a powerful representation of geometry and appearance for RGB-only dense SLAM.
We propose the first RGB-only SLAM system with a dense 3D Gaussian map representation.
Our experiments on the Replica, TUM-RGBD, and ScanNet datasets indicate the effectiveness of globally optimized 3D Gaussians.
arXiv Detail & Related papers (2024-05-26T12:26:54Z)
- GlORIE-SLAM: Globally Optimized RGB-only Implicit Encoding Point Cloud SLAM [53.6402869027093]
We propose an efficient RGB-only dense SLAM system using a flexible neural point cloud scene representation.
We also introduce a novel DSPO layer for bundle adjustment, which jointly optimizes pose and depth along with the scale of the monocular depth (see the sketch below).
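As a rough illustration of what such a bundle-adjustment layer does, the sketch below jointly refines poses, per-frame depth, and a per-frame scale/shift correction for a monocular depth prior. It is a generic, schematic stand-in, not GlORIE-SLAM's actual DSPO layer; `reproj_residual` and the tensor shapes are assumptions.

```python
import torch

def dspo_style_refinement(poses, depths, mono_prior, reproj_residual,
                          iters=100, lr=1e-3):
    """poses: (K, 6) pose parameters; depths, mono_prior: (K, H, W)."""
    poses = poses.detach().clone().requires_grad_(True)
    depths = depths.detach().clone().requires_grad_(True)
    scale = torch.ones(len(depths), requires_grad=True)   # per-frame scale
    shift = torch.zeros(len(depths), requires_grad=True)  # per-frame shift
    opt = torch.optim.Adam([poses, depths, scale, shift], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        # Geometric term: user-supplied reprojection residuals between frames.
        loss = reproj_residual(poses, depths).square().mean()
        # Prior term: keep optimized depth close to the rescaled mono prior.
        prior = scale[:, None, None] * mono_prior + shift[:, None, None]
        loss = loss + 0.1 * (depths - prior).abs().mean()
        loss.backward()
        opt.step()
    return poses.detach(), depths.detach()
```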
arXiv Detail & Related papers (2024-03-28T16:32:06Z)
- Loopy-SLAM: Dense Neural SLAM with Loop Closures [53.11936461015725]
We introduce Loopy-SLAM, which globally optimizes poses and the dense 3D model.
We perform frame-to-model tracking with a data-driven, point-based submap generation method, and trigger loop closures online by performing global place recognition (see the sketch below).
Evaluation on the synthetic Replica and real-world TUM-RGBD and ScanNet datasets demonstrates competitive or superior performance in tracking, mapping, and rendering accuracy compared to existing dense neural RGB-D SLAM methods.
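The loop-closure pattern described above, global place recognition triggering a pose-graph update, can be sketched as follows. Descriptor matching by dot product and the `add_loop_edge`/`optimize` calls are illustrative assumptions, not Loopy-SLAM's API.

```python
import numpy as np

def maybe_close_loop(query_descriptor, submaps, pose_graph, threshold=0.8):
    """query_descriptor: unit-norm global descriptor of the active submap.
    submaps: list of (descriptor, submap_id) pairs for earlier submaps."""
    for descriptor, submap_id in submaps:
        similarity = float(np.dot(query_descriptor, descriptor))
        if similarity > threshold:
            # A match: register the two submaps, add the resulting
            # relative-pose constraint as a loop edge, and re-optimize.
            pose_graph.add_loop_edge(submap_id)   # hypothetical API
            pose_graph.optimize()                 # globally correct poses
            return submap_id
    return None
```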
arXiv Detail & Related papers (2024-02-14T18:18:32Z)
- GO-SLAM: Global Optimization for Consistent 3D Instant Reconstruction [45.49960166785063]
GO-SLAM is a deep-learning-based dense visual SLAM framework that globally optimizes poses and 3D reconstruction in real time.
Results on various synthetic and real-world datasets demonstrate that GO-SLAM outperforms state-of-the-art approaches in tracking robustness and reconstruction accuracy.
arXiv Detail & Related papers (2023-09-05T17:59:58Z)
- NICER-SLAM: Neural Implicit Scene Encoding for RGB SLAM [111.83168930989503]
NICER-SLAM is a dense RGB SLAM system that simultaneously optimizes camera poses and a hierarchical neural implicit map representation (see the sketch below).
We show strong performance in dense mapping, tracking, and novel view synthesis, competitive even with recent RGB-D SLAM systems.
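A "hierarchical neural implicit map" generally means multi-resolution feature grids decoded by a small MLP. The sketch below shows a generic query of such a map; the grid shapes and `decoder` are assumptions, and it illustrates the general technique rather than NICER-SLAM's implementation.

```python
import torch
import torch.nn.functional as F

def query_hierarchical_map(points, grids, decoder):
    """points: (N, 3) in [-1, 1]^3; grids: list of (1, C, D, H, W) feature
    volumes from coarse to fine; decoder: MLP over concatenated features."""
    coords = points.view(1, -1, 1, 1, 3)             # grid_sample layout
    feats = []
    for g in grids:
        # Trilinearly interpolate each level's features at the query points.
        sampled = F.grid_sample(g, coords, mode="bilinear",
                                align_corners=True)  # (1, C, N, 1, 1)
        feats.append(sampled.reshape(g.shape[1], -1).t())  # (N, C)
    # Coarse levels capture scene layout; fine levels add detail.
    return decoder(torch.cat(feats, dim=-1))         # e.g. SDF + color
```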
arXiv Detail & Related papers (2023-02-07T17:06:34Z)
- ESLAM: Efficient Dense SLAM System Based on Hybrid Representation of Signed Distance Fields [2.0625936401496237]
ESLAM reads RGB-D frames with unknown camera poses in a sequential manner and incrementally reconstructs the scene representation.
ESLAM improves the accuracy of 3D reconstruction and camera localization over state-of-the-art dense visual SLAM methods by more than 50%.
arXiv Detail & Related papers (2022-11-21T18:25:14Z)
- NeuralRecon: Real-Time Coherent 3D Reconstruction from Monocular Video [41.554961144321474]
We propose to sequentially reconstruct local surfaces, represented as sparse TSDF volumes, for each video fragment with a neural network.
A learning-based TSDF fusion module guides the network to fuse features from previous fragments (see the sketch after this entry).
Experiments on ScanNet and 7-Scenes datasets show that our system outperforms state-of-the-art methods in terms of both accuracy and speed.
arXiv Detail & Related papers (2021-04-01T17:59:46Z)
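The fragment-wise reconstruction with learned fusion described in the entry above reduces to the following control flow; `backbone` and `fusion_gru` are hypothetical placeholders rather than NeuralRecon's actual modules.

```python
# Minimal sketch of fragment-wise TSDF reconstruction with learned fusion:
# each video fragment is mapped to a sparse TSDF volume, and a recurrent
# fusion module conditions the prediction on state carried over from
# previous fragments so the volumes stay coherent.

def fragments(frames, size=9):
    """Split the video into consecutive, non-overlapping fragments."""
    for start in range(0, len(frames), size):
        yield frames[start:start + size]

def reconstruct_tsdf(frames, backbone, fusion_gru):
    hidden, volumes = None, []
    for fragment in fragments(frames):
        features = backbone(fragment)           # per-fragment 3D features
        # Learned fusion: the recurrent state injects geometry from
        # previous fragments into the current prediction.
        tsdf, hidden = fusion_gru(features, hidden)
        volumes.append(tsdf)                    # sparse local TSDF volume
    return volumes
```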
This list is automatically generated from the titles and abstracts of the papers on this site.