S-MUSt3R: Sliding Multi-view 3D Reconstruction
- URL: http://arxiv.org/abs/2602.04517v1
- Date: Wed, 04 Feb 2026 13:07:14 GMT
- Title: S-MUSt3R: Sliding Multi-view 3D Reconstruction
- Authors: Leonid Antsfeld, Boris Chidlovskii, Yohann Cabon, Vincent Leroy, Jerome Revaud,
- Abstract summary: This work proposes S-MUSt3R, a simple and efficient pipeline that extends the limits of foundation models for monocular 3D reconstruction. We show that S-MUSt3R runs successfully on long RGB sequences and produces accurate and consistent 3D reconstruction.
- Score: 17.018626984951823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent paradigm shift in 3D vision led to the rise of foundation models with remarkable capabilities in 3D perception from uncalibrated images. However, extending these models to large-scale RGB-stream 3D reconstruction remains challenging due to memory limitations. This work proposes S-MUSt3R, a simple and efficient pipeline that extends the limits of foundation models for monocular 3D reconstruction. Our approach addresses the scalability bottleneck of foundation models through a simple strategy of sequence segmentation followed by segment alignment and lightweight loop-closure optimization. Without retraining the model, we benefit from the remarkable 3D reconstruction capabilities of MUSt3R and achieve trajectory and reconstruction performance comparable to traditional methods with more complex architectures. We evaluate S-MUSt3R on TUM, 7-Scenes and proprietary robot-navigation datasets and show that S-MUSt3R runs successfully on long RGB sequences and produces accurate and consistent 3D reconstructions. Our results highlight the potential of leveraging the MUSt3R model for scalable monocular 3D scene reconstruction in real-world settings, with the important advantage of making predictions directly in metric space.
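The pipeline the abstract describes (segment the stream, reconstruct each segment, align consecutive segments) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the segmentation parameters, the Kabsch-based rigid alignment on the pointmaps of shared frames, and the function names are all assumptions made for the example. No scale term is estimated, since the abstract states that predictions are made directly in metric space.

```python
import numpy as np

def split_into_segments(n_frames: int, seg_len: int, overlap: int) -> list[list[int]]:
    """Split frame indices 0..n_frames-1 into consecutive segments that
    share `overlap` frames, so that neighbouring segments can be aligned."""
    segments, start = [], 0
    while start < n_frames:
        end = min(start + seg_len, n_frames)
        segments.append(list(range(start, end)))
        if end == n_frames:
            break
        start = end - overlap  # re-use the last `overlap` frames
    return segments

def rigid_align(src: np.ndarray, dst: np.ndarray):
    """Least-squares rigid transform (R, t) such that dst ~ src @ R.T + t,
    computed with the Kabsch algorithm on corresponding 3D points
    (here: points predicted for the frames two segments share)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t
```

In a full pipeline, each segment would be passed to the reconstruction model, `rigid_align` would be run on the 3D points predicted for the overlapping frames of consecutive segments, and a lightweight loop-closure step would then redistribute the residual drift over the chained transforms.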
Related papers
- tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction [47.43504457409347]
tttLRM is a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer. Our framework efficiently compresses multiple image observations into the fast weights of the TTT layer. An online-learning variant of our model supports progressive 3D reconstruction and refinement from streaming observations.
arXiv Detail & Related papers (2026-02-23T18:59:45Z) - KaoLRM: Repurposing Pre-trained Large Reconstruction Models for Parametric 3D Face Reconstruction [51.67605823241639]
KaoLRM re-targets the learned prior of the Large Reconstruction Model (LRM) for parametric 3D face reconstruction from single-view images. Experiments on both controlled and in-the-wild benchmarks demonstrate that KaoLRM achieves superior reconstruction accuracy and cross-view consistency.
arXiv Detail & Related papers (2026-01-19T05:36:59Z) - AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend [18.645700170943975]
AMB3R is a feed-forward model for dense 3D reconstruction at metric scale. We show that AMB3R can be seamlessly extended to uncalibrated visual odometry (online) or large-scale structure from motion.
arXiv Detail & Related papers (2025-11-25T14:23:04Z) - MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts [50.37005070020306]
MoRE is a dense 3D visual foundation model based on a Mixture-of-Experts (MoE) architecture. MoRE incorporates a confidence-based depth refinement module that stabilizes and refines geometric estimation. It integrates dense semantic features with globally aligned 3D backbone representations for high-fidelity surface normal prediction.
arXiv Detail & Related papers (2025-10-31T06:54:27Z) - STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer [72.88105562624838]
We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments.
arXiv Detail & Related papers (2025-08-14T17:58:05Z) - DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos [52.46386528202226]
We introduce the Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM). It is the first feed-forward method predicting deformable 3D Gaussian splats from a monocular posed video of any dynamic scene. It achieves performance on par with state-of-the-art monocular video 3D tracking methods.
arXiv Detail & Related papers (2025-06-11T17:59:58Z) - Regist3R: Incremental Registration with Stereo Foundation Model [22.636140424781455]
Multi-view 3D reconstruction has remained an essential yet challenging problem in the field of computer vision. We propose Regist3R, a novel stereo foundation model tailored for efficient and scalable incremental reconstruction. We evaluate Regist3R on public datasets for camera pose estimation and 3D reconstruction.
arXiv Detail & Related papers (2025-04-16T02:46:53Z) - EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis [61.1662426227688]
Existing NeRF and 3DGS-based methods show promising results in achieving photorealistic renderings but require slow, per-scene optimization. We introduce EVolSplat, an efficient 3D Gaussian Splatting model for urban scenes that works in a feed-forward manner.
arXiv Detail & Related papers (2025-03-26T02:47:27Z) - UVRM: A Scalable 3D Reconstruction Model from Unposed Videos [68.34221167200259]
Training 3D reconstruction models with 2D visual data traditionally requires prior knowledge of camera poses for the training samples. We introduce UVRM, a novel 3D reconstruction model capable of being trained and evaluated on monocular videos without requiring any information about the pose.
arXiv Detail & Related papers (2025-01-16T08:00:17Z) - Learning monocular 3D reconstruction of articulated categories from motion [39.811816510186475]
Video self-supervision forces the consistency of consecutive 3D reconstructions by a motion-based cycle loss.
We introduce an interpretable model of 3D template deformations that controls a 3D surface through the displacement of a small number of local, learnable handles.
We obtain state-of-the-art reconstructions with diverse shapes, viewpoints and textures for multiple articulated object categories.
arXiv Detail & Related papers (2021-03-30T13:50:27Z)