MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry
- URL: http://arxiv.org/abs/2603.02351v1
- Date: Mon, 02 Mar 2026 19:49:25 GMT
- Title: MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry
- Authors: Leo Kaixuan Cheng, Abdus Shaikh, Ruofan Liang, Zhijie Wu, Yushi Guan, Nandita Vijaykumar,
- Abstract summary: We introduce MERG3R, a training-free divide-and-conquer framework for geometric foundation models. MERG3R partitions unordered images into overlapping, geometrically diverse subsets that can be reconstructed independently. It then merges the resulting local reconstructions through an efficient global alignment and confidence-weighted bundle adjustment procedure. Across large-scale datasets, including 7-Scenes, NRGBD, Tanks & Temples, and Cambridge Landmarks, MERG3R consistently improves reconstruction accuracy, memory efficiency, and scalability.
- Score: 6.060187129166582
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advancements in neural visual geometry, including transformer-based models such as VGGT and Pi3, have achieved impressive accuracy on 3D reconstruction tasks. However, their reliance on full attention makes them fundamentally limited by GPU memory capacity, preventing them from scaling to large, unordered image collections. We introduce MERG3R, a training-free divide-and-conquer framework that enables geometric foundation models to operate far beyond their native memory limits. MERG3R first reorders and partitions unordered images into overlapping, geometrically diverse subsets that can be reconstructed independently. It then merges the resulting local reconstructions through an efficient global alignment and confidence-weighted bundle adjustment procedure, producing a globally consistent 3D model. Our framework is model-agnostic and can be paired with existing neural geometry models. Across large-scale datasets, including 7-Scenes, NRGBD, Tanks & Temples, and Cambridge Landmarks, MERG3R consistently improves reconstruction accuracy, memory efficiency, and scalability, enabling high-quality reconstruction when the dataset exceeds memory capacity limits.
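To make the pipeline concrete, here is a minimal, hypothetical sketch of the divide-and-conquer flow the abstract describes: partition the images into overlapping subsets, reconstruct each independently with any backbone, then chain the local reconstructions into one global frame with a confidence-weighted similarity (Sim(3)) fit. Every name below (partition_images, model.reconstruct, overlap_pts, overlap_conf) is an assumption, the weighted Umeyama alignment stands in for the paper's global alignment and confidence-weighted bundle adjustment, and the geometric reordering/diversity step is omitted.

```python
# Hypothetical sketch of a MERG3R-style divide-and-conquer pipeline.
# The weighted Umeyama fit is a stand-in for the paper's global
# alignment and confidence-weighted bundle adjustment.
import numpy as np

def partition_images(images, subset_size=64, overlap=16):
    """Split an image list into overlapping subsets (reordering omitted)."""
    step = subset_size - overlap
    return [images[i:i + subset_size]
            for i in range(0, max(len(images) - overlap, 1), step)]

def weighted_umeyama(src, dst, w):
    """Confidence-weighted similarity (Sim(3)) fit mapping src onto dst."""
    w = w / w.sum()
    mu_s, mu_d = (w[:, None] * src).sum(0), (w[:, None] * dst).sum(0)
    xs, xd = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd((w[:, None] * xd).T @ xs)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U) * np.linalg.det(Vt))])
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / (w * (xs ** 2).sum(1)).sum()
    return s, R, mu_d - s * R @ mu_s

def merge_reconstructions(model, images):
    """Reconstruct subsets independently, then chain them into one frame."""
    chunks = [model.reconstruct(s) for s in partition_images(images)]
    s_g, R_g, t_g = 1.0, np.eye(3), np.zeros(3)   # cumulative Sim(3) to global
    merged = [chunks[0]["points"]]
    for prev, cur in zip(chunks, chunks[1:]):
        # Overlapping frames give corresponding points in both local frames.
        s, R, t = weighted_umeyama(cur["overlap_pts"], prev["overlap_pts"],
                                   cur["overlap_conf"])
        t_g = s_g * (R_g @ t) + t_g               # compose cur->prev with
        R_g, s_g = R_g @ R, s_g * s               # prev->global
        merged.append(s_g * cur["points"] @ R_g.T + t_g)
    return np.concatenate(merged, axis=0)
```

A full implementation would refine this chained initialization jointly; that refinement is where the confidence-weighted bundle adjustment comes in.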
Related papers
- LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory [97.14005794889134]
We present LoGeR, a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. This memory architecture enables LoGeR to be trained on sequences of 128 frames and to generalize to thousands of frames during inference.
arXiv Detail & Related papers (2026-03-03T18:55:37Z)
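A toy sketch of the chunked-inference pattern the LoGeR summary describes: bidirectional reasoning inside each chunk, with a running memory state carrying context across chunks. The interfaces below (init_memory, forward_chunk) are assumptions, not LoGeR's actual API.

```python
# Toy sketch of LoGeR-style chunked inference; interfaces are hypothetical.
from typing import Iterable, List

def reconstruct_stream(model, frames: Iterable, chunk_size: int = 128) -> List:
    memory = model.init_memory()       # hybrid memory state (opaque here)
    outputs, chunk = [], []
    for frame in frames:
        chunk.append(frame)
        if len(chunk) == chunk_size:
            # Bidirectional attention inside the chunk; the memory supplies
            # long-range context from all previously processed chunks.
            pred, memory = model.forward_chunk(chunk, memory)
            outputs.append(pred)
            chunk = []
    if chunk:                          # flush the final partial chunk
        pred, memory = model.forward_chunk(chunk, memory)
        outputs.append(pred)
    return outputs
```

Because each forward pass sees only chunk_size frames plus a bounded memory, per-chunk cost stays roughly constant, which is consistent with training on 128-frame sequences and running on thousands of frames.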
- MGP-KAD: Multimodal Geometric Priors and Kolmogorov-Arnold Decoder for Single-View 3D Reconstruction in Complex Scenes [0.3823356975862005]
Single-view 3D reconstruction in complex real-world scenes is challenging due to noise, object diversity, and limited dataset availability. We propose MGP-KAD, a novel multimodal feature fusion framework that integrates RGB and geometric priors to enhance reconstruction accuracy.
arXiv Detail & Related papers (2026-02-05T19:54:30Z)
- S-MUSt3R: Sliding Multi-view 3D Reconstruction [17.018626984951823]
This work proposes S-MUSt3R, a simple and efficient pipeline that extends the limits of foundation models for monocular 3D reconstruction. We show that S-MUSt3R runs successfully on long RGB sequences and produces accurate and consistent 3D reconstructions.
arXiv Detail & Related papers (2026-02-04T13:07:14Z)
- SaLon3R: Structure-aware Long-term Generalizable 3D Reconstruction from Unposed Images [31.94503176488054]
SaLon3R is a novel framework for Structure-aware, Long-term 3DGS Reconstruction. It is capable of reconstructing over 50 views at over 10 FPS, with 50% to 90% redundancy removal. Our approach effectively resolves artifacts and prunes redundant 3DGS in a single feed-forward pass.
arXiv Detail & Related papers (2025-10-16T18:37:10Z)
- DiMeR: Disentangled Mesh Reconstruction Model [29.827345186012558]
DiMeR is a novel geometry-texture disentangled feed-forward model with 3D supervision for sparse-view mesh reconstruction. We streamline the mesh extraction algorithm by eliminating modules with low performance/cost ratios and redesigning regularization losses with 3D supervision. Extensive experiments demonstrate that DiMeR generalises across sparse-view, single-image, and text-to-3D tasks, consistently outperforming baselines.
arXiv Detail & Related papers (2025-04-24T15:39:20Z)
- MUSt3R: Multi-view Network for Stereo 3D Reconstruction [11.61182864709518]
We propose an extension of DUSt3R from pairs to multiple views that addresses all the aforementioned concerns. We equip the model with a multi-layer memory mechanism, which reduces the computational complexity. The framework is designed to perform 3D reconstruction both offline and online, and hence can be seamlessly applied to SfM and visual SLAM scenarios.
arXiv Detail & Related papers (2025-03-03T15:36:07Z)
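As a rough illustration of the multi-layer memory mechanism the MUSt3R summary mentions (the summary gives no internals, so everything here is an assumption): each decoder layer keeps a bank of tokens from past views; tokens of a new view cross-attend to the bank and are then appended to it, so per-view cost grows linearly with the number of views rather than quadratically.

```python
# Toy per-layer token memory; single-head attention without projections,
# purely illustrative and not MUSt3R's actual design.
import torch

class LayerMemory:
    """One token bank per decoder layer; new views read it, then extend it."""
    def __init__(self, num_layers: int):
        self.banks = [None] * num_layers

    def attend_and_append(self, layer: int, tokens: torch.Tensor) -> torch.Tensor:
        bank = self.banks[layer]
        if bank is not None:
            attn = torch.softmax(
                tokens @ bank.transpose(-1, -2) / tokens.shape[-1] ** 0.5, dim=-1)
            tokens = tokens + attn @ bank     # read long-range context
        self.banks[layer] = tokens.detach() if bank is None else torch.cat(
            [bank, tokens.detach()], dim=-2)  # then write the new view in
        return tokens
```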
- Large Spatial Model: End-to-end Unposed Images to Semantic 3D [79.94479633598102]
Large Spatial Model (LSM) processes unposed RGB images directly into semantic radiance fields.
LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation.
It can generate versatile label maps by interacting with language at novel viewpoints.
arXiv Detail & Related papers (2024-10-24T17:54:42Z)
- GeoLRM: Geometry-Aware Large Reconstruction Model for High-Quality 3D Gaussian Generation [65.33726478659304]
We introduce the Geometry-Aware Large Reconstruction Model (GeoLRM), an approach that can predict high-quality assets with 512k Gaussians from 21 input images using only 11 GB of GPU memory.
Previous works neglect the inherent sparsity of 3D structure and do not utilize explicit geometric relationships between 3D and 2D images.
GeoLRM tackles these issues by incorporating a novel 3D-aware transformer structure that directly processes 3D points and uses deformable cross-attention mechanisms.
arXiv Detail & Related papers (2024-06-21T17:49:31Z)
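A simplified sketch of the geometric link the GeoLRM summary points at: 3D query points are projected into each input view and image features are sampled at the projections. Real deformable cross-attention adds learned per-point sampling offsets and attention weights, which are omitted here; all names are hypothetical.

```python
# Sketch of 3D-to-2D feature gathering for point queries; simplified,
# without the learned offsets of true deformable cross-attention.
import torch
import torch.nn.functional as F

def project_points(points, K, T):
    """Project (N,3) world points with intrinsics K (3,3), extrinsics T (4,4)."""
    cam = (T[:3, :3] @ points.T + T[:3, 3:4]).T   # world -> camera frame
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)   # perspective divide
    return uv, cam[:, 2]                          # pixel coords, depth

def sample_view_features(points, feats, K, T, image_size):
    """Bilinearly sample a (C,H,W) feature map at projected 3D points."""
    uv, depth = project_points(points, K, T)
    H, W = image_size
    grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(feats[None], grid[None, :, None, :],
                            align_corners=True)   # (1, C, N, 1)
    sampled = sampled[0, :, :, 0].T               # (N, C)
    sampled[depth <= 0] = 0                       # mask points behind camera
    return sampled
```

Gathering features only where points project is one way to exploit the sparsity of 3D structure instead of attending densely over all image tokens.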
- GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation [85.15374487533643]
We introduce GRM, a large-scale reconstructor capable of recovering a 3D asset from sparse-view images in around 0.1s.
GRM is a feed-forward transformer-based model that efficiently incorporates multi-view information.
We also showcase the potential of GRM in generative tasks, i.e., text-to-3D and image-to-3D, by integrating it with existing multi-view diffusion models.
arXiv Detail & Related papers (2024-03-21T17:59:34Z)
- Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability [118.26563926533517]
Auto-regressive models have achieved impressive results in 2D image generation by modeling joint distributions in grid space.
We extend auto-regressive models to 3D domains, seeking stronger 3D shape generation by improving the capacity and scalability of auto-regressive models simultaneously.
arXiv Detail & Related papers (2024-02-19T15:33:09Z)
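The grid-space modeling in this entry is the standard auto-regressive factorization; written for a 3D grid of shape tokens x_1, ..., x_N flattened into a fixed scan order, it is:

```latex
p(x_1, \dots, x_N) = \prod_{i=1}^{N} p\left(x_i \mid x_1, \dots, x_{i-1}\right)
```

Improving capacity and scalability then amounts to making each conditional more expressive and cheaper to evaluate as N grows with 3D resolution.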
- GLEAM: Greedy Learning for Large-Scale Accelerated MRI Reconstruction [50.248694764703714]
Unrolled neural networks have recently achieved state-of-the-art accelerated MRI reconstruction.
These networks unroll iterative optimization algorithms by alternating between physics-based consistency and neural-network based regularization.
We propose Greedy LEarning for Accelerated MRI reconstruction, an efficient training strategy for high-dimensional imaging settings.
arXiv Detail & Related papers (2022-07-18T06:01:29Z)
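A hedged sketch of the unrolled alternation the GLEAM entry describes: each unroll applies a physics-based data-consistency step followed by a learned regularizer. The greedy aspect (updating one unroll's parameters at a time so training memory stays constant in depth) is indicated in comments; all names and details are assumptions, not GLEAM's exact scheme.

```python
# Sketch of an unrolled accelerated-MRI reconstruction; hypothetical names.
# Practical networks split complex data into two real channels; omitted here.
import torch

def data_consistency(x, y, mask):
    """Re-impose acquired k-space samples y (complex) where mask is set."""
    k = torch.fft.fft2(x)
    k = torch.where(mask.bool(), y, k)
    return torch.fft.ifft2(k)

def unrolled_recon(y, mask, regularizers):
    """Alternate physics-based consistency and learned regularization."""
    x = torch.fft.ifft2(y)              # zero-filled initialization
    for reg in regularizers:            # one denoiser module per unroll
        x = data_consistency(x, y, mask)
        x = x - reg(x)                  # residual regularization step
        # Greedy training would compute a loss here, update only `reg`,
        # then detach x so memory does not grow with the number of unrolls.
    return data_consistency(x, y, mask)
```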
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.