SAIL-Recon: Large SfM by Augmenting Scene Regression with Localization
- URL: http://arxiv.org/abs/2508.17972v1
- Date: Mon, 25 Aug 2025 12:38:26 GMT
- Title: SAIL-Recon: Large SfM by Augmenting Scene Regression with Localization
- Authors: Junyuan Deng, Heng Li, Tao Xie, Weiqiang Ren, Qian Zhang, Ping Tan, Xiaoyang Guo
- Abstract summary: We introduce SAIL-Recon, a feed-forward Transformer for large scale SfM. Our method first computes a neural scene representation from a subset of anchor images. The regression network is then fine-tuned to reconstruct all input images conditioned on this neural scene representation.
- Score: 33.31942454376888
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scene regression methods, such as VGGT, solve the Structure-from-Motion (SfM) problem by directly regressing camera poses and 3D scene structures from input images. They demonstrate impressive performance in handling images under extreme viewpoint changes. However, these methods struggle to handle a large number of input images. To address this problem, we introduce SAIL-Recon, a feed-forward Transformer for large scale SfM, by augmenting the scene regression network with visual localization capabilities. Specifically, our method first computes a neural scene representation from a subset of anchor images. The regression network is then fine-tuned to reconstruct all input images conditioned on this neural scene representation. Comprehensive experiments show that our method not only scales efficiently to large-scale scenes, but also achieves state-of-the-art results on both camera pose estimation and novel view synthesis benchmarks, including TUM-RGBD, CO3Dv2, and Tanks & Temples. We will publish our model and code. Code and models are publicly available at: https://hkust-sail.github.io/sail-recon/.
Related papers
- UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction [73.29048162438797]
We introduce UniVerse, a unified framework for robust reconstruction based on a video diffusion model. Specifically, UniVerse first converts inconsistent images into initial videos, then uses a specially designed video diffusion model to restore them into consistent images. Experiments on both synthetic and real-world datasets demonstrate the strong generalization capability and superior performance of our method in robust reconstruction.
arXiv Detail & Related papers (2025-10-02T04:50:18Z)
- ZeroGS: Training 3D Gaussian Splatting from Unposed Images [62.34149221132978]
We propose ZeroGS to train 3DGS from hundreds of unposed and unordered images.
Our method leverages a pretrained foundation model as the neural scene representation.
Our method recovers more accurate camera poses than state-of-the-art pose-free NeRF/3DGS methods.
arXiv Detail & Related papers (2024-11-24T11:20:48Z)
- No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images [100.80376573969045]
NoPoSplat is a feed-forward model capable of reconstructing 3D scenes parameterized by 3D Gaussians from multi-view images.
Our model achieves real-time 3D Gaussian reconstruction during inference.
This work makes significant advances in pose-free generalizable 3D reconstruction and demonstrates its applicability to real-world scenarios.
arXiv Detail & Related papers (2024-10-31T17:58:22Z)
- GLACE: Global Local Accelerated Coordinate Encoding [66.87005863868181]
Scene coordinate regression methods are effective in small-scale scenes but face significant challenges in large-scale scenes.
We propose GLACE, which integrates pre-trained global and local encodings and enables SCR to scale to large scenes with only a single small-sized network.
Our method achieves state-of-the-art results on large-scale scenes with a low-map-size model.
arXiv Detail & Related papers (2024-06-06T17:59:50Z)
- Scene Coordinate Reconstruction: Posing of Image Collections via Incremental Learning of a Relocalizer [21.832249148699397]
We address the task of estimating camera parameters from a set of images depicting a scene.
We show that scene coordinate regression, a learning-based relocalization approach, allows us to build implicit, neural scene representations from unposed images.
arXiv Detail & Related papers (2024-04-22T17:02:33Z)
- 3D Reconstruction with Generalizable Neural Fields using Scene Priors [71.37871576124789]
We introduce training generalizable Neural Fields incorporating scene Priors (NFPs).
The NFP network maps any single-view RGB-D image into signed distance and radiance values.
A complete scene can be reconstructed by merging individual frames in the volumetric space without a fusion module.
arXiv Detail & Related papers (2023-09-26T18:01:02Z)
- SACReg: Scene-Agnostic Coordinate Regression for Visual Localization [16.866303169903237]
We propose a generalized SCR model that is trained once and then deployed in new test scenes, regardless of their scale, without any finetuning.
Instead of encoding the scene coordinates into the network weights, our model takes as input a database image with some sparse 2D pixel to 3D coordinate annotations.
We show that the database representation of images and their 2D-3D annotations can be highly compressed with negligible loss of localization performance.
arXiv Detail & Related papers (2023-07-21T16:56:36Z)
- RUST: Latent Neural Scene Representations from Unposed Imagery [21.433079925439234]
Inferring structure of 3D scenes from 2D observations is a fundamental challenge in computer vision.
Recently popularized approaches based on neural scene representations have achieved tremendous impact.
RUST (Really Unposed Scene representation Transformer) is a pose-free approach to novel view synthesis trained on RGB images alone.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- One-Shot Neural Fields for 3D Object Understanding [112.32255680399399]
We present a unified and compact scene representation for robotics.
Each object in the scene is depicted by a latent code capturing geometry and appearance.
This representation can be decoded for various tasks such as novel view rendering, 3D reconstruction, and stable grasp prediction.
arXiv Detail & Related papers (2022-10-21T17:33:14Z)
- ViewFormer: NeRF-free Neural Rendering from Few Images Using Transformers [34.4824364161812]
Novel view synthesis is a problem where we are given only a few context views sparsely covering a scene or an object.
The goal is to predict novel viewpoints in the scene, which requires learning priors.
We propose a 2D-only method that maps multiple context views and a query pose to a new image in a single pass of a neural network.
arXiv Detail & Related papers (2022-03-18T21:08:23Z)