FUSER: Feed-Forward MUltiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement
- URL: http://arxiv.org/abs/2512.09373v1
- Date: Wed, 10 Dec 2025 07:11:22 GMT
- Title: FUSER: Feed-Forward MUltiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement
- Authors: Haobo Jiang, Jin Xie, Jian Yang, Liang Yu, Jianmin Zheng
- Abstract summary: FUSER is the first feed-forward multiview registration transformer that processes all scans in a unified, compact latent space. FUSER predicts global poses without any pairwise estimation. Experiments on 3DMatch, ScanNet and ARKitScenes demonstrate that the approach achieves superior registration accuracy and outstanding computational efficiency.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Registration of multiview point clouds conventionally relies on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and inherently ill-posed without holistic geometric constraints. This paper proposes FUSER, the first feed-forward multiview registration transformer that jointly processes all scans in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER encodes each scan into low-resolution superpoint features via a sparse 3D CNN that preserves absolute translation cues, and performs efficient intra- and inter-scan reasoning through a Geometric Alternating Attention module. In particular, we transfer 2D attention priors from off-the-shelf foundation models to enhance 3D feature interaction and geometric consistency. Building upon FUSER, we further introduce FUSER-DF, an SE(3)$^N$ diffusion refinement framework that corrects FUSER's estimates via denoising in the joint SE(3)$^N$ space. FUSER acts as a surrogate multiview registration model to construct the denoiser, and a prior-conditioned SE(3)$^N$ variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch, ScanNet and ARKitScenes demonstrate that our approach achieves superior registration accuracy and outstanding computational efficiency.
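The abstract describes the Geometric Alternating Attention module only at a high level. As an illustration of the alternating intra-scan / inter-scan idea, a minimal NumPy sketch might look as follows; the function names, residual structure, and block count are assumptions for exposition, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the token dimension.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def alternating_attention(scan_feats, n_blocks=2):
    """scan_feats: list of (tokens_i, dim) arrays, one per scan.
    Alternates intra-scan attention (within each scan) with
    inter-scan attention (over all scans jointly)."""
    feats = [f.copy() for f in scan_feats]
    for _ in range(n_blocks):
        # Intra-scan: each scan's superpoints attend to themselves.
        feats = [f + attention(f, f, f) for f in feats]
        # Inter-scan: all superpoints attend across scans jointly.
        joint = np.concatenate(feats, axis=0)
        joint = joint + attention(joint, joint, joint)
        sizes = np.cumsum([f.shape[0] for f in feats])[:-1]
        feats = np.split(joint, sizes, axis=0)
    return feats
```

Alternating the two attention scopes keeps the joint step over all scans, which is what allows global poses to be predicted without pairwise matching.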
Related papers
- ProFuse: Efficient Cross-View Context Fusion for Open-Vocabulary 3D Gaussian Splatting [1.1470070927586018]
ProFuse is an efficient context-aware framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). The pipeline enhances cross-view consistency and intra-mask cohesion within a direct registration setup. ProFuse achieves strong open-vocabulary 3DGS understanding while completing semantic attachment in about five minutes per scene.
arXiv Detail & Related papers (2026-01-08T09:20:46Z)
- Joint Semantic and Rendering Enhancements in 3D Gaussian Modeling with Anisotropic Local Encoding [86.55824709875598]
We propose a joint enhancement framework for 3D semantic Gaussian modeling that synergizes both semantic and rendering branches. Unlike conventional point cloud shape encoding, we introduce an anisotropic 3D Gaussian Chebyshev descriptor to capture fine-grained 3D shape details. We employ a cross-scene knowledge transfer module to continuously update learned shape patterns, enabling faster convergence and robust representations.
arXiv Detail & Related papers (2026-01-05T18:33:50Z)
- econSG: Efficient and Multi-view Consistent Open-Vocabulary 3D Semantic Gaussians [56.85804719947]
We propose econSG for open-vocabulary semantic segmentation with 3DGS. Our econSG shows state-of-the-art performance on four benchmark datasets compared to existing methods.
arXiv Detail & Related papers (2025-04-08T13:12:31Z)
- Diff-Reg v2: Diffusion-Based Matching Matrix Estimation for Image Matching and 3D Registration [44.88739897482003]
We introduce an innovative paradigm that leverages a diffusion model in matrix space for robust matching matrix estimation. Specifically, we apply the diffusion model in the doubly stochastic matrix space for 3D-3D and 2D-3D registration tasks. For all three registration tasks, we provide adaptive matching matrix embedding implementations tailored to the specific characteristics of each task.
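The abstract refers to diffusion in the doubly stochastic matrix space. A standard way to project a raw score matrix onto (approximately) doubly stochastic form is Sinkhorn normalization; the NumPy sketch below illustrates that projection only and is an assumption for exposition, not code from the paper:

```python
import numpy as np

def sinkhorn(logits, n_iters=50):
    """Project a square score matrix onto (approximately) doubly
    stochastic form by alternating row and column normalisation
    in log space for numerical stability."""
    log_p = logits.astype(float).copy()
    for _ in range(n_iters):
        # Normalise rows, then columns, in log space.
        log_p -= np.log(np.exp(log_p).sum(axis=1, keepdims=True))
        log_p -= np.log(np.exp(log_p).sum(axis=0, keepdims=True))
    return np.exp(log_p)
```

After enough iterations every row and column sums to approximately one, which is the constraint set a matching-matrix diffusion process would operate in.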
arXiv Detail & Related papers (2025-03-06T06:13:27Z)
- DiHuR: Diffusion-Guided Generalizable Human Reconstruction [51.31232435994026]
We introduce DiHuR, a Diffusion-guided model for generalizable Human 3D Reconstruction and view synthesis from sparse, minimally overlapping images. Our method integrates two key priors in a coherent manner: the prior from generalizable feed-forward models and the 2D diffusion prior, and it requires only multi-view image training, without 3D supervision.
arXiv Detail & Related papers (2024-11-16T03:52:23Z)
- S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
- SE(3) Diffusion Model-based Point Cloud Registration for Robust 6D Object Pose Estimation [66.16525145765604]
We introduce an SE(3) diffusion model-based point cloud registration framework for 6D object pose estimation in real-world scenarios.
Our approach formulates the 3D registration task as a denoising diffusion process, which progressively refines the pose of the source point cloud.
Experiments demonstrate that our diffusion registration framework presents outstanding pose estimation performance on the real-world TUD-L, LINEMOD, and Occluded-LINEMOD datasets.
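The abstract describes progressively refining the source pose through a denoising process on SE(3). As a hedged illustration only, one reverse step can be sketched as applying a predicted se(3) increment through the exponential map; the function names, the left-multiplicative update, and the step size are assumptions, not the paper's formulation:

```python
import numpy as np

def so3_exp(omega):
    """Rodrigues' formula: axis-angle vector -> rotation matrix."""
    theta = np.linalg.norm(omega)
    if theta < 1e-8:
        return np.eye(3)
    k = omega / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def denoise_step(R, t, d_omega, d_t, step=1.0):
    """One illustrative reverse step: left-multiply the predicted
    rotation increment onto R and add the translation increment."""
    return so3_exp(step * d_omega) @ R, t + step * d_t
```

Because the increment passes through the exponential map, every iterate stays a valid rotation, which is the point of diffusing on SE(3) rather than on raw matrix entries.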
arXiv Detail & Related papers (2023-10-26T12:47:26Z)
- DPCN++: Differentiable Phase Correlation Network for Versatile Pose Registration [18.60311260250232]
We present a differentiable phase correlation solver that is globally convergent and correspondence-free.
We evaluate DPCN++ on a wide range of registration tasks taking different input modalities, including 2D bird's-eye view images, 3D object and scene measurements, and medical images.
arXiv Detail & Related papers (2022-06-12T10:00:34Z)
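For readers unfamiliar with phase correlation, the classic (non-differentiable) version estimates a translation from the normalised cross-power spectrum. The NumPy sketch below shows that underlying principle only; it is not DPCN++'s differentiable solver:

```python
import numpy as np

def phase_correlation(a, b):
    """Estimate the integer (dy, dx) such that b ~ np.roll(a, (dy, dx))
    via the normalised cross-power spectrum."""
    Fa, Fb = np.fft.fft2(a), np.fft.fft2(b)
    cross = np.conj(Fa) * Fb
    cross /= np.abs(cross) + 1e-12       # keep only phase information
    corr = np.fft.ifft2(cross).real      # delta-like peak at the shift
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap shifts into the signed range [-N/2, N/2).
    h, w = a.shape
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)
```

Normalising away the magnitude spectrum is what makes the peak location depend only on the relative shift, which is why the method is correspondence-free.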
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.