H3R: Hybrid Multi-view Correspondence for Generalizable 3D Reconstruction
- URL: http://arxiv.org/abs/2508.03118v1
- Date: Tue, 05 Aug 2025 05:56:30 GMT
- Title: H3R: Hybrid Multi-view Correspondence for Generalizable 3D Reconstruction
- Authors: Heng Jia, Linchao Zhu, Na Zhao
- Abstract summary: H3R is a hybrid framework that integrates latent fusion with attention-based feature aggregation. By integrating both paradigms, our approach enhances generalization while converging 2$\times$ faster than existing methods. Our method supports variable-number and high-resolution input views while demonstrating robust cross-dataset generalization.
- Score: 39.22287224290769
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent advances in feed-forward 3D Gaussian Splatting, generalizable 3D reconstruction remains challenging, particularly in multi-view correspondence modeling. Existing approaches face a fundamental trade-off: explicit methods achieve geometric precision but struggle with ambiguous regions, while implicit methods provide robustness but suffer from slow convergence. We present H3R, a hybrid framework that addresses this limitation by integrating volumetric latent fusion with attention-based feature aggregation. Our framework consists of two complementary components: an efficient latent volume that enforces geometric consistency through epipolar constraints, and a camera-aware Transformer that leverages Plücker coordinates for adaptive correspondence refinement. By integrating both paradigms, our approach enhances generalization while converging 2$\times$ faster than existing methods. Furthermore, we show that spatial-aligned foundation models (e.g., SD-VAE) substantially outperform semantic-aligned models (e.g., DINOv2), resolving the mismatch between semantic representations and spatial reconstruction requirements. Our method supports variable-number and high-resolution input views while demonstrating robust cross-dataset generalization. Extensive experiments show that our method achieves state-of-the-art performance across multiple benchmarks, with significant PSNR improvements of 0.59 dB, 1.06 dB, and 0.22 dB on the RealEstate10K, ACID, and DTU datasets, respectively. Code is available at https://github.com/JiaHeng-DLUT/H3R.
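The camera-aware Transformer conditions on per-pixel Plücker ray coordinates. For reference, the sketch below shows the standard way such coordinates are computed from camera intrinsics and extrinsics (unit direction plus moment m = o × d); the function and tensor names are illustrative and do not come from the H3R codebase.

```python
import torch

def plucker_rays(K, c2w, H, W):
    """Per-pixel Plucker coordinates (d, m) for a pinhole camera.

    A ray from origin o with unit direction d has Plucker coordinates
    (d, m) with moment m = o x d. Illustrative sketch; names are not
    from the H3R codebase.

    K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world pose.
    Returns an (H, W, 6) tensor of [direction | moment].
    """
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    # Pixel centers in homogeneous image coordinates.
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)
    dirs = pix @ torch.linalg.inv(K).T        # back-project to camera space
    dirs = dirs @ c2w[:3, :3].T               # rotate into world space
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)       # camera center for every pixel
    moment = torch.cross(origin, dirs, dim=-1)
    return torch.cat([dirs, moment], dim=-1)  # (H, W, 6)
```

The resulting six-channel map is the usual form in which Plücker coordinates enter camera-aware attention: concatenated with (or added to) image features as a pose embedding.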
Related papers
- Dens3R: A Foundation Model for 3D Geometry Prediction [44.13431776180547]
Dens3R is a 3D foundation model designed for joint geometric dense prediction. By integrating image-pair matching features with intrinsic invariance modeling, Dens3R accurately regresses multiple geometric quantities.
arXiv Detail & Related papers (2025-07-22T07:22:30Z) - Hi^2-GSLoc: Dual-Hierarchical Gaussian-Specific Visual Relocalization for Remote Sensing [6.997091164331322]
Visual relocalization is fundamental to remote sensing and UAV applications. Existing methods face inherent trade-offs: image-based retrieval and pose regression approaches lack precision. We introduce $\mathrm{Hi}^2$-GSLoc, a dual-hierarchical relocalization framework that follows a sparse-to-dense and coarse-to-fine paradigm.
arXiv Detail & Related papers (2025-07-21T14:47:56Z) - Evolving High-Quality Rendering and Reconstruction in a Unified Framework with Contribution-Adaptive Regularization [27.509109317973817]
3D Gaussian Splatting (3DGS) has garnered significant attention for its high-quality rendering and fast inference speed. Previous methods primarily focus on geometry regularization, with common approaches including primitive-based and dual-model frameworks. We propose CarGS, a unified model leveraging contribution-adaptive regularization to achieve simultaneous high-quality rendering and surface reconstruction.
arXiv Detail & Related papers (2025-03-02T12:51:38Z) - Implicit Gaussian Splatting with Efficient Multi-Level Tri-Plane Representation [45.582869951581785]
Implicit Gaussian Splatting (IGS) is an innovative hybrid model that integrates explicit point clouds with implicit feature embeddings.
We introduce a level-based progressive training scheme, which incorporates explicit spatial regularization.
Our algorithm can deliver high-quality rendering using only a few MBs, effectively balancing storage efficiency and rendering fidelity.
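For context, the multi-level tri-plane representation named in the title answers 3D queries by projecting each point onto three axis-aligned feature planes and aggregating the sampled features. The sketch below shows this generic single-level lookup, assuming coordinates already normalized to [-1, 1]; it illustrates the technique, not the IGS implementation.

```python
import torch
import torch.nn.functional as F

def query_triplane(planes, pts):
    """Sample features for 3D points from one level of a tri-plane grid.

    planes: (3, C, R, R) feature maps for the XY, XZ, and YZ planes.
    pts:    (N, 3) points with coordinates in [-1, 1].
    Returns (N, 3*C) concatenated plane features. Generic sketch,
    not the IGS code.
    """
    # Project each point onto the three axis-aligned planes.
    coords = torch.stack(
        [pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]], dim=0
    )                                                   # (3, N, 2)
    grid = coords.unsqueeze(1)                          # (3, 1, N, 2)
    feats = F.grid_sample(
        planes, grid, mode="bilinear", align_corners=True
    )                                                   # (3, C, 1, N)
    return feats.squeeze(2).permute(2, 0, 1).reshape(pts.shape[0], -1)
```

A multi-level variant would repeat this lookup over planes of increasing resolution and concatenate or sum the results, which is one plausible reading of the paper's level-based progressive training scheme.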
arXiv Detail & Related papers (2024-08-19T14:34:17Z) - UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation.
It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z) - latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction [48.86083272054711]
latentSplat is a method to predict semantic Gaussians in a 3D latent space that can be splatted and decoded by a light-weight generative 2D architecture.
We show that latentSplat outperforms previous works in reconstruction quality and generalization, while being fast and scalable to high-resolution data.
arXiv Detail & Related papers (2024-03-24T20:48:36Z) - S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial for enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z) - SD-MVS: Segmentation-Driven Deformation Multi-View Stereo with Spherical Refinement and EM optimization [6.886220026399106]
We introduce Segmentation-Driven Deformation Multi-View Stereo (SD-MVS) to tackle challenges in the 3D reconstruction of textureless areas.
We are the first to adopt the Segment Anything Model (SAM) to distinguish semantic instances in scenes.
We propose a unique refinement strategy that combines spherical coordinates with gradient descent on normals and a pixelwise search interval on depths (the spherical parameterization is sketched below).
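A standard way to run gradient descent on normals is to parameterize each unit normal by its spherical angles, so every gradient step stays on the unit sphere by construction. The sketch below illustrates the idea; it is a generic illustration, not the SD-MVS refinement, and `photometric_cost` is a hypothetical placeholder for a real multi-view matching cost.

```python
import torch

def normal_from_angles(theta, phi):
    """Unit normal from spherical angles; optimizing (theta, phi)
    keeps the normal on the unit sphere by construction."""
    return torch.stack([
        torch.sin(theta) * torch.cos(phi),
        torch.sin(theta) * torch.sin(phi),
        torch.cos(theta),
    ], dim=-1)

# Hypothetical stand-in for a real multi-view matching cost:
# penalize deviation from some target normal.
target = torch.tensor([0.0, 0.0, 1.0])

def photometric_cost(n):
    return 1.0 - torch.dot(n, target)

theta = torch.tensor(1.0, requires_grad=True)
phi = torch.tensor(0.5, requires_grad=True)
opt = torch.optim.SGD([theta, phi], lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    loss = photometric_cost(normal_from_angles(theta, phi))
    loss.backward()
    opt.step()
```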
arXiv Detail & Related papers (2024-01-12T05:25:57Z) - Similarity-Aware Fusion Network for 3D Semantic Segmentation [87.51314162700315]
We propose a similarity-aware fusion network (SAFNet) to adaptively fuse 2D images and 3D point clouds for 3D semantic segmentation.
We employ a late fusion strategy where we first learn the geometric and contextual similarities between the input point clouds and the point clouds back-projected from 2D pixels (a simplified similarity-weighted fusion is sketched below).
We show that SAFNet significantly outperforms existing state-of-the-art fusion-based approaches across various levels of data integrity.
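As an illustration of the general idea, the sketch below fuses per-point 2D and 3D features with a weight derived from the distance between each input point and its back-projection; the fixed exponential weighting is a simplified stand-in for SAFNet's learned similarity.

```python
import torch

def similarity_weighted_fusion(feat_2d, feat_3d, pts_in, pts_back):
    """Fuse per-point 2D and 3D features by geometric similarity.

    feat_2d, feat_3d: (N, C) per-point features from each branch.
    pts_in:   (N, 3) input point cloud.
    pts_back: (N, 3) points back-projected from 2D pixels.
    Simplified sketch, not the learned SAFNet module.
    """
    # Close back-projections suggest the 2D branch is reliable here.
    dist = (pts_in - pts_back).norm(dim=-1, keepdim=True)  # (N, 1)
    w = torch.exp(-dist)                                   # in (0, 1]
    return w * feat_2d + (1.0 - w) * feat_3d
```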
arXiv Detail & Related papers (2021-07-04T09:28:18Z) - Dense Hybrid Recurrent Multi-view Stereo Net with Dynamic Consistency Checking [54.58791377183574]
Our novel hybrid recurrent multi-view stereo net consists of two core modules: 1) a light DRENet (Dense Reception Expanded) module to extract dense feature maps of original size with multi-scale context information, and 2) a HU-LSTM (Hybrid U-LSTM) to regularize the 3D matching volume into a predicted depth map (the recurrent building block is sketched below).
Our method exhibits performance competitive with the state-of-the-art method while dramatically reducing memory consumption, requiring only 19.4% of the memory of R-MVSNet.
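The recurrent regularization behind such designs can be pictured as a convolutional LSTM swept over the depth slices of the matching volume, carrying hidden state so each slice is regularized with context from previous depths. The sketch below shows this generic building block with illustrative shapes; it is not the paper's HU-LSTM variant.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Generic convolutional LSTM cell, the standard building block
    behind recurrent cost-volume regularization (not HU-LSTM)."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution produces all four gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

# Sweep a toy matching volume depth slice by depth slice.
B, C, D, H, W = 1, 8, 4, 16, 16
volume = torch.randn(B, C, D, H, W)
cell = ConvLSTMCell(C, 16)
h = volume.new_zeros(B, 16, H, W)
c = volume.new_zeros(B, 16, H, W)
for d in range(D):
    h, c = cell(volume[:, :, d], (h, c))  # h: regularized slice features
```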
arXiv Detail & Related papers (2020-07-21T14:59:59Z) - Dense Non-Rigid Structure from Motion: A Manifold Viewpoint [162.88686222340962]
The Non-Rigid Structure-from-Motion (NRSfM) problem aims to recover the 3D geometry of a deforming object from its 2D feature correspondences across multiple frames.
We show that our approach significantly improves accuracy, scalability, and robustness against noise.
arXiv Detail & Related papers (2020-06-15T09:15:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.