Any Resolution Any Geometry: From Multi-View To Multi-Patch
- URL: http://arxiv.org/abs/2603.03026v1
- Date: Tue, 03 Mar 2026 14:18:24 GMT
- Title: Any Resolution Any Geometry: From Multi-View To Multi-Patch
- Authors: Wenqing Cui, Zhenyu Li, Mykola Lavreniuk, Jian Shi, Ramzi Idoughi, Xiangjun Tang, Peter Wonka
- Abstract summary: The Ultra Resolution Geometry Transformer (URGT) adapts the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer for monocular high-resolution depth--normal estimation. A single high-resolution image is partitioned into patches that are augmented with coarse depth and normal priors from pre-trained models, and jointly processed in a single forward pass to predict refined geometric outputs. The method achieves state-of-the-art results on UnrealStereo4K, jointly improving depth and normal estimation, reducing AbsRel from 0.0582 to 0.0291 and RMSE from 2.17 to 1.31.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Joint estimation of surface normals and depth is essential for holistic 3D scene understanding, yet high-resolution prediction remains difficult due to the trade-off between preserving fine local detail and maintaining global consistency. To address this challenge, we propose the Ultra Resolution Geometry Transformer (URGT), which adapts the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer for monocular high-resolution depth--normal estimation. A single high-resolution image is partitioned into patches that are augmented with coarse depth and normal priors from pre-trained models, and jointly processed in a single forward pass to predict refined geometric outputs. Global coherence is enforced through cross-patch attention, which enables long-range geometric reasoning and seamless propagation of information across patches within a shared backbone. To further enhance spatial robustness, we introduce a GridMix patch sampling strategy that probabilistically samples grid configurations during training, improving inter-patch consistency and generalization. Our method achieves state-of-the-art results on UnrealStereo4K, jointly improving depth and normal estimation, reducing AbsRel from 0.0582 to 0.0291, RMSE from 2.17 to 1.31, and lowering mean angular error from 23.36 degrees to 18.51 degrees, while producing sharper and more stable geometry. The proposed multi-patch framework also demonstrates strong zero-shot and cross-domain generalization and scales effectively to very high resolutions, offering an efficient and extensible solution for high-quality geometry refinement.
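The abstract describes partitioning a single high-resolution image into a grid of patches, with the grid configuration sampled probabilistically during training (the GridMix strategy). The sketch below illustrates that idea only; the grid options, sampling weights, and function name are illustrative assumptions, not the paper's actual configuration.

```python
import random

import numpy as np


def gridmix_partition(image, grid_options=((2, 2), (3, 3), (4, 4)),
                      probs=(0.5, 0.3, 0.2)):
    """Split an H x W x C image into a randomly sampled grid of patches.

    The grid shape (rows, cols) is drawn from `grid_options` with the
    given probabilities, mimicking a GridMix-style sampling scheme.
    """
    rows, cols = random.choices(grid_options, weights=probs, k=1)[0]
    h, w = image.shape[:2]
    ph, pw = h // rows, w // cols
    patches = [
        image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
        for r in range(rows)
        for c in range(cols)
    ]
    return (rows, cols), patches


# Toy usage: a blank 512x512 RGB image.
img = np.zeros((512, 512, 3), dtype=np.float32)
(rows, cols), patches = gridmix_partition(img)
print(rows, cols, len(patches))
```

In the paper, each patch would additionally be augmented with coarse depth and normal priors before being processed jointly by the shared transformer backbone; that stage is omitted here.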
Related papers
- UrbanGS: A Scalable and Efficient Architecture for Geometrically Accurate Large-Scene Reconstruction [30.233380858586454]
UrbanGS is a scalable reconstruction framework for city-scale applications. It addresses issues of geometric consistency, memory efficiency, and computational scalability. Experiments on multiple urban datasets demonstrate that UrbanGS achieves superior performance in rendering quality, geometric accuracy, and memory efficiency.
arXiv Detail & Related papers (2026-02-02T13:35:16Z) - COMPOSE: Hypergraph Cover Optimization for Multi-view 3D Human Pose Estimation [58.47973015036709]
3D pose estimation from sparse multi-views is a critical task for action recognition, sports analysis, and human-robot interaction. We propose COMPOSE, a novel framework that formulates multi-view pose correspondence matching as a hypergraph problem. COMPOSE achieves improvements of up to 23% in average precision over previous optimization-based methods and up to 11% over self-supervised end-to-end learned methods.
arXiv Detail & Related papers (2026-01-14T18:50:17Z) - MeshMosaic: Scaling Artist Mesh Generation via Local-to-Global Assembly [62.48017648785026]
We introduce MeshMosaic, a novel local-to-global framework for artist mesh generation that scales to over 100K triangles. We show that MeshMosaic significantly outperforms state-of-the-art methods in both geometric fidelity and user preference.
arXiv Detail & Related papers (2025-09-24T11:02:03Z) - Dens3R: A Foundation Model for 3D Geometry Prediction [44.13431776180547]
Dens3R is a 3D foundation model designed for joint geometric dense prediction. By integrating image-pair matching features with intrinsic invariance modeling, Dens3R accurately regresses multiple geometric quantities.
arXiv Detail & Related papers (2025-07-22T07:22:30Z) - DVP-MVS++: Synergize Depth-Normal-Edge and Harmonized Visibility Prior for Multi-View Stereo [7.544716770845737]
We propose DVP-MVS++, an innovative approach that synergizes both depth-normal-edge aligned and harmonized cross-view priors for robust and visibility-aware patch deformation. Evaluation results on the ETH3D, Tanks & Temples, and Strecha datasets exhibit the state-of-the-art performance and robust generalization capability of our proposed method.
arXiv Detail & Related papers (2025-06-16T08:15:22Z) - SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling [79.56581753856452]
SparseFlex is a novel sparse-structured isosurface representation that enables differentiable mesh reconstruction at resolutions up to $1024^3$ directly from rendering losses. By enabling high-resolution, differentiable mesh reconstruction and generation with rendering losses, SparseFlex significantly advances the state-of-the-art in 3D shape representation and modeling.
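A quick back-of-the-envelope calculation shows why a sparse structure matters at this resolution: a dense voxel grid at $1024^3$ with one float32 value per voxel already costs 4 GiB, while storing only voxels near the surface is far cheaper. The 1% occupancy figure below is an illustrative assumption, not a number from the paper.

```python
# Rough memory estimate: dense vs. sparse voxel grid at 1024^3.
res = 1024
bytes_per_voxel = 4  # one float32 scalar per voxel

dense_bytes = res ** 3 * bytes_per_voxel   # full grid: 4 GiB
occupancy = 0.01                           # assume ~1% of voxels are near the surface
sparse_bytes = int(dense_bytes * occupancy)

print(f"dense:  {dense_bytes / 2**30:.1f} GiB")
print(f"sparse: {sparse_bytes / 2**20:.0f} MiB")
```

Under this assumption the sparse representation is two orders of magnitude smaller, which is what makes high-resolution differentiable reconstruction tractable.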
arXiv Detail & Related papers (2025-03-27T17:46:42Z) - MSP-MVS: Multi-Granularity Segmentation Prior Guided Multi-View Stereo [8.303396507129266]
MSP-MVS is a method that introduces multi-granularity segmentation priors to guide edge-confined patch deformation. We implement equidistribution and disassemble-clustering of correlative reliable pixels. We also introduce disparity-sampling synergistic 3D optimization to help identify global-minimum matching costs.
arXiv Detail & Related papers (2024-07-27T19:00:44Z) - SGFormer: Spherical Geometry Transformer for 360 Depth Estimation [52.23806040289676]
Panoramic distortion poses a significant challenge in 360 depth estimation. We propose a spherical geometry transformer, named SGFormer, to address the above issues. We also present a query-based global conditional position embedding to compensate for spatial structure at varying resolutions.
arXiv Detail & Related papers (2024-04-23T12:36:24Z) - 360 Layout Estimation via Orthogonal Planes Disentanglement and Multi-view Geometric Consistency Perception [56.84921040837699]
Existing panoramic layout estimation solutions tend to recover room boundaries from a vertically compressed sequence, yielding imprecise results.
We propose an orthogonal plane disentanglement network (termed DOPNet) to distinguish ambiguous semantics.
We also present an unsupervised adaptation technique tailored for horizon-depth and ratio representations.
Our solution outperforms other SoTA models on both monocular layout estimation and multi-view layout estimation tasks.
arXiv Detail & Related papers (2023-12-26T12:16:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.