MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts
- URL: http://arxiv.org/abs/2510.27234v1
- Date: Fri, 31 Oct 2025 06:54:27 GMT
- Title: MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts
- Authors: Jingnan Gao, Zhe Wang, Xianze Fang, Xingyu Ren, Zhuo Chen, Shengqi Liu, Yuhao Cheng, Jiangjing Lyu, Xiaokang Yang, Yichao Yan
- Abstract summary: MoRE is a dense 3D visual foundation model based on a Mixture-of-Experts (MoE) architecture. MoRE incorporates a confidence-based depth refinement module that stabilizes and refines geometric estimation. It integrates dense semantic features with globally aligned 3D backbone representations for high-fidelity surface normal prediction.
- Score: 50.37005070020306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in language and vision have demonstrated that scaling up model capacity consistently improves performance across diverse tasks. In 3D visual geometry reconstruction, large-scale training has likewise proven effective for learning versatile representations. However, further scaling of 3D models is challenging due to the complexity of geometric supervision and the diversity of 3D data. To overcome these limitations, we propose MoRE, a dense 3D visual foundation model based on a Mixture-of-Experts (MoE) architecture that dynamically routes features to task-specific experts, allowing them to specialize in complementary aspects of the data and enhancing both scalability and adaptability. To improve robustness under real-world conditions, MoRE incorporates a confidence-based depth refinement module that stabilizes and refines geometric estimation. In addition, it integrates dense semantic features with globally aligned 3D backbone representations for high-fidelity surface normal prediction. MoRE is further optimized with tailored loss functions to ensure robust learning across diverse inputs and multiple geometric tasks. Extensive experiments demonstrate that MoRE achieves state-of-the-art performance across multiple benchmarks and supports effective downstream applications without extra computation.
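To make the routed-expert idea and the confidence-based refinement concrete, here is a minimal PyTorch sketch under stated assumptions: the module names (GeometryMoELayer, refine_depth), the top-k routing scheme, the expert MLP shapes, and the sigmoid confidence gate are all illustrative choices, not the authors' published implementation.

```python
# Minimal sketch (PyTorch) of the two mechanisms named in the abstract:
# (1) a router that dispatches each token to its top-k task experts, and
# (2) a confidence gate that blends a refined depth map with the initial
# estimate. All names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class GeometryMoELayer(nn.Module):
    def __init__(self, dim: int = 256, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # per-token routing logits
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Softmax over the k largest routing logits only.
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):          # dense loops, for clarity
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e        # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

def refine_depth(depth_init, depth_refined, confidence_logits):
    """Blend a refined depth with the initial estimate per pixel, so
    low-confidence refinements fall back to the stable initial value."""
    c = torch.sigmoid(confidence_logits)        # per-pixel confidence in (0, 1)
    return c * depth_refined + (1.0 - c) * depth_init
```

A production MoE layer would typically add a load-balancing auxiliary loss and batched expert dispatch; the per-expert loop above trades throughput for readability.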
Related papers
- Enhanced Mixture 3D CGAN for Completion and Generation of 3D Objects [0.2624902795082451]
The generation and completion of 3D objects represent a transformative challenge in computer vision. In this paper, we investigate the integration of Deep 3D Convolutional GANs with an MoE framework to generate high-quality 3D models.
arXiv Detail & Related papers (2026-02-08T16:32:41Z)
- Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding [80.66591664266744]
Lemon is a unified transformer architecture that processes 3D point cloud patches and language tokens as a single sequence. To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context. Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks.
arXiv Detail & Related papers (2025-12-14T20:02:43Z)
- RAISECity: A Multimodal Agent Framework for Reality-Aligned 3D World Generation at City-Scale [27.629200433201472]
RAISECity is a Reality-Aligned Intelligent Synthesis Engine that creates detailed, city-scale 3D worlds.
arXiv Detail & Related papers (2025-11-22T10:09:22Z)
- ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D Reconstruction with Structured Scene Representation [44.75113949778924]
ARTDECO is a unified framework that combines the efficiency of feed-forward models with the reliability of SLAM-based pipelines. We show that ARTDECO delivers interactive performance comparable to SLAM, robustness similar to feed-forward systems, and reconstruction quality close to per-scene optimization.
arXiv Detail & Related papers (2025-10-09T17:57:38Z)
- DGS-LRM: Real-Time Deformable 3D Gaussian Reconstruction From Monocular Videos [52.46386528202226]
We introduce the Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM), the first feed-forward method to predict deformable 3D Gaussian splats from a posed monocular video of any dynamic scene. It achieves performance on par with state-of-the-art monocular video 3D tracking methods.
arXiv Detail & Related papers (2025-06-11T17:59:58Z)
- GTR: Gaussian Splatting Tracking and Reconstruction of Unknown Objects Based on Appearance and Geometric Complexity [49.31257173003408]
We present a novel method for 6-DoF object tracking and high-quality 3D reconstruction from monocular RGBD video. Our approach demonstrates strong capabilities in recovering high-fidelity object meshes, setting a new standard for single-sensor 3D reconstruction in open-world environments.
arXiv Detail & Related papers (2025-05-17T08:46:29Z)
- SelfSplat: Pose-Free and 3D Prior-Free Generalizable 3D Gaussian Splatting [4.121797302827049]
We propose SelfSplat, a novel 3D Gaussian Splatting model designed to perform pose-free and 3D prior-free generalizable 3D reconstruction from unposed multi-view images. Our model addresses these challenges by effectively integrating explicit 3D representations with self-supervised depth and pose estimation techniques. To demonstrate the performance of our method, we evaluated it on large-scale real-world datasets, including RealEstate10K, ACID, and DL3DV.
arXiv Detail & Related papers (2024-11-26T08:01:50Z)
- GeoLRM: Geometry-Aware Large Reconstruction Model for High-Quality 3D Gaussian Generation [65.33726478659304]
We introduce the Geometry-Aware Large Reconstruction Model (GeoLRM), an approach that can predict high-quality assets with 512k Gaussians from 21 input images in only 11 GB of GPU memory.
Previous works neglect the inherent sparsity of 3D structure and do not exploit explicit geometric relationships between 3D structure and 2D images.
GeoLRM tackles these issues by incorporating a novel 3D-aware transformer structure that directly processes 3D points and uses deformable cross-attention mechanisms (a sketch of this mechanism follows the list below).
arXiv Detail & Related papers (2024-06-21T17:49:31Z)
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments demonstrates our method's consistent superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- Robust Geometry-Preserving Depth Estimation Using Differentiable Rendering [93.94371335579321]
We propose a learning framework that trains models to predict geometry-preserving depth without requiring extra data or annotations.
Comprehensive experiments underscore our framework's superior generalization capabilities.
Our innovative loss functions enable the model to autonomously recover domain-specific scale-and-shift coefficients (a sketch of this alignment step also follows the list below).
arXiv Detail & Related papers (2023-09-18T12:36:39Z)
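For the GeoLRM entry above, here is a minimal sketch of single-head deformable cross-attention in the Deformable-DETR style: each 3D query token samples image features at a few learned offsets around its projected 2D reference point. The module names, the single-level feature map, and the 0.1 offset scale are assumptions for illustration, not GeoLRM's actual architecture.

```python
# Hedged sketch of single-head deformable cross-attention: each 3D
# query samples image features at a few learned offsets around its
# projected 2D reference point. Shapes and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offsets = nn.Linear(dim, num_points * 2)   # (dx, dy) per sample
        self.weights = nn.Linear(dim, num_points)       # attention logits
        self.proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat_map):
        # queries:    (B, Q, dim)    3D query tokens
        # ref_points: (B, Q, 2)      projected 2D locations in [-1, 1]
        # feat_map:   (B, dim, H, W) image features
        B, Q, _ = queries.shape
        offsets = self.offsets(queries).view(B, Q, self.num_points, 2)
        # Small learned offsets around the reference, kept in [-1, 1].
        locs = (ref_points.unsqueeze(2) + 0.1 * offsets.tanh()).clamp(-1, 1)
        # grid_sample treats (Q, num_points) as the sampling grid.
        sampled = F.grid_sample(feat_map, locs, align_corners=False)
        sampled = sampled.permute(0, 2, 3, 1)           # (B, Q, K, dim)
        attn = self.weights(queries).softmax(dim=-1)    # (B, Q, K)
        out = (attn.unsqueeze(-1) * sampled).sum(dim=2) # (B, Q, dim)
        return self.proj(out)
```

Sampling a handful of learned points instead of attending over every pixel is what keeps the cost linear in the number of queries rather than quadratic in image size.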
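For the last entry's scale-and-shift recovery, the standard closed-form step behind affine-invariant depth losses fits a per-image scale s and shift t by least squares, minimizing sum_i (s * d_pred_i + t - d_ref_i)^2 over valid pixels. The function below is a generic illustration of that technique, not the paper's exact loss formulation.

```python
# Hedged sketch: closed-form least-squares recovery of per-image
# scale s and shift t aligning a predicted depth map to a reference.
# Generic illustration of the standard technique, not the paper's code.
import torch

def recover_scale_shift(d_pred: torch.Tensor, d_ref: torch.Tensor,
                        mask: torch.Tensor):
    """Solve the 2x2 normal equations over valid (masked) pixels."""
    p = d_pred[mask]
    r = d_ref[mask]
    n = p.numel()
    # Normal equations: [[sum p^2, sum p], [sum p, n]] @ [s, t] = [sum p*r, sum r]
    a11, a12 = (p * p).sum(), p.sum()
    b1, b2 = (p * r).sum(), r.sum()
    det = a11 * n - a12 * a12
    s = (b1 * n - a12 * b2) / det
    t = (a11 * b2 - a12 * b1) / det
    return s, t

# Usage: align the prediction, then apply any depth loss in the
# reference's scale.
# s, t = recover_scale_shift(d_pred, d_ref, d_ref > 0)
# loss = torch.abs((s * d_pred + t - d_ref)[d_ref > 0]).mean()
```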