A Single Image and Multimodality Is All You Need for Novel View Synthesis
- URL: http://arxiv.org/abs/2602.17909v1
- Date: Fri, 20 Feb 2026 00:13:11 GMT
- Title: A Single Image and Multimodality Is All You Need for Novel View Synthesis
- Authors: Amirhosein Javadi, Chi-Shiang Gau, Konstantinos D. Polyzos, Tara Javidi,
- Abstract summary: We show that incorporating sparse multimodal range measurements provides a simple yet effective way to overcome limitations of diffusion-based approaches. We introduce a multimodal depth reconstruction framework that leverages extremely sparse range sensing data, such as automotive radar or LiDAR. Our approach models depth in an angular domain using a localized Gaussian Process formulation, enabling computationally efficient inference.
- Score: 8.273110298367644
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion-based approaches have recently demonstrated strong performance for single-image novel view synthesis by conditioning generative models on geometry inferred from monocular depth estimation. However, in practice, the quality and consistency of the synthesized views are fundamentally limited by the reliability of the underlying depth estimates, which are often fragile under low texture, adverse weather, and occlusion-heavy real-world conditions. In this work, we show that incorporating sparse multimodal range measurements provides a simple yet effective way to overcome these limitations. We introduce a multimodal depth reconstruction framework that leverages extremely sparse range sensing data, such as automotive radar or LiDAR, to produce dense depth maps that serve as robust geometric conditioning for diffusion-based novel view synthesis. Our approach models depth in an angular domain using a localized Gaussian Process formulation, enabling computationally efficient inference while explicitly quantifying uncertainty in regions with limited observations. The reconstructed depth and uncertainty are used as a drop-in replacement for monocular depth estimators in existing diffusion-based rendering pipelines, without modifying the generative model itself. Experiments on real-world multimodal driving scenes demonstrate that replacing vision-only depth with our sparse range-based reconstruction substantially improves both geometric consistency and visual quality in single-image novel-view video generation. These results highlight the importance of reliable geometric priors for diffusion-based view synthesis and demonstrate the practical benefits of multimodal sensing even at extreme levels of sparsity.
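To make the abstract's core idea concrete, here is a minimal sketch of localized Gaussian Process depth reconstruction over a 1D angular domain. It is an illustration only: the RBF kernel, unit signal variance, per-query k-nearest-neighbor windowing, and 1D angular parameterization are assumptions for this sketch, not details taken from the paper.

```python
import numpy as np

def localized_gp_depth(theta_obs, d_obs, theta_query, k=8,
                       length_scale=0.05, noise_var=1e-2):
    """Reconstruct dense depth over viewing angles from sparse range samples.

    For each query angle, a small GP is fit to its k nearest observed
    angles (the "localized" part), giving a posterior mean depth and a
    predictive variance that grows away from the measurements.
    """
    def rbf(a, b):
        # Squared-exponential kernel on angular distance (unit signal variance).
        diff = a[:, None] - b[None, :]
        return np.exp(-0.5 * (diff / length_scale) ** 2)

    mean = np.empty_like(theta_query)
    var = np.empty_like(theta_query)
    for i, tq in enumerate(theta_query):
        # Localize: use only the k closest range measurements.
        idx = np.argsort(np.abs(theta_obs - tq))[:k]
        t, d = theta_obs[idx], d_obs[idx]
        K = rbf(t, t) + noise_var * np.eye(len(t))   # local Gram matrix
        k_star = rbf(np.array([tq]), t)[0]           # cross-covariances
        mean[i] = k_star @ np.linalg.solve(K, d)     # posterior mean
        var[i] = 1.0 - k_star @ np.linalg.solve(K, k_star)  # posterior variance
    return mean, var

# Example: 12 sparse radar-like returns over roughly a 60-degree field of view.
rng = np.random.default_rng(0)
theta_obs = np.sort(rng.uniform(-0.5, 0.5, 12))
d_obs = 10.0 + 2.0 * np.sin(4.0 * theta_obs) + 0.05 * rng.standard_normal(12)
theta_query = np.linspace(-0.5, 0.5, 256)
depth, uncertainty = localized_gp_depth(theta_obs, d_obs, theta_query)
```

In the pipeline the abstract describes, the posterior mean would stand in for a monocular depth map as geometric conditioning for the diffusion model, and the posterior variance would flag weakly observed angular regions.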
Related papers
- MaterialRefGS: Reflective Gaussian Splatting with Multi-view Consistent Material Inference [83.38607296779423]
We show that multi-view consistent material inference with more physically-based environment modeling is key to learning accurate reflections with Gaussian Splatting. Our method faithfully recovers both illumination and geometry, achieving state-of-the-art rendering quality in novel view synthesis.
arXiv Detail & Related papers (2025-10-13T13:29:20Z)
- JointSplat: Probabilistic Joint Flow-Depth Optimization for Sparse-View Gaussian Splatting [10.690965024885358]
Reconstructing 3D scenes from sparse viewpoints is a long-standing challenge with wide applications. Recent advances in feed-forward 3D Gaussian sparse-view reconstruction methods provide an efficient solution for real-time novel view synthesis. We propose JointSplat, a unified framework that leverages the complementarity between optical flow and depth.
arXiv Detail & Related papers (2025-06-04T12:04:40Z)
- MonoInstance: Enhancing Monocular Priors via Multi-view Instance Alignment for Neural Rendering and Reconstruction [45.70946415376022]
Monocular depth priors have been widely adopted by neural rendering in multi-view based tasks such as 3D reconstruction and novel view synthesis. Current methods treat the entire estimated depth map indiscriminately, and use it as ground truth supervision. We propose MonoInstance, a general approach that explores the uncertainty of monocular depths to provide enhanced geometric priors.
arXiv Detail & Related papers (2025-03-24T05:58:06Z)
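The MonoInstance summary hinges on not trusting every pixel of an estimated depth map equally. A minimal sketch of that general idea (not the paper's actual algorithm, which derives uncertainty from multi-view instance alignment): down-weight a depth supervision loss by a per-pixel uncertainty map, so unreliable monocular predictions contribute less.

```python
import numpy as np

def uncertainty_weighted_depth_loss(rendered_depth, prior_depth, uncertainty,
                                    eps=1e-6):
    """Depth supervision that trusts the monocular prior less where its
    per-pixel uncertainty is high. `uncertainty` is assumed to be a
    nonnegative map of the same shape as the depth images.
    """
    weights = 1.0 / (uncertainty + eps)   # confident pixels dominate the loss
    weights /= weights.mean()             # keep the overall loss scale stable
    return np.mean(weights * (rendered_depth - prior_depth) ** 2)
```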
- Multi-view Reconstruction via SfM-guided Monocular Depth Estimation [92.89227629434316]
We present a new method for multi-view geometric reconstruction. We incorporate SfM information, a strong multi-view prior, into the depth estimation process. Our method significantly improves the quality of depth estimation compared to previous monocular depth estimation works.
arXiv Detail & Related papers (2025-03-18T17:54:06Z)
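One common way to incorporate SfM information into monocular depth, and a reasonable reading of the summary above, is to align a monocular depth map to sparse SfM points with a least-squares scale and shift. The sketch below shows that generic alignment step; the referenced paper integrates SfM guidance more deeply than this.

```python
import numpy as np

def align_depth_to_sfm(mono_depth, sfm_depth, sfm_mask):
    """Fit scale s and shift t minimizing ||s * d_mono + t - d_sfm||^2
    over pixels where sparse SfM depth exists, then apply them globally.
    """
    d = mono_depth[sfm_mask]                  # monocular depth at SfM pixels
    g = sfm_depth[sfm_mask]                   # sparse metric SfM depths
    A = np.stack([d, np.ones_like(d)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * mono_depth + t                 # metrically aligned dense depth
```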
- TensoIR: Tensorial Inverse Rendering [51.57268311847087]
TensoIR is a novel inverse rendering approach based on tensor factorization and neural fields.
TensoRF is a state-of-the-art approach for radiance field modeling.
arXiv Detail & Related papers (2023-04-24T21:39:13Z)
- DeLiRa: Self-Supervised Depth, Light, and Radiance Fields [32.350984950639656]
Differentiable volumetric rendering is a powerful paradigm for 3D reconstruction and novel view synthesis.
Standard volume rendering approaches struggle with degenerate geometries in the case of limited viewpoint diversity.
In this work, we propose to use the multi-view photometric objective as a geometric regularizer for volumetric rendering.
arXiv Detail & Related papers (2023-04-06T00:16:25Z)
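DeLiRa's summary describes reusing the multi-view photometric objective as a geometric regularizer. Below is a minimal sketch of such an objective under simplifying assumptions (a pinhole camera, known relative pose, RGB images, nearest-neighbor sampling, and no occlusion handling); the paper's formulation is richer.

```python
import numpy as np

def photometric_loss(img_tgt, img_src, depth_tgt, K, T_tgt_to_src):
    """Warp the source image into the target view using predicted depth and
    relative pose, then compare photometrically. Low error means the depth
    is multi-view consistent, which is the regularization signal.

    img_tgt, img_src: (h, w, 3) arrays; depth_tgt: (h, w);
    K: 3x3 intrinsics; T_tgt_to_src: 4x4 relative pose.
    """
    h, w = depth_tgt.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    # Back-project target pixels to 3D and move them into the source frame.
    pts = np.linalg.inv(K) @ pix * depth_tgt.reshape(-1)
    pts = T_tgt_to_src[:3, :3] @ pts + T_tgt_to_src[:3, 3:4]
    # Project into the source image; nearest-neighbor lookup for brevity
    # (real implementations interpolate and mask occluded/out-of-view pixels).
    proj = K @ pts
    us = np.clip((proj[0] / proj[2]).round().astype(int), 0, w - 1)
    vs = np.clip((proj[1] / proj[2]).round().astype(int), 0, h - 1)
    warped = img_src[vs, us].reshape(h, w, -1)
    return np.abs(warped - img_tgt).mean()
```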
- MonoSDF: Exploring Monocular Geometric Cues for Neural Implicit Surface Reconstruction [72.05649682685197]
State-of-the-art neural implicit methods allow for high-quality reconstructions of simple scenes from many input views, but their quality degrades in larger, more complex, or sparsely observed scenes. This is caused primarily by the inherent ambiguity in the RGB reconstruction loss, which does not provide enough constraints.
Motivated by recent advances in the area of monocular geometry prediction, we explore the utility these cues provide for improving neural implicit surface reconstruction.
arXiv Detail & Related papers (2022-06-01T17:58:15Z)
- InfoNeRF: Ray Entropy Minimization for Few-Shot Neural Volume Rendering [55.70938412352287]
We present an information-theoretic regularization technique for few-shot novel view synthesis based on neural implicit representation.
The proposed approach minimizes potential reconstruction inconsistency that arises from insufficient viewpoints.
We achieve consistently improved performance compared to existing neural view synthesis methods by large margins on multiple standard benchmarks.
arXiv Detail & Related papers (2021-12-31T11:56:01Z)
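InfoNeRF's regularizer penalizes rays whose density is smeared along their length. Here is a minimal sketch of a ray-entropy term over volume-rendering weights; the paper's exact normalization and ray-masking details are omitted.

```python
import numpy as np

def ray_entropy_loss(weights, eps=1e-10):
    """Shannon entropy of each ray's normalized compositing weights.

    `weights` has shape (num_rays, num_samples), e.g. the alpha-compositing
    weights from volume rendering. A ray whose mass concentrates at a single
    surface has low entropy; a smeared, inconsistent ray has high entropy,
    so minimizing this term sharpens geometry under few input views.
    """
    p = weights / (weights.sum(axis=1, keepdims=True) + eps)
    entropy = -(p * np.log(p + eps)).sum(axis=1)
    return entropy.mean()
```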
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)