UMAMI: Unifying Masked Autoregressive Models and Deterministic Rendering for View Synthesis
- URL: http://arxiv.org/abs/2512.20107v1
- Date: Tue, 23 Dec 2025 07:08:00 GMT
- Title: UMAMI: Unifying Masked Autoregressive Models and Deterministic Rendering for View Synthesis
- Authors: Thanh-Tung Le, Tuan Pham, Tung Nguyen, Deying Kong, Xiaohui Xie, Stephan Mandt,
- Abstract summary: Novel view synthesis (NVS) seeks to render photorealistic, 3D-consistent images of a scene from unseen camera poses given only a sparse set of posed views. Existing deterministic networks render observed regions quickly but blur unobserved areas, whereas diffusion-based methods hallucinate plausible content yet incur heavy training-time costs. We propose a hybrid framework that unifies the strengths of both paradigms. A bidirectional transformer encodes multi-view image tokens and Plücker-ray embeddings, producing a shared latent representation.
- Score: 28.245380116188883
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Novel view synthesis (NVS) seeks to render photorealistic, 3D-consistent images of a scene from unseen camera poses given only a sparse set of posed views. Existing deterministic networks render observed regions quickly but blur unobserved areas, whereas stochastic diffusion-based methods hallucinate plausible content yet incur heavy training- and inference-time costs. In this paper, we propose a hybrid framework that unifies the strengths of both paradigms. A bidirectional transformer encodes multi-view image tokens and Plücker-ray embeddings, producing a shared latent representation. Two lightweight heads then act on this representation: (i) a feed-forward regression head that renders pixels where geometry is well constrained, and (ii) a masked autoregressive diffusion head that completes occluded or unseen regions. The entire model is trained end-to-end with joint photometric and diffusion losses, without handcrafted 3D inductive biases, enabling scalability across diverse scenes. Experiments demonstrate that our method attains state-of-the-art image quality while reducing rendering time by an order of magnitude compared with fully generative baselines.
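The abstract's two-head design maps onto a short code sketch. Below is a minimal, illustrative PyTorch example of the described pattern: a shared bidirectional transformer over image-patch tokens plus Plücker-ray embeddings, a feed-forward regression head, and a simple per-token denoiser standing in for the masked autoregressive diffusion head, trained with a joint photometric and diffusion loss. All class names, dimensions, and the noise schedule are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumed names/shapes), not the UMAMI code release.
import torch
import torch.nn as nn


class HybridNVS(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8, patch_dim=3 * 16 * 16, ray_dim=6):
        super().__init__()
        # Project RGB patches and per-patch Plücker ray coordinates into one token space.
        self.patch_embed = nn.Linear(patch_dim, dim)
        self.ray_embed = nn.Linear(ray_dim, dim)
        # Bidirectional (non-causal) transformer produces the shared latent representation.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # (i) Feed-forward regression head: deterministic RGB prediction per token.
        self.regress_head = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, patch_dim))
        # (ii) Stand-in for the masked autoregressive diffusion head: a per-token
        # noise predictor conditioned on the shared latent and a timestep.
        self.denoise_head = nn.Sequential(
            nn.Linear(dim + patch_dim + 1, dim), nn.GELU(), nn.Linear(dim, patch_dim))

    def forward(self, patches, rays, noisy_target, t):
        # patches: (B, N, patch_dim), rays: (B, N, ray_dim), t: (B,)
        tokens = self.patch_embed(patches) + self.ray_embed(rays)
        latent = self.encoder(tokens)                          # shared latent
        rgb = self.regress_head(latent)                        # deterministic render
        t_feat = t.view(-1, 1, 1).expand(-1, latent.size(1), 1)
        eps = self.denoise_head(torch.cat([latent, noisy_target, t_feat], dim=-1))
        return rgb, eps


def joint_loss(model, patches, rays, target, lam=1.0):
    # End-to-end objective in the spirit of the abstract: photometric loss on the
    # regression head plus a standard epsilon-prediction diffusion loss.
    t = torch.rand(patches.size(0))                            # timestep in [0, 1)
    noise = torch.randn_like(target)
    a = (1.0 - t).sqrt().view(-1, 1, 1)
    b = t.sqrt().view(-1, 1, 1)
    noisy = a * target + b * noise                             # assumed simple noise schedule
    rgb, eps = model(patches, rays, noisy, t)
    return (rgb - target).pow(2).mean() + lam * (eps - noise).pow(2).mean()


# Usage on random tensors: 2 scenes, 64 tokens of 16x16 RGB patches each.
model = HybridNVS()
loss = joint_loss(model,
                  torch.randn(2, 64, 3 * 16 * 16),   # image-patch tokens
                  torch.randn(2, 64, 6),             # Plücker ray coordinates
                  torch.randn(2, 64, 3 * 16 * 16))   # target-view patches
loss.backward()
```

In the paper, the diffusion head completes only masked or unseen regions and samples them autoregressively at inference; the sketch collapses that into a single per-token denoiser to keep the example self-contained.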
Related papers
- PoseCraft: Tokenized 3D Body Landmark and Camera Conditioning for Photorealistic Human Image Synthesis [11.578542607236857]
Existing skinning-based avatars require laborious manual rigging or template-based fitting. We present PoseCraft, a diffusion framework built around a tokenized 3D interface. Our experiments show that PoseCraft achieves significant perceptual quality improvement over diffusion-centric methods.
arXiv Detail & Related papers (2026-02-22T21:50:24Z) - InpaintHuman: Reconstructing Occluded Humans with Multi-Scale UV Mapping and Identity-Preserving Diffusion Inpainting [64.42884719282323]
InpaintHuman is a novel method for generating high-fidelity, complete, and animatable avatars from occluded monocular videos. Our approach employs direct pixel-level supervision to ensure identity fidelity.
arXiv Detail & Related papers (2026-01-05T13:26:02Z) - MultiDiff: Consistent Novel View Synthesis from a Single Image [60.04215655745264]
MultiDiff is a novel approach for consistent novel view synthesis of scenes from a single RGB image.
Our results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet.
arXiv Detail & Related papers (2024-06-26T17:53:51Z) - Denoising Diffusion via Image-Based Rendering [54.20828696348574]
We introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes.
First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes.
Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images.
arXiv Detail & Related papers (2024-02-05T19:00:45Z) - SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction [26.165314261806603]
We propose SparseFusion, a sparse view 3D reconstruction approach that unifies recent advances in neural rendering and probabilistic image generation.
Existing approaches typically build on neural rendering with re-projected features but fail to generate unseen regions or handle uncertainty under large viewpoint changes.
arXiv Detail & Related papers (2022-12-01T18:59:55Z) - Vision Transformer for NeRF-Based View Synthesis from a Single Input Image [49.956005709863355]
We propose to leverage both the global and local features to form an expressive 3D representation.
To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering (a generic volume-rendering sketch appears after this list).
Our method can render novel views from only a single input image and generalize across multiple object categories using a single model.
arXiv Detail & Related papers (2022-07-12T17:52:04Z) - LiP-Flow: Learning Inference-time Priors for Codec Avatars via Normalizing Flows in Latent Space [90.74976459491303]
We introduce a prior model that is conditioned on the runtime inputs and tie this prior space to the 3D face model via a normalizing flow in the latent space.
A normalizing flow bridges the two representation spaces and transforms latent samples from one domain to another, allowing us to define a latent likelihood objective.
We show that our approach leads to an expressive and effective prior, capturing facial dynamics and subtle expressions better.
arXiv Detail & Related papers (2022-03-15T13:22:57Z) - GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis [43.4859484191223]
We propose a generative model for radiance fields which have recently proven successful for novel view synthesis of a single scene.
By introducing a multi-scale patch-based discriminator, we demonstrate synthesis of high-resolution images while training our model from unposed 2D images alone.
arXiv Detail & Related papers (2020-07-05T20:37:39Z) - Intrinsic Autoencoders for Joint Neural Rendering and Intrinsic Image Decomposition [67.9464567157846]
We propose an autoencoder for joint generation of realistic images from synthetic 3D models while simultaneously decomposing real images into their intrinsic shape and appearance properties.
Our experiments confirm that a joint treatment of rendering and decomposition is indeed beneficial and that our approach outperforms state-of-the-art image-to-image translation baselines both qualitatively and quantitatively.
arXiv Detail & Related papers (2020-06-29T12:53:58Z)
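The Vision Transformer for NeRF-Based View Synthesis entry above describes an MLP conditioned on a learned 3D representation that performs volume rendering. The PyTorch sketch below illustrates that general pattern under assumed names and shapes (ConditionedRadianceMLP, render_rays, a single feature vector per ray); it is not that paper's implementation.

```python
# Generic sketch of MLP-conditioned volume rendering; names and shapes are assumptions.
import torch
import torch.nn as nn


class ConditionedRadianceMLP(nn.Module):
    def __init__(self, feat_dim=128, hidden=128):
        super().__init__()
        # Input: 3D sample position + a feature queried from the learned 3D
        # representation; output: volume density and RGB color.
        self.net = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, xyz, feat):
        out = self.net(torch.cat([xyz, feat], dim=-1))
        sigma = torch.relu(out[..., :1])        # non-negative density
        rgb = torch.sigmoid(out[..., 1:])       # colors in [0, 1]
        return sigma, rgb


def render_rays(mlp, origins, dirs, feats, near=0.1, far=4.0, n_samples=64):
    # dirs are assumed unit-norm; depths are sampled uniformly along each ray.
    t = torch.linspace(near, far, n_samples)                         # (S,)
    pts = origins[:, None, :] + dirs[:, None, :] * t[None, :, None]  # (R, S, 3)
    sigma, rgb = mlp(pts, feats[:, None, :].expand(-1, n_samples, -1))
    # Standard volume-rendering quadrature: alpha compositing along the ray.
    delta = torch.full_like(sigma[..., 0], (far - near) / n_samples)
    alpha = 1.0 - torch.exp(-sigma[..., 0] * delta)                  # (R, S)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * trans                                          # (R, S)
    return (weights[..., None] * rgb).sum(dim=1)                     # (R, 3) rendered color


# Usage with random inputs: 1024 rays, one 128-d feature per ray.
mlp = ConditionedRadianceMLP()
colors = render_rays(mlp,
                     origins=torch.zeros(1024, 3),
                     dirs=torch.nn.functional.normalize(torch.randn(1024, 3), dim=-1),
                     feats=torch.randn(1024, 128))
print(colors.shape)  # torch.Size([1024, 3])
```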
This list is automatically generated from the titles and abstracts of the papers on this site.