CausNVS: Autoregressive Multi-view Diffusion for Flexible 3D Novel View Synthesis
- URL: http://arxiv.org/abs/2509.06579v1
- Date: Mon, 08 Sep 2025 11:49:51 GMT
- Title: CausNVS: Autoregressive Multi-view Diffusion for Flexible 3D Novel View Synthesis
- Authors: Xin Kong, Daniel Watson, Yannick Strümpler, Michael Niemeyer, Federico Tombari
- Abstract summary: CausNVS is a multi-view diffusion model in an autoregressive setting. It supports arbitrary input-output view configurations and generates views sequentially, achieving consistently strong visual quality across diverse settings.
- Score: 48.43677384182078
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-view diffusion models have shown promise in 3D novel view synthesis, but most existing methods adopt a non-autoregressive formulation. This limits their applicability in world modeling, as they only support a fixed number of views and suffer from slow inference due to denoising all frames simultaneously. To address these limitations, we propose CausNVS, a multi-view diffusion model in an autoregressive setting, which supports arbitrary input-output view configurations and generates views sequentially. We train CausNVS with causal masking and per-frame noise, using pairwise-relative camera pose encodings (CaPE) for precise camera control. At inference time, we combine a spatially-aware sliding-window with key-value caching and noise conditioning augmentation to mitigate drift. Our experiments demonstrate that CausNVS supports a broad range of camera trajectories, enables flexible autoregressive novel view synthesis, and achieves consistently strong visual quality across diverse settings. Project page: https://kxhit.github.io/CausNVS.html.
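The training recipe described in the abstract (causal masking across frames combined with an independent noise level per frame) can be sketched as below. This is a minimal illustration under our own assumptions, not the authors' implementation; the function names and the choice of a block-causal mask over flattened frame tokens are hypothetical.

```python
import numpy as np

def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Block-causal attention mask over flattened frame tokens: full
    attention within a frame, causal across frames (tokens of frame i
    may attend to frames 0..i). True = attention allowed."""
    frame_idx = np.repeat(np.arange(num_frames), tokens_per_frame)
    return frame_idx[:, None] >= frame_idx[None, :]

def per_frame_noise_levels(num_frames: int,
                           rng: np.random.Generator) -> np.ndarray:
    """Sample an independent diffusion noise level in [0, 1) for each
    frame, so earlier (conditioning) frames can be kept cleaner than
    the frames currently being denoised."""
    return rng.random(num_frames)

# 3 frames, 2 tokens each -> a 6x6 mask.
mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
noise = per_frame_noise_levels(3, np.random.default_rng(0))
```

At inference, such a mask is what makes key-value caching possible: past frames never attend to future ones, so their keys and values can be cached and reused inside a sliding window rather than recomputed.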
Related papers
- OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis [80.3346344429389]
We introduce OmniView, a unified framework that generalizes across a wide range of 4D consistency tasks. Our method separately represents space, time, and view conditions, enabling flexible combinations of these inputs. For example, OmniView can synthesize novel views from static, dynamic, and multiview inputs, extrapolate trajectories forward and backward in time, and create videos from text or image prompts with full camera control.
arXiv Detail & Related papers (2025-12-11T18:59:05Z) - CloseUpShot: Close-up Novel View Synthesis from Sparse-views via Point-conditioned Diffusion Model [50.93869080795228]
Reconstructing 3D scenes and synthesizing novel views from sparse input views is a highly challenging task. Recent advances in video diffusion models have demonstrated strong temporal reasoning capabilities. We present a diffusion-based framework, called CloseUpShot, for close-up novel view synthesis from sparse inputs via point-conditioned video diffusion.
arXiv Detail & Related papers (2025-11-17T08:20:06Z) - Scaling Sequence-to-Sequence Generative Neural Rendering [37.23230422802279]
Kaleido is a family of generative models designed for photorealistic, unified object- and scene-level neural rendering. We introduce key architectural innovations that enable our model to: perform generative view synthesis without explicit 3D representations; generate any number of 6-DoF target views conditioned on any number of reference views; and seamlessly unify 3D and video modelling within a single decoder-only rectified flow transformer.
arXiv Detail & Related papers (2025-10-05T15:03:31Z) - ARSS: Taming Decoder-only Autoregressive Visual Generation for View Synthesis From Single View [11.346049532150127]
ARSS is a framework that generates novel views from a single image conditioned on a camera trajectory. Our method performs comparably to, or better than, state-of-the-art view synthesis approaches based on diffusion models.
arXiv Detail & Related papers (2025-09-27T00:03:09Z) - Look Beyond: Two-Stage Scene View Generation via Panorama and Video Diffusion [2.5479056464266994]
Novel view synthesis (NVS) from a single image is highly ill-posed due to large unobserved regions. We propose a model that addresses this by decomposing single-view NVS into a 360-degree scene extrapolation followed by novel view synthesis. Our approach outperforms existing methods in generating coherent views along user-defined trajectories.
arXiv Detail & Related papers (2025-08-31T13:27:15Z) - Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models [83.76517697509156]
This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. We propose a novel iterative sliding denoising process to enhance view-temporal consistency of the 4D diffusion model. Our method is able to synthesize high-quality and consistent novel-view videos and significantly outperforms existing approaches.
arXiv Detail & Related papers (2025-07-17T17:59:17Z) - AnySplat: Feed-forward 3D Gaussian Splatting from Unconstrained Views [57.13066710710485]
AnySplat is a feed-forward network for novel view synthesis from uncalibrated image collections. A single forward pass yields a set of 3D Gaussian primitives encoding both scene geometry and appearance. In extensive zero-shot evaluations, AnySplat matches the quality of pose-aware baselines in both sparse and dense view scenarios.
arXiv Detail & Related papers (2025-05-29T17:49:56Z) - Stable Virtual Camera: Generative View Synthesis with Diffusion Models [51.71244310522393]
We present Stable Virtual Camera (Seva), a generalist diffusion model that creates novel views of a scene. Our approach overcomes these limitations through simple model design, an optimized training recipe, and a flexible sampling strategy. Our method can generate high-quality videos lasting up to half a minute with seamless loop closure.
arXiv Detail & Related papers (2025-03-18T17:57:22Z) - MultiDiff: Consistent Novel View Synthesis from a Single Image [60.04215655745264]
MultiDiff is a novel approach for consistent novel view synthesis of scenes from a single RGB image.
Our results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet.
arXiv Detail & Related papers (2024-06-26T17:53:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.