Stable Virtual Camera: Generative View Synthesis with Diffusion Models
- URL: http://arxiv.org/abs/2503.14489v2
- Date: Tue, 01 Apr 2025 18:22:54 GMT
- Title: Stable Virtual Camera: Generative View Synthesis with Diffusion Models
- Authors: Jensen Zhou, Hang Gao, Vikram Voleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, Varun Jampani
- Abstract summary: We present Stable Virtual Camera (Seva), a generalist diffusion model that creates novel views of a scene. Our approach overcomes the limitations of prior work through simple model design, optimized training recipe, and flexible sampling strategy. Our method can generate high-quality videos lasting up to half a minute with seamless loop closure.
- Score: 51.71244310522393
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Stable Virtual Camera (Seva), a generalist diffusion model that creates novel views of a scene, given any number of input views and target cameras. Existing works struggle to generate either large viewpoint changes or temporally smooth samples, while relying on specific task configurations. Our approach overcomes these limitations through simple model design, optimized training recipe, and flexible sampling strategy that generalize across view synthesis tasks at test time. As a result, our samples maintain high consistency without requiring additional 3D representation-based distillation, thus streamlining view synthesis in the wild. Furthermore, we show that our method can generate high-quality videos lasting up to half a minute with seamless loop closure. Extensive benchmarking demonstrates that Seva outperforms existing methods across different datasets and settings. Project page with code and model: https://stable-virtual-camera.github.io/.
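To make the input/output contract concrete, below is a minimal sketch of the kind of interface such a model consumes: a set of input views plus an arbitrary target camera trajectory (here a simple circular orbit). Only the pose construction is concrete math; `SevaSampler` and its arguments are hypothetical placeholders for illustration, not the released API.

```python
# Build a circular orbit of target camera-to-world poses (the kind of trajectory a
# generative view-synthesis model consumes), then hand them to a sampler.
import numpy as np

def look_at(eye, target=np.zeros(3), up=np.array([0.0, 1.0, 0.0])):
    """Camera-to-world matrix with the camera at `eye` looking at `target`
    (OpenCV-style axes: +x right, +y down, +z forward)."""
    forward = target - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    down = np.cross(forward, right)
    c2w = np.eye(4)
    c2w[:3, 0], c2w[:3, 1], c2w[:3, 2], c2w[:3, 3] = right, down, forward, eye
    return c2w

radius, height, n_frames = 2.0, 0.5, 30
target_poses = [
    look_at(np.array([radius * np.cos(a), height, radius * np.sin(a)]))
    for a in np.linspace(0.0, 2.0 * np.pi, n_frames, endpoint=False)
]

# Hypothetical call (class name and arguments are assumptions, not the released API):
# frames = SevaSampler.from_pretrained("stable-virtual-camera").sample(
#     input_images=["view_0.png"],
#     input_poses=[look_at(np.array([radius, height, 0.0]))],
#     target_poses=target_poses)
```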
Related papers
- Next-Scale Autoregressive Models are Zero-Shot Single-Image Object View Synthesizers [4.015569252776372]
ArchonView significantly outperforms state-of-the-art methods despite being trained from scratch on 3D rendering data only, with no 2D pretraining.
Our model also performs robustly on difficult camera poses where previous methods fail, and is several times faster at inference than diffusion-based methods.
arXiv Detail & Related papers (2025-03-17T17:59:59Z)
- Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation [61.040832373015014]
We propose Flex3D, a novel framework for generating high-quality 3D content from text, single images, or sparse view images.
In the first stage, we employ a fine-tuned multi-view image diffusion model and a video diffusion model to generate a pool of candidate views, enabling a rich representation of the target 3D object.
In the second stage, the curated views are fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs.
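As a rough illustration of this two-stage design, the toy sketch below wires together candidate-view generation, curation by a quality score, and reconstruction from however many views survive curation. All functions are numpy stand-ins for the paper's components, not Flex3D's actual models or API.

```python
# Schematic two-stage feed-forward 3D generation pipeline (toy stand-ins, not Flex3D's code):
# stage 1 generates and curates candidate views, stage 2 reconstructs from the survivors.
import numpy as np

rng = np.random.default_rng(0)

def generate_candidate_views(n_views=32, hw=64):
    """Stage 1a stand-in: a pool of candidate views (in the paper, sampled from
    multi-view image and video diffusion models)."""
    return [rng.random((hw, hw, 3)) for _ in range(n_views)]

def view_quality(view):
    """Stage 1b stand-in: a scalar score used to drop weak or inconsistent candidates."""
    return float(view.std())  # placeholder heuristic, not the paper's quality measure

def reconstruct(views):
    """Stage 2 stand-in for FlexRM: accepts an arbitrary number of views."""
    return np.mean(np.stack(views), axis=0)  # placeholder output

candidates = generate_candidate_views()
curated = sorted(candidates, key=view_quality, reverse=True)[:16]
asset = reconstruct(curated)
```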
arXiv Detail & Related papers (2024-10-01T17:29:43Z)
- MultiDiff: Consistent Novel View Synthesis from a Single Image [60.04215655745264]
MultiDiff is a novel approach for consistent novel view synthesis of scenes from a single RGB image.
Our results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet.
arXiv Detail & Related papers (2024-06-26T17:53:51Z)
- PolyOculus: Simultaneous Multi-view Image-based Novel View Synthesis [23.967904337714234]
We propose a set-based generative model that can simultaneously generate multiple, self-consistent new views.
Our approach is not limited to generating a single image at a time and can condition on a variable number of views.
We show that the model is capable of generating sets of views that have no natural ordering, like loops and binocular trajectories, and significantly outperforms other methods on such tasks.
arXiv Detail & Related papers (2024-02-28T02:06:11Z)
- Fast Non-Rigid Radiance Fields from Monocularized Data [66.74229489512683]
This paper proposes a new method for full 360° inward-facing novel view synthesis of non-rigidly deforming scenes.
At the core of our method are 1) An efficient deformation module that decouples the processing of spatial and temporal information for accelerated training and inference; and 2) A static module representing the canonical scene as a fast hash-encoded neural radiance field.
Our method is significantly faster than previous methods, converging in less than 7 minutes and achieving real-time framerates at 1K resolution, while obtaining higher visual accuracy for the generated novel views.
arXiv Detail & Related papers (2022-12-02T18:51:10Z)
- Novel View Synthesis with Diffusion Models [56.55571338854636]
We present 3DiM, a diffusion model for 3D novel view synthesis.
It is able to translate a single input view into consistent and sharp completions across many views.
3DiM can generate multiple views that are 3D consistent using a novel technique called stochastic conditioning.
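A minimal toy sketch of the stochastic-conditioning idea follows: at every denoising step the conditioning frame is re-drawn at random from the views already available, so each new view is (stochastically) conditioned on all of them. The denoiser, noise schedule, and tensor shapes below are illustrative placeholders, not 3DiM's actual implementation.

```python
# Illustrative stochastic-conditioning sampler for multi-view diffusion
# (toy denoiser and DDPM-style schedule; not 3DiM's code).
import numpy as np

rng = np.random.default_rng(0)
H = W = 32                      # toy resolution
T = 50                          # number of denoising steps
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoise_step(x_t, cond_view, t):
    """Placeholder pose-conditioned denoiser: predicts noise in x_t given one
    conditioning view (a real model would also take both camera poses)."""
    eps_hat = x_t - cond_view    # toy "noise prediction"
    return (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])

# Views we already trust: the input view(s) plus any frames generated so far.
known_views = [rng.standard_normal((H, W, 3))]

# Sample one novel view with stochastic conditioning.
x = rng.standard_normal((H, W, 3))          # start from pure noise
for t in reversed(range(T)):
    cond = known_views[rng.integers(len(known_views))]  # re-draw the conditioning view each step
    x = denoise_step(x, cond, t)
    if t > 0:
        x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)

known_views.append(x)  # the finished view can condition later views
```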
arXiv Detail & Related papers (2022-10-06T16:59:56Z)
- Free View Synthesis [100.86844680362196]
We present a method for novel view synthesis from input images that are freely distributed around a scene.
Our method does not rely on a regular arrangement of input views, can synthesize images for free camera movement through the scene, and works for general scenes with unconstrained geometric layouts.
arXiv Detail & Related papers (2020-08-12T18:16:08Z)