Related papers: SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix

SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix

URL: http://arxiv.org/abs/2407.00367v1
Date: Sat, 29 Jun 2024 08:33:55 GMT
Title: SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix
Authors: Peng Dai, Feitong Tan, Qiangeng Xu, David Futschik, Ruofei Du, Sean Fanello, Xiaojuan Qi, Yinda Zhang,
Abstract summary: We propose a pose-free and training-free approach for generating 3D stereoscopic videos. Our method warps a generated monocular video into camera views on stereoscopic baseline using estimated video depth. We develop a disocclusion boundary re-injection scheme that further improves the quality of video inpainting.
Score: 60.48666051245761
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video generation models have demonstrated great capabilities of producing impressive monocular videos, however, the generation of 3D stereoscopic video remains under-explored. We propose a pose-free and training-free approach for generating 3D stereoscopic videos using an off-the-shelf monocular video generation model. Our method warps a generated monocular video into camera views on stereoscopic baseline using estimated video depth, and employs a novel frame matrix video inpainting framework. The framework leverages the video generation model to inpaint frames observed from different timestamps and views. This effective approach generates consistent and semantically coherent stereoscopic videos without scene optimization or model fine-tuning. Moreover, we develop a disocclusion boundary re-injection scheme that further improves the quality of video inpainting by alleviating the negative effects propagated from disoccluded areas in the latent space. We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, including Sora [4 ], Lumiere [2], WALT [8 ], and Zeroscope [ 42]. The experiments demonstrate that our method has a significant improvement over previous methods. The code will be released at \url{https://daipengwa.github.io/SVG_ProjectPage}.

Related papers

Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians [7.051077403685518]
Humans excel at forecasting the future dynamics of a scene given just a single image.<n>Video generation models that can mimic this ability are an essential component for intelligent systems.<n>Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation.
arXiv Detail & Related papers (2026-01-02T13:04:47Z)
View-Consistent Diffusion Representations for 3D-Consistent Video Generation [60.68052293389281]
Current generated videos still contain visual artifacts arising from 3D inconsistencies.<n>We propose ViCoDR, a new approach for improving the 3D consistency of video models by learning multi-view consistent diffusion representations.
arXiv Detail & Related papers (2025-11-24T11:16:55Z)
S^2VG: 3D Stereoscopic and Spatial Video Generation via Denoising Frame Matrix [60.060882467801484]
We present a pose-free and training-free method that leverages an off-the-shelf monocular video generation model to produce immersive 3D videos.<n>Our approach first warps the generated monocular video into pre-defined camera viewpoints using estimated depth information, then applies a novel textitframe matrix inpainting framework.<n>We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, such as Sora, Lumiere, WALT, and Zeroscope.
arXiv Detail & Related papers (2025-08-11T14:50:03Z)
Restereo: Diffusion stereo video generation and restoration [43.208256051997616]
We introduce a new pipeline that not only generates stereo videos but also enhances both left-view and right-view videos consistently with a single model.<n>Our method can be fine-tuned on a relatively small synthetic stereo video datasets and applied to low-quality real-world videos.
arXiv Detail & Related papers (2025-06-06T12:14:24Z)
PS4PRO: Pixel-to-pixel Supervision for Photorealistic Rendering and Optimization [3.53658451351123]
We propose to use video frame as the data augmentation method for neural rendering.<n>PS4PRO is trained on diverse video datasets, implicitly modeling camera movement as well as real-world 3D geometry.<n>Our results indicate that our method improves the reconstruction performance on both static and dynamic scenes.
arXiv Detail & Related papers (2025-05-28T17:35:39Z)
IM-Portrait: Learning 3D-aware Video Diffusion for Photorealistic Talking Heads from Monocular Videos [33.12653115668027]
Our method generates Multiplane Images (MPIs) that ensure geometric consistency. Our approach directly generates the final output through a single denoising process. To effectively learn from monocular videos, we introduce a training mechanism that reconstructs the output MPI randomly in either the target or the reference camera space.
arXiv Detail & Related papers (2025-04-27T08:56:02Z)
I2V3D: Controllable image-to-video generation with 3D guidance [42.23117201457898]
IV23D is a framework for animating static images into dynamic videos with precise 3D control. Our approach combines the precision of a computer graphics pipeline with advanced generative models.
arXiv Detail & Related papers (2025-03-12T18:26:34Z)
VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation [62.64811405314847]
VidCRAFT3 is a novel framework for precise image-to-video generation. It enables control over camera motion, object motion, and lighting direction simultaneously. It produces high-quality video content, outperforming state-of-the-art methods in control granularity and visual coherence.
arXiv Detail & Related papers (2025-02-11T13:11:59Z)
Wonderland: Navigating 3D Scenes from a Single Image [43.99037613068823]
We introduce a large-scale reconstruction model that leverages latents from a video diffusion model to predict 3D Gaussian Splattings of scenes in a feed-forward manner. We train the 3D reconstruction model to operate on the video latent space with a progressive learning strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes.
arXiv Detail & Related papers (2024-12-16T18:58:17Z)
InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models [75.03495065452955]
We present InfiniCube, a scalable method for generating dynamic 3D driving scenes with high fidelity and controllability. Our method can generate controllable and realistic 3D driving scenes, and extensive experiments validate the effectiveness and superiority of our model.
arXiv Detail & Related papers (2024-12-05T07:32:20Z)
Generating 3D-Consistent Videos from Unposed Internet Photos [68.944029293283]
We train a scalable, 3D-aware video model without any 3D annotations such as camera parameters. Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.
arXiv Detail & Related papers (2024-11-20T18:58:31Z)
StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos [44.51044100125421]
This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experience. Our framework demonstrates significant improvements in 2D-to-3D video conversion, offering a practical solution for creating immersive content for 3D devices like Apple Vision Pro and 3D displays.
arXiv Detail & Related papers (2024-09-11T17:52:07Z)
ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis [63.169364481672915]
We propose textbfViewCrafter, a novel method for synthesizing high-fidelity novel views of generic scenes from single or sparse images. Our method takes advantage of the powerful generation capabilities of video diffusion model and the coarse 3D clues offered by point-based representation to generate high-quality video frames.
arXiv Detail & Related papers (2024-09-03T16:53:19Z)
Scene123: One Prompt to 3D Scene Generation via Video-Assisted and Consistency-Enhanced MAE [22.072200443502457]
We propose Scene123, a 3D scene generation model that ensures realism and diversity through the video generation framework. Specifically, we warp the input image (or an image generated from text) to simulate adjacent views, filling the invisible areas with the MAE model. To further enhance the details and texture fidelity of generated views, we employ a GAN-based Loss against images derived from the input image through the video generation model.
arXiv Detail & Related papers (2024-08-10T08:09:57Z)
WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos. Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions. We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion. Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z)
Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting [94.84688557937123]
Video-3DGS is a 3D Gaussian Splatting (3DGS)-based video refiner designed to enhance temporal consistency in zero-shot video editors. Our approach utilizes a two-stage 3D Gaussian optimizing process tailored for editing dynamic monocular videos. It enhances video editing by ensuring temporal consistency across 58 dynamic monocular videos.
arXiv Detail & Related papers (2024-06-04T17:57:37Z)
Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation [35.52770785430601]
We propose a novel hybrid video autoencoder, called HVtemporalDM, which can capture intricate dependencies more effectively. The HVDM is trained by a hybrid video autoencoder which extracts a disentangled representation of the video. Our hybrid autoencoder provide a more comprehensive video latent enriching the generated videos with fine structures and details.
arXiv Detail & Related papers (2024-02-21T11:46:16Z)
VGMShield: Mitigating Misuse of Video Generative Models [7.963591895964269]
We introduce VGMShield: a set of three straightforward but pioneering mitigations through the lifecycle of fake video generation. We first try to understand whether there is uniqueness in generated videos and whether we can differentiate them from real videos. Then, we investigate the textittracing problem, which maps a fake video back to a model that generates it.
arXiv Detail & Related papers (2024-02-20T16:39:23Z)
Video Autoencoder: self-supervised disentanglement of static 3D structure and motion [60.58836145375273]
A video autoencoder is proposed for learning disentan- gled representations of 3D structure and camera pose from videos. The representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following.
arXiv Detail & Related papers (2021-10-06T17:57:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.