Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting
- URL: http://arxiv.org/abs/2504.11092v2
- Date: Fri, 18 Apr 2025 05:42:44 GMT
- Title: Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting
- Authors: Jiaxin Huang, Sheng Miao, BangBang Yang, Yuewen Ma, Yiyi Liao
- Abstract summary: We introduce Vivid4D, a novel approach that enhances 4D monocular video synthesis by augmenting observation views. This reformulates view augmentation as a video inpainting task, where observed views are warped into new viewpoints. Experiments demonstrate that our method effectively improves monocular 4D scene reconstruction and completion.
- Score: 26.54811754399946
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reconstructing 4D dynamic scenes from casually captured monocular videos is valuable but highly challenging, as each timestamp is observed from a single viewpoint. We introduce Vivid4D, a novel approach that enhances 4D monocular video synthesis by augmenting observation views - synthesizing multi-view videos from a monocular input. Unlike existing methods that either solely leverage geometric priors for supervision or use generative priors while overlooking geometry, we integrate both. This reformulates view augmentation as a video inpainting task, where observed views are warped into new viewpoints based on monocular depth priors. To achieve this, we train a video inpainting model on unposed web videos with synthetically generated masks that mimic warping occlusions, ensuring spatially and temporally consistent completion of missing regions. To further mitigate inaccuracies in monocular depth priors, we introduce an iterative view augmentation strategy and a robust reconstruction loss. Experiments demonstrate that our method effectively improves monocular 4D scene reconstruction and completion. See our project page: https://xdimlab.github.io/Vivid4D/.
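As a concrete illustration of the warping step described in the abstract, below is a minimal sketch (our own, not the authors' released code) of forward-warping an observed frame into a novel viewpoint with a monocular depth map; the pixels that receive no source content form exactly the kind of occlusion mask the video inpainting model is then asked to complete. The intrinsics `K` and relative pose `(R, t)` are assumed to be given.

```python
import numpy as np

def warp_to_novel_view(image, depth, K, R, t):
    """Forward-warp `image` (H, W, 3) with `depth` (H, W) into the camera
    given by the relative pose (R, t). Returns the warped image and a
    boolean mask that is True where no source pixel landed, i.e. the
    disoccluded regions to be inpainted."""
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    # Unproject every pixel to a 3D point in the source camera frame.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    # Move the points into the target camera and project them.
    proj = K @ (R @ pts + t.reshape(3, 1))
    z = proj[2]
    x = np.round(proj[0] / np.maximum(z, 1e-6)).astype(int)
    y = np.round(proj[1] / np.maximum(z, 1e-6)).astype(int)
    ok = (z > 0) & (x >= 0) & (x < W) & (y >= 0) & (y < H)

    warped = np.zeros_like(image)
    hit = np.zeros((H, W), dtype=bool)
    src = image.reshape(-1, 3)
    # Splat far-to-near so nearer surfaces overwrite farther ones.
    order = np.argsort(-z[ok])
    xs, ys, cs = x[ok][order], y[ok][order], src[ok][order]
    warped[ys, xs] = cs
    hit[ys, xs] = True
    return warped, ~hit  # ~hit is the occlusion mask for inpainting
```

The paper additionally trains the inpainting model on synthetic masks that mimic such occlusions; the sketch only produces the inputs that pipeline would consume.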
Related papers
- FreeOrbit4D: Training-Free Arbitrary Camera Redirection for Monocular Videos via Geometry-Complete 4D Reconstruction [40.47706321464456]
FreeOrbit4D is an effective training-free framework that tackles geometric ambiguity by recovering a geometry-complete 4D proxy as structural grounding for video generation. Our experiments show that FreeOrbit4D produces more faithful videos under challenging large-angle camera redirection.
arXiv Detail & Related papers (2026-01-26T22:03:46Z) - LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models [52.656349227001925]
Given a monocular video, the goal of video re-rendering is to generate views of the scene from a novel camera trajectory. Existing methods face two distinct challenges. We propose to address these challenges by using the implicit geometric knowledge embedded in the latent space of a large 4D reconstruction model.
arXiv Detail & Related papers (2026-01-21T05:46:03Z) - SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting [83.5106058182799]
We introduce SEE4D, a pose-free, trajectory-to-camera framework for 4D world modeling from casual videos. A view-conditional video inpainting model is trained to learn a robust geometry prior for denoising realistically synthesized images. We validate SEE4D on cross-view video generation and sparse reconstruction benchmarks.
arXiv Detail & Related papers (2025-10-30T17:59:39Z) - 4D Driving Scene Generation With Stereo Forcing [62.47705572424127]
Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. We present PhiGenesis, a unified framework for 4D scene generation that extends video generation techniques with geometric and temporal consistency.
arXiv Detail & Related papers (2025-09-24T15:37:17Z) - S^2VG: 3D Stereoscopic and Spatial Video Generation via Denoising Frame Matrix [60.060882467801484]
We present a pose-free and training-free method that leverages an off-the-shelf monocular video generation model to produce immersive 3D videos. Our approach first warps the generated monocular video into pre-defined camera viewpoints using estimated depth information, then applies a novel frame matrix inpainting framework. We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, such as Sora, Lumiere, WALT, and Zeroscope.
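The frame matrix idea can be pictured as a grid of warped frames indexed by (viewpoint, time) that is inpainted along both axes. The sketch below is a hedged simplification under our own assumptions (the paper's actual framework differs in detail); `inpaint_fn` stands in for an off-the-shelf video inpainting model.

```python
import numpy as np

def frame_matrix_inpaint(frames, masks, inpaint_fn, n_rounds=2):
    """frames: (V, T, H, W, 3) warped views; masks: (V, T, H, W), True
    where content is missing. `inpaint_fn(seq, seq_masks)` fills one
    frame sequence and returns it."""
    V, T = frames.shape[:2]
    for _ in range(n_rounds):
        for v in range(V):            # rows: fixed viewpoint, varying time
            frames[v] = inpaint_fn(frames[v], masks[v])
        for t in range(T):            # columns: fixed time, varying viewpoint
            frames[:, t] = inpaint_fn(frames[:, t], masks[:, t])
    return frames
```

Alternating between temporal rows and spatial columns is one simple way to keep the filled content consistent across both dimensions.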
arXiv Detail & Related papers (2025-08-11T14:50:03Z) - Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models [83.76517697509156]
This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. We propose a novel iterative sliding denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. Our method is able to synthesize high-quality and consistent novel-view videos and significantly outperforms the existing approaches.
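One way to picture an iterative sliding denoising process is the loop below; this is an assumption-laden sketch, with `denoise_step` a hypothetical stand-in for one step of the 4D diffusion model and overlap averaging used as a simple consistency mechanism, not necessarily the paper's.

```python
import numpy as np

def sliding_denoise(latents, denoise_step, timesteps, win=8, stride=4):
    """latents: (N, ...) noisy latents along the view/time axis.
    `denoise_step(x, t)` runs one denoising step on a window."""
    N = latents.shape[0]
    # Window starts, with the tail forced so every index is covered.
    starts = sorted(set(range(0, max(N - win, 0) + 1, stride)) | {max(N - win, 0)})
    for t in timesteps:
        acc = np.zeros_like(latents)
        cnt = np.zeros((N,) + (1,) * (latents.ndim - 1))
        for s in starts:
            acc[s:s + win] += denoise_step(latents[s:s + win], t)
            cnt[s:s + win] += 1
        latents = acc / cnt  # average the overlapping windows
    return latents
```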
arXiv Detail & Related papers (2025-07-17T17:59:17Z) - Reconstruct, Inpaint, Finetune: Dynamic Novel-view Synthesis from Monocular Videos [44.36499624938911]
We explore novel-view synthesis for dynamic scenes from monocular videos. Our approach is based on three key insights. We empirically verify that CogNVS outperforms almost all prior art for novel-view synthesis of dynamic scenes from monocular videos.
arXiv Detail & Related papers (2025-07-16T21:40:29Z) - ViDAR: Video Diffusion-Aware 4D Reconstruction From Monocular Inputs [7.3647304690955915]
We introduce Video-Aware Diffusion Reconstruction (ViDAR), a novel 4D reconstruction framework. ViDAR recovers fine-grained appearance details while mitigating artefacts introduced by monocular ambiguity. Experiments on DyCheck, a challenging benchmark with extreme viewpoint variation, show that ViDAR outperforms all state-of-the-art baselines.
arXiv Detail & Related papers (2025-06-23T16:01:15Z) - Eye2Eye: A Simple Approach for Monocular-to-Stereo Video Synthesis [45.64047250474718]
Despite advances in video synthesis, creating 3D videos remains challenging due to the relative scarcity of 3D video data.
We propose a simple approach for transforming a text-to-video generator into a video-to-stereo generator.
Our framework automatically produces the video frames from a shifted viewpoint, enabling a compelling 3D effect.
arXiv Detail & Related papers (2025-04-30T19:06:09Z) - Can Video Diffusion Model Reconstruct 4D Geometry? [66.5454886982702]
Sora3R is a novel framework that taps into the rich spatio-temporal priors of large dynamic video diffusion models to infer 4D pointmaps from casual videos. Experiments demonstrate that Sora3R reliably recovers both camera poses and detailed scene geometry, achieving performance on par with state-of-the-art methods for dynamic 4D reconstruction.
arXiv Detail & Related papers (2025-03-27T01:44:46Z) - CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models [98.03734318657848]
We present CAT4D, a method for creating 4D (dynamic 3D) scenes from monocular video. We leverage a multi-view video diffusion model trained on a diverse combination of datasets to enable novel view synthesis. We demonstrate competitive performance on novel view synthesis and dynamic scene reconstruction benchmarks.
arXiv Detail & Related papers (2024-11-27T18:57:16Z) - SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix [60.48666051245761]
We propose a pose-free and training-free approach for generating 3D stereoscopic videos.
Our method warps a generated monocular video into camera views on a stereoscopic baseline using estimated video depth.
We develop a disocclusion boundary re-injection scheme that further improves the quality of video inpainting.
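For the stereo case, the warp reduces to a horizontal shift by disparity. Below is a minimal sketch of that first step under standard pinhole assumptions (`focal` in pixels, `baseline` in scene units); it only marks disocclusions as holes, whereas the paper goes further with its boundary re-injection scheme.

```python
import numpy as np

def stereo_warp(image, depth, baseline, focal):
    """Shift pixels by disparity = focal * baseline / depth to synthesize
    the other eye's view. Returns the warped view and a mask of
    disoccluded pixels left for inpainting."""
    H, W = depth.shape
    disparity = focal * baseline / np.maximum(depth, 1e-6)
    right = np.zeros_like(image)
    hit = np.zeros((H, W), dtype=bool)
    x_src = np.arange(W)
    for y in range(H):
        x_dst = np.round(x_src - disparity[y]).astype(int)
        ok = (x_dst >= 0) & (x_dst < W)
        order = np.argsort(-depth[y][ok])  # far-to-near: near pixels win
        right[y, x_dst[ok][order]] = image[y, x_src[ok][order]]
        hit[y, x_dst[ok][order]] = True
    return right, ~hit
```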
arXiv Detail & Related papers (2024-06-29T08:33:55Z) - Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer [13.969883154405995]
We propose a novel learning approach for feed-forward one-shot 4D head avatar synthesis.
We employ pseudo multi-view videos to learn a 4D head synthesizer in a data-driven manner.
arXiv Detail & Related papers (2024-03-20T13:09:54Z) - DRSM: Efficient Neural 4D Decomposition for Dynamic Reconstruction in Stationary Monocular Cameras [21.07910546072467]
We present a novel framework to tackle the 4D decomposition problem for dynamic scenes in monocular cameras.
Our framework utilizes decomposed static and dynamic feature planes to represent 4D scenes and emphasizes the learning of dynamic regions through dense ray casting.
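The decomposed plane representation can be illustrated as follows; this is a simplified, nearest-neighbour sketch in the spirit of K-Planes-style factorizations, not DRSM's implementation. Static planes factor space (xy, xz, yz), dynamic planes factor space-time (xt, yt, zt), and a 4D query's feature is the product of its samples from each plane.

```python
import numpy as np

class FourDPlanes:
    def __init__(self, res=64, channels=16, seed=0):
        rng = np.random.default_rng(seed)
        # Three static spatial planes and three dynamic space-time planes.
        self.planes = {p: rng.normal(size=(res, res, channels)).astype(np.float32)
                       for p in ("xy", "xz", "yz", "xt", "yt", "zt")}
        self.res = res
        self.channels = channels

    def query(self, x, y, z, t):
        """x, y, z, t in [0, 1]; returns a (channels,) feature vector."""
        coords = {"x": x, "y": y, "z": z, "t": t}
        idx = lambda v: min(int(v * self.res), self.res - 1)
        feat = np.ones(self.channels, dtype=np.float32)
        for name, plane in self.planes.items():
            # Hadamard product of the per-plane samples.
            feat = feat * plane[idx(coords[name[0]]), idx(coords[name[1]])]
        return feat
```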
arXiv Detail & Related papers (2024-02-01T16:38:51Z) - Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image [59.18564636990079]
We study the problem of synthesizing a long-term dynamic video from only a single image.
Existing methods either hallucinate inconsistent perpetual views or struggle with long camera trajectories.
We present Make-It-4D, a novel method that can generate a consistent long-term dynamic video from a single image.
arXiv Detail & Related papers (2023-08-20T12:53:50Z) - Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis [76.72505510632904]
We present Total-Recon, the first method to reconstruct deformable scenes from long monocular RGBD videos.
Our method hierarchically decomposes the scene into the background and objects, whose motion is decomposed into root-body motion and local articulations.
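The hierarchical motion model can be pictured as composing a per-object root-body transform with skinning-weighted local articulations; the sketch below is illustrative (our own simplification, not Total-Recon's code), assuming part transforms and blend weights at a given timestep are provided.

```python
import numpy as np

def se3(R, t):
    """Pack rotation R (3, 3) and translation t (3,) into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def deform_point(x_canonical, T_root, part_transforms, weights):
    """Map a canonical object point to world space at one timestep:
    blend the articulated part transforms (linear-blend-skinning style,
    weights summing to 1), then apply the root-body motion."""
    x = np.append(x_canonical, 1.0)
    x_art = sum(w * (T @ x) for w, T in zip(weights, part_transforms))
    return (T_root @ x_art)[:3]
```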
arXiv Detail & Related papers (2023-04-24T17:59:52Z) - State of the Art in Dense Monocular Non-Rigid 3D Reconstruction [100.9586977875698]
3D reconstruction of deformable (or non-rigid) scenes from a set of monocular 2D image observations is a long-standing and actively researched area of computer vision and graphics.
This survey focuses on state-of-the-art methods for dense non-rigid 3D reconstruction of various deformable objects and composite scenes from monocular videos or sets of monocular views.
arXiv Detail & Related papers (2022-10-27T17:59:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.