Reconstruct, Inpaint, Finetune: Dynamic Novel-view Synthesis from Monocular Videos
- URL: http://arxiv.org/abs/2507.12646v1
- Date: Wed, 16 Jul 2025 21:40:29 GMT
- Title: Reconstruct, Inpaint, Finetune: Dynamic Novel-view Synthesis from Monocular Videos
- Authors: Kaihua Chen, Tarasha Khurana, Deva Ramanan
- Abstract summary: We explore novel-view synthesis for dynamic scenes from monocular videos. Our approach is based on three key insights. We empirically verify that CogNVS outperforms almost all prior art for novel-view synthesis of dynamic scenes from monocular videos.
- Score: 44.36499624938911
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We explore novel-view synthesis for dynamic scenes from monocular videos. Prior approaches rely on costly test-time optimization of 4D representations or do not preserve scene geometry when trained in a feed-forward manner. Our approach is based on three key insights: (1) covisible pixels (that are visible in both the input and target views) can be rendered by first reconstructing the dynamic 3D scene and rendering the reconstruction from the novel-views and (2) hidden pixels in novel views can be "inpainted" with feed-forward 2D video diffusion models. Notably, our video inpainting diffusion model (CogNVS) can be self-supervised from 2D videos, allowing us to train it on a large corpus of in-the-wild videos. This in turn allows for (3) CogNVS to be applied zero-shot to novel test videos via test-time finetuning. We empirically verify that CogNVS outperforms almost all prior art for novel-view synthesis of dynamic scenes from monocular videos.
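The reconstruct-then-inpaint split described in the abstract, with covisible pixels taken from the rendered reconstruction and hidden pixels filled by an inpainter, can be illustrated with a minimal NumPy sketch. The function names are illustrative, not from the paper's code, and a toy mean-color filler stands in for the CogNVS video diffusion inpainter:

```python
import numpy as np

def mean_fill(render, covis_mask):
    """Toy inpainter: fill hidden pixels with the mean covisible color.
    (Stand-in for a learned 2D video inpainting diffusion model.)"""
    mean_color = render[covis_mask].mean(axis=0)  # average over valid pixels
    out = render.copy()
    out[~covis_mask] = mean_color
    return out

def composite_with_inpainting(render, covis_mask, inpaint_fn):
    """Keep rendered colors where pixels are covisible (seen in both the
    input and target views); hand the rest to the inpainter."""
    filled = inpaint_fn(render, covis_mask)
    return np.where(covis_mask[..., None], render, filled)
```

Here `render` is an (H, W, 3) image rendered from the reconstructed dynamic scene at the novel view, and `covis_mask` is an (H, W) boolean mask of pixels that the reconstruction could actually render; everything outside the mask is delegated to the inpainter.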
Related papers
- Voyaging into Unbounded Dynamic Scenes from a Single View [31.85867311855001]
We propose DynamicVoyager, which reformulates dynamic scene generation as a scene outpainting process for new dynamic content. We render a partial video at a novel view and outpaint it with ray contexts from the point cloud to generate 3D-consistent motions. Experiments show that our model is able to generate unbounded scenes with consistent motions along fly-through cameras.
arXiv Detail & Related papers (2025-07-05T22:49:25Z)
- CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models [98.03734318657848]
We present CAT4D, a method for creating 4D (dynamic 3D) scenes from monocular video. We leverage a multi-view video diffusion model trained on a diverse combination of datasets to enable novel view synthesis. We demonstrate competitive performance on novel view synthesis and dynamic scene reconstruction benchmarks.
arXiv Detail & Related papers (2024-11-27T18:57:16Z)
- FreeVS: Generative View Synthesis on Free Driving Trajectory [55.49370963413221]
FreeVS is a fully generative approach that can synthesize camera views along free, novel trajectories in real driving scenes.
FreeVS can be applied to any validation sequence without a reconstruction process, and can synthesize views on novel trajectories.
arXiv Detail & Related papers (2024-10-23T17:59:11Z)
- Forecasting Future Videos from Novel Views via Disentangled 3D Scene Representation [54.60804602905519]
We learn an entangled representation, aiming to model layered scene geometry, motion forecasting and novel view synthesis together.
Our approach chooses to disentangle scene geometry from scene motion, via lifting the 2D scene to 3D point clouds.
To model future 3D scene motion, we propose a disentangled two-stage approach that initially forecasts ego-motion and subsequently the residual motion of dynamic objects.
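The two-stage motion decomposition above (forecast ego-motion first, then the residual motion of dynamic objects) can be made concrete on point clouds. A minimal sketch, assuming points are stored as (N, 3) arrays and ego-motion as a 4x4 rigid transform; all names are illustrative rather than taken from the paper:

```python
import numpy as np

def apply_ego_motion(points, T_ego):
    """Stage 1: move every 3D point by the forecast camera ego-motion
    (rotation plus translation from a 4x4 rigid transform)."""
    return points @ T_ego[:3, :3].T + T_ego[:3, 3]

def residual_motion(points_next, points_now, T_ego):
    """Stage 2: whatever ego-motion does not explain is attributed to
    the residual motion of dynamic objects."""
    return points_next - apply_ego_motion(points_now, T_ego)
```

For a fully static scene, the residual is zero by construction: all apparent motion is explained by the camera.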
arXiv Detail & Related papers (2024-07-31T08:54:50Z) - OSN: Infinite Representations of Dynamic 3D Scenes from Monocular Videos [7.616167860385134]
It has long been challenging to recover the underlying dynamic 3D scene representations from a monocular RGB video.
We introduce a new framework, called OSN, to learn all plausible 3D scene configurations that match the input video.
Our method demonstrates a clear advantage in learning fine-grained 3D scene geometry.
arXiv Detail & Related papers (2024-07-08T05:03:46Z) - iNVS: Repurposing Diffusion Inpainters for Novel View Synthesis [45.88928345042103]
We present a method for generating consistent novel views from a single source image.
Our approach focuses on maximizing the reuse of visible pixels from the source image.
We use a monocular depth estimator that transfers visible pixels from the source view to the target view.
arXiv Detail & Related papers (2023-10-24T20:33:19Z) - SparseGNV: Generating Novel Views of Indoor Scenes with Sparse Input
Views [16.72880076920758]
We present SparseGNV, a learning framework that incorporates 3D structures and image generative models to generate novel views.
SparseGNV is trained across a large indoor scene dataset to learn generalizable priors.
It can efficiently generate novel views of an unseen indoor scene in a feed-forward manner.
arXiv Detail & Related papers (2023-05-11T17:58:37Z) - Text-To-4D Dynamic Scene Generation [111.89517759596345]
We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions.
Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency.
The dynamic video output generated from the provided text can be viewed from any camera location and angle, and can be composited into any 3D environment.
arXiv Detail & Related papers (2023-01-26T18:14:32Z) - Vid2Actor: Free-viewpoint Animatable Person Synthesis from Video in the
Wild [22.881898195409885]
Given an "in-the-wild" video of a person, we reconstruct an animatable model of the person in the video.
The output model can be rendered in any body pose to any camera view, via the learned controls, without explicit 3D mesh reconstruction.
arXiv Detail & Related papers (2020-12-23T18:50:42Z) - Non-Rigid Neural Radiance Fields: Reconstruction and Novel View
Synthesis of a Dynamic Scene From Monocular Video [76.19076002661157]
Non-Rigid Neural Radiance Fields (NR-NeRF) is a reconstruction and novel view synthesis approach for general non-rigid dynamic scenes.
We show that even a single consumer-grade camera is sufficient to synthesize sophisticated renderings of a dynamic scene from novel virtual camera views.
arXiv Detail & Related papers (2020-12-22T18:46:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.