Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting
- URL: http://arxiv.org/abs/2406.02541v3
- Date: Thu, 6 Jun 2024 01:40:56 GMT
- Title: Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting
- Authors: Inkyu Shin, Qihang Yu, Xiaohui Shen, In So Kweon, Kuk-Jin Yoon, Liang-Chieh Chen,
- Abstract summary: Video-3DGS is a 3D Gaussian Splatting (3DGS)-based video refiner designed to enhance temporal consistency in zero-shot video editors.
Our approach utilizes a two-stage 3D Gaussian optimizing process tailored for editing dynamic monocular videos.
It enhances video editing by ensuring temporal consistency across 58 dynamic monocular videos.
- Score: 94.84688557937123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in zero-shot video diffusion models have shown promise for text-driven video editing, but challenges remain in achieving high temporal consistency. To address this, we introduce Video-3DGS, a 3D Gaussian Splatting (3DGS)-based video refiner designed to enhance temporal consistency in zero-shot video editors. Our approach utilizes a two-stage 3D Gaussian optimizing process tailored for editing dynamic monocular videos. In the first stage, Video-3DGS employs an improved version of COLMAP, referred to as MC-COLMAP, which processes original videos using a Masked and Clipped approach. For each video clip, MC-COLMAP generates the point clouds for dynamic foreground objects and complex backgrounds. These point clouds are utilized to initialize two sets of 3D Gaussians (Frg-3DGS and Bkg-3DGS) aiming to represent foreground and background views. Both foreground and background views are then merged with a 2D learnable parameter map to reconstruct full views. In the second stage, we leverage the reconstruction ability developed in the first stage to impose the temporal constraints on the video diffusion model. To demonstrate the efficacy of Video-3DGS on both stages, we conduct extensive experiments across two related tasks: Video Reconstruction and Video Editing. Video-3DGS trained with 3k iterations significantly improves video reconstruction quality (+3 PSNR, +7 PSNR increase) and training efficiency (x1.9, x4.5 times faster) over NeRF-based and 3DGS-based state-of-art methods on DAVIS dataset, respectively. Moreover, it enhances video editing by ensuring temporal consistency across 58 dynamic monocular videos.
Related papers
- StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos [44.51044100125421]
This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experience.
Our framework demonstrates significant improvements in 2D-to-3D video conversion, offering a practical solution for creating immersive content for 3D devices like Apple Vision Pro and 3D displays.
arXiv Detail & Related papers (2024-09-11T17:52:07Z) - CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [55.515836117658985]
We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer.
It can generate 10-second continuous videos aligned with text prompt, with a frame rate of 16 fps and resolution of 768 * 1360 pixels.
arXiv Detail & Related papers (2024-08-12T11:47:11Z) - 3DEgo: 3D Editing on the Go! [6.072473323242202]
We introduce 3DEgo to address a novel problem of directly synthesizing 3D scenes from monocular videos guided by textual prompts.
Our framework streamlines the conventional multi-stage 3D editing process into a single-stage workflow.
3DEgo demonstrates remarkable editing precision, speed, and adaptability across a variety of video sources.
arXiv Detail & Related papers (2024-07-14T07:03:50Z) - SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix [60.48666051245761]
We propose a pose-free and training-free approach for generating 3D stereoscopic videos.
Our method warps a generated monocular video into camera views on stereoscopic baseline using estimated video depth.
We develop a disocclusion boundary re-injection scheme that further improves the quality of video inpainting.
arXiv Detail & Related papers (2024-06-29T08:33:55Z) - Splatter a Video: Video Gaussian Representation for Versatile Processing [48.9887736125712]
Video representation is crucial for various down-stream tasks, such as tracking,depth prediction,segmentation,view synthesis,and editing.
We introduce a novel explicit 3D representation-video Gaussian representation -- that embeds a video into 3D Gaussians.
It has been proven effective in numerous video processing tasks, including tracking, consistent video depth and feature refinement, motion and appearance editing, and stereoscopic video generation.
arXiv Detail & Related papers (2024-06-19T22:20:03Z) - Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation [35.52770785430601]
We propose a novel hybrid video autoencoder, called HVtemporalDM, which can capture intricate dependencies more effectively.
The HVDM is trained by a hybrid video autoencoder which extracts a disentangled representation of the video.
Our hybrid autoencoder provide a more comprehensive video latent enriching the generated videos with fine structures and details.
arXiv Detail & Related papers (2024-02-21T11:46:16Z) - DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and
View-Change Human-Centric Video Editing [48.086102360155856]
We introduce the dynamic Neural Radiance Fields (NeRF) as the innovative video representation.
We propose the image-based video-NeRF editing pipeline with a set of innovative designs to provide consistent and controllable editing.
Our method, dubbed as DynVideo-E, significantly outperforms SOTA approaches on two challenging datasets by a large margin of 50% 95% for human preference.
arXiv Detail & Related papers (2023-10-16T17:48:10Z) - OmnimatteRF: Robust Omnimatte with 3D Background Modeling [42.844343885602214]
We propose a novel video matting method, OmnimatteRF, that combines dynamic 2D foreground layers and a 3D background model.
The 2D layers preserve the details of the subjects, while the 3D background robustly reconstructs scenes in real-world videos.
arXiv Detail & Related papers (2023-09-14T14:36:22Z) - Video Autoencoder: self-supervised disentanglement of static 3D
structure and motion [60.58836145375273]
A video autoencoder is proposed for learning disentan- gled representations of 3D structure and camera pose from videos.
The representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following.
arXiv Detail & Related papers (2021-10-06T17:57:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.