DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and
View-Change Human-Centric Video Editing
- URL: http://arxiv.org/abs/2310.10624v2
- Date: Thu, 7 Dec 2023 16:32:47 GMT
- Title: DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and
View-Change Human-Centric Video Editing
- Authors: Jia-Wei Liu, Yan-Pei Cao, Jay Zhangjie Wu, Weijia Mao, Yuchao Gu, Rui
Zhao, Jussi Keppo, Ying Shan, Mike Zheng Shou
- Abstract summary: We introduce dynamic Neural Radiance Fields (NeRF) as an innovative video representation.
We propose an image-based video-NeRF editing pipeline with a set of innovative designs to provide consistent and controllable editing.
Our method, dubbed DynVideo-E, significantly outperforms SOTA approaches on two challenging datasets by a large margin of 50% ~ 95% in human preference.
- Score: 48.086102360155856
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent progress in diffusion-based video editing, existing methods
are limited to short-length videos due to the contradiction between long-range
consistency and frame-wise editing. Prior attempts to address this challenge by
introducing video-2D representations encounter significant difficulties with
large-scale motion- and view-change videos, especially in human-centric
scenarios. To overcome this, we propose to introduce dynamic Neural
Radiance Fields (NeRF) as an innovative video representation, where editing
can be performed in 3D space and propagated to the entire video via the
deformation field. To provide consistent and controllable editing, we propose
an image-based video-NeRF editing pipeline with a set of innovative designs,
including multi-view multi-pose Score Distillation Sampling (SDS) from both
the 2D personalized diffusion prior and the 3D diffusion prior, reconstruction
losses, text-guided local parts super-resolution, and style transfer. Extensive
experiments demonstrate that our method, dubbed DynVideo-E, significantly
outperforms SOTA approaches on two challenging datasets by a large margin of
50% ~ 95% in human preference. Code will be released at
https://showlab.github.io/DynVideo-E/.
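The abstract describes the core of the editing pipeline: edits are made once in a canonical 3D space represented by a dynamic NeRF, propagated to every frame through the deformation field, and optimized with multi-view multi-pose SDS from 2D and 3D diffusion priors plus reconstruction losses. The sketch below is a minimal, hypothetical PyTorch illustration of such an optimization loop, not the authors' released code: `DummyRenderer`, `DummyDiffusionPrior`, the toy noise schedule, and all tensor shapes are placeholder assumptions, intended only to show how SDS gradients from two priors and a reconstruction term could jointly update an edited canonical representation.

```python
import torch
import torch.nn.functional as F

# Hypothetical placeholders: in DynVideo-E these would be the edited canonical
# NeRF + deformation-field renderer and the 2D personalized / 3D diffusion priors.
# Here they are dummy modules so the sketch runs end to end.
class DummyRenderer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.canonical = torch.nn.Parameter(torch.randn(1, 3, 64, 64) * 0.1)

    def forward(self, camera, pose):
        # A real renderer would warp rays through the deformation field into the
        # canonical 3D space and volume-render the edited radiance field.
        return torch.sigmoid(self.canonical)

class DummyDiffusionPrior(torch.nn.Module):
    def forward(self, noisy_image, t, text_embedding):
        # A real prior would return the predicted noise eps_phi(x_t; y, t).
        return torch.zeros_like(noisy_image)

def sds_loss(prior, image, text_embedding, num_timesteps=1000):
    """Standard Score Distillation Sampling surrogate loss (DreamFusion-style)."""
    t = torch.randint(1, num_timesteps, (1,))
    alpha_bar = 1.0 - t.float() / num_timesteps           # toy noise schedule
    noise = torch.randn_like(image)
    noisy = alpha_bar.sqrt() * image + (1 - alpha_bar).sqrt() * noise
    with torch.no_grad():
        pred_noise = prior(noisy, t, text_embedding)
    grad = pred_noise - noise                              # w(t) omitted for brevity
    # Surrogate whose gradient w.r.t. `image` equals `grad`.
    target = (image - grad).detach()
    return 0.5 * F.mse_loss(image, target, reduction="sum")

renderer = DummyRenderer()
prior_2d, prior_3d = DummyDiffusionPrior(), DummyDiffusionPrior()
optimizer = torch.optim.Adam(renderer.parameters(), lr=1e-3)
text_emb = torch.randn(1, 77, 768)                         # hypothetical text embedding
reference = torch.rand(1, 3, 64, 64)                       # unedited content for the recon term

for step in range(10):
    camera, pose = None, None                              # sampled multi-view / multi-pose inputs
    rendering = renderer(camera, pose)
    loss = (sds_loss(prior_2d, rendering, text_emb)        # 2D personalized prior
            + sds_loss(prior_3d, rendering, text_emb)      # 3D prior
            + F.mse_loss(rendering, reference))            # reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In the paper's setting, the cameras and human poses would be sampled so that the priors supervise the edited subject from many viewpoints and articulations, which is what the two SDS terms above stand in for.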
Related papers
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [55.515836117658985]
We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer.
It can generate 10-second continuous videos aligned with text prompt, with a frame rate of 16 fps and resolution of 768 * 1360 pixels.
arXiv Detail & Related papers (2024-08-12T11:47:11Z)
- Vivid-ZOO: Multi-View Video Generation with Diffusion Model [76.96449336578286]
New challenges lie in the lack of massive captioned multi-view videos and the complexity of modeling such multi-dimensional distribution.
We propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text.
arXiv Detail & Related papers (2024-06-12T21:44:04Z)
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting [94.84688557937123]
Video-3DGS is a 3D Gaussian Splatting (3DGS)-based video refiner designed to enhance temporal consistency in zero-shot video editors.
Our approach utilizes a two-stage 3D Gaussian optimizing process tailored for editing dynamic monocular videos.
It enhances video editing by ensuring temporal consistency across 58 dynamic monocular videos.
arXiv Detail & Related papers (2024-06-04T17:57:37Z)
- Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework [33.46782517803435]
Make-Your-Anchor is a system requiring only a one-minute video clip of an individual for training.
We fine-tune a proposed structure-guided diffusion model on the input video to render 3D mesh conditions into human appearances.
A novel identity-specific face enhancement module is introduced to improve the visual quality of facial regions in the output videos.
arXiv Detail & Related papers (2024-03-25T07:54:18Z)
- Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation [35.52770785430601]
We propose a novel hybrid video diffusion model, called HVDM, which can capture intricate dependencies more effectively.
The HVDM is trained by a hybrid video autoencoder which extracts a disentangled representation of the video.
Our hybrid autoencoder provides a more comprehensive video latent, enriching the generated videos with fine structures and details.
arXiv Detail & Related papers (2024-02-21T11:46:16Z)
- Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z)