Robust Pose Transfer with Dynamic Details using Neural Video Rendering
- URL: http://arxiv.org/abs/2106.14132v3
- Date: Mon, 8 May 2023 14:59:47 GMT
- Title: Robust Pose Transfer with Dynamic Details using Neural Video Rendering
- Authors: Yang-tian Sun, Hao-zhi Huang, Xuan Wang, Yu-kun Lai, Wei Liu, Lin Gao
- Abstract summary: We propose a neural video rendering framework coupled with an image-translation-based dynamic details generation network (D2G-Net).
To be specific, a novel texture representation is presented to encode both the static and pose-varying appearance characteristics.
We demonstrate that our neural human video renderer is capable of achieving both clearer dynamic details and more robust performance even on short videos with only 2k - 4k frames.
- Score: 48.48929344349387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pose transfer of human videos aims to generate a high fidelity video of a
target person imitating actions of a source person. A few studies have made
great progress either through image translation with deep latent features or
neural rendering with explicit 3D features. However, both of them rely on large
amounts of training data to generate realistic results, and the performance
degrades on more accessible internet videos due to insufficient training
frames. In this paper, we demonstrate that the dynamic details can be preserved
even trained from short monocular videos. Overall, we propose a neural video
rendering framework coupled with an image-translation-based dynamic details
generation network (D2G-Net), which fully utilizes both the stability of
explicit 3D features and the capacity of learning components. To be specific, a
novel texture representation is presented to encode both the static and
pose-varying appearance characteristics, which is then mapped to the image
space and rendered as a detail-rich frame in the neural rendering stage.
Moreover, we introduce a concise temporal loss in the training stage to
suppress the detail flickering that is made more visible due to high-quality
dynamic details generated by our method. Through extensive comparisons, we
demonstrate that our neural human video renderer is capable of achieving both
clearer dynamic details and more robust performance even on accessible short
videos with only 2k - 4k frames.
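The abstract describes the method only at a high level; the following is a minimal, PyTorch-style sketch of the main ingredients it names: a texture that combines a static learned component with a pose-varying component, an image-translation renderer that maps sampled texture features to an RGB frame, and a concise temporal loss that penalizes frame-to-frame flicker. All module names, shapes, and the simple frame-difference loss are illustrative assumptions, not the authors' released D2G-Net code.

```python
# Minimal PyTorch-style sketch of the ideas described in the abstract.
# Module names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicTexture(nn.Module):
    """Texture with a static (learned) part and a pose-varying part."""

    def __init__(self, channels=16, size=256, pose_dim=72):
        super().__init__()
        # Static appearance: a learnable neural texture shared by all frames.
        self.static_tex = nn.Parameter(torch.randn(1, channels, size, size) * 0.01)
        # Pose-varying appearance: a small decoder conditioned on the pose vector.
        self.dynamic_dec = nn.Sequential(
            nn.Linear(pose_dim, 8 * 8 * channels), nn.ReLU(),
        )
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=size // 8, mode="bilinear", align_corners=False),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, pose):
        b = pose.shape[0]
        dyn = self.dynamic_dec(pose).view(b, -1, 8, 8)
        dyn = self.up(dyn)
        # Blend the static and the pose-varying components.
        return self.static_tex.expand(b, -1, -1, -1) + dyn


class NeuralRenderer(nn.Module):
    """Image-translation network: sampled texture features -> RGB frame."""

    def __init__(self, channels=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, sampled_tex):
        return self.net(sampled_tex)


def sample_texture(texture, uv):
    """Sample the texture at per-pixel UV coordinates (uv in [-1, 1])."""
    return F.grid_sample(texture, uv, align_corners=False)


def temporal_loss(frame_t, frame_t1):
    """Plain frame-difference term standing in for the paper's concise
    temporal loss; the exact formulation in the paper may differ."""
    return (frame_t - frame_t1).abs().mean()


if __name__ == "__main__":
    tex, render = DynamicTexture(), NeuralRenderer()
    pose = torch.randn(2, 72)                 # two consecutive pose vectors
    uv = torch.rand(2, 128, 128, 2) * 2 - 1   # stand-in UV maps
    frames = render(sample_texture(tex(pose), uv))
    loss = temporal_loss(frames[0], frames[1])
    print(frames.shape, loss.item())
```

In the paper's pipeline the UV coordinates would come from a body model fitted to the driving pose; here they are random stand-ins only so the sketch runs end to end.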
Related papers
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with 8-16 videos on a single GPU.
Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation.
To capture features along the temporal dimension, we expand the pretrained 2D convolution layers of the T2I model into our novel temporal-spatial motion learning layers (see the inflation sketch after this list).
arXiv Detail & Related papers (2023-10-16T19:03:19Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- Text-To-4D Dynamic Scene Generation [111.89517759596345]
We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions.
Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency.
The dynamic video output generated from the provided text can be viewed from any camera location and angle, and can be composited into any 3D environment.
arXiv Detail & Related papers (2023-01-26T18:14:32Z)
- Neural Feature Fusion Fields: 3D Distillation of Self-Supervised 2D Image Representations [92.88108411154255]
We present a method that improves dense 2D image feature extractors when the latter are applied to the analysis of multiple images reconstructible as a 3D scene.
We show that our method not only enables semantic understanding in the context of scene-specific neural fields without the use of manual labels, but also consistently improves over the self-supervised 2D baselines.
arXiv Detail & Related papers (2022-09-07T23:24:09Z)
- Flow Guided Transformable Bottleneck Networks for Motion Retargeting [29.16125343915916]
Existing efforts leverage a long training video from each target person to train a subject-specific motion transfer model.
Few-shot motion transfer techniques, which only require one or a few images from a target, have recently drawn considerable attention.
Inspired by the Transformable Bottleneck Network, we propose an approach based on an implicit volumetric representation of the image content.
arXiv Detail & Related papers (2021-06-14T21:58:30Z)
- Neural 3D Video Synthesis [18.116032726623608]
We propose a novel approach for 3D video synthesis that is able to represent multi-view video recordings of a dynamic real-world scene.
Our approach takes the high quality and compactness of static neural radiance fields in a new direction: to a model-free, dynamic setting.
We demonstrate that our method can render high-fidelity wide-angle novel views at over 1K resolution, even for highly complex and dynamic scenes.
arXiv Detail & Related papers (2021-03-03T18:47:40Z)
- Vid2Actor: Free-viewpoint Animatable Person Synthesis from Video in the Wild [22.881898195409885]
Given an "in-the-wild" video of a person, we reconstruct an animatable model of the person in the video.
The output model can be rendered in any body pose to any camera view, via the learned controls, without explicit 3D mesh reconstruction.
arXiv Detail & Related papers (2020-12-23T18:50:42Z)
- Neural Human Video Rendering by Learning Dynamic Textures and Rendering-to-Video Translation [99.64565200170897]
We propose a novel human video synthesis method by explicitly disentangling the learning of time-coherent fine-scale details from the embedding of the human in 2D screen space.
We show several applications of our approach, such as human reenactment and novel view synthesis from monocular video, where we show significant improvement over the state of the art both qualitatively and quantitatively.
arXiv Detail & Related papers (2020-01-14T18:06:27Z)
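The LAMP entry above mentions expanding pretrained 2D convolutions of a text-to-image model into temporal-spatial motion learning layers. A common way to realize this kind of inflation is a pseudo-3D block that keeps the pretrained spatial convolution and adds a 1D temporal convolution initialized near identity, so the inflated model initially behaves like the original image model. The sketch below illustrates that general pattern; the class name, kernel size, and initialization are assumptions and this is not LAMP's released code.

```python
# Hedged sketch of inflating a pretrained 2D conv into a temporal-spatial block,
# in the spirit of the LAMP summary above; names and init choices are assumptions.
import torch
import torch.nn as nn


class TemporalSpatialConv(nn.Module):
    """Pretrained spatial 2D conv + new 1D temporal conv over the frame axis."""

    def __init__(self, spatial_conv: nn.Conv2d, kernel_t: int = 3):
        super().__init__()
        self.spatial = spatial_conv  # reuse pretrained T2I weights unchanged
        c = spatial_conv.out_channels
        self.temporal = nn.Conv1d(c, c, kernel_t, padding=kernel_t // 2)
        # Initialize the temporal conv to identity so the inflated model
        # starts out behaving like the original image model.
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x):
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        y = self.spatial(x.reshape(b * t, c, h, w))              # per-frame spatial conv
        _, c2, h2, w2 = y.shape
        y = y.reshape(b, t, c2, h2, w2).permute(0, 3, 4, 2, 1)   # -> (b, h, w, c, t)
        y = self.temporal(y.reshape(b * h2 * w2, c2, t))         # mix along time
        y = y.reshape(b, h2, w2, c2, t).permute(0, 4, 3, 1, 2)   # -> (b, t, c, h, w)
        return y


if __name__ == "__main__":
    pretrained = nn.Conv2d(4, 8, 3, padding=1)   # stands in for a T2I conv layer
    block = TemporalSpatialConv(pretrained)
    video = torch.randn(2, 5, 4, 32, 32)         # (batch, frames, C, H, W)
    print(block(video).shape)                    # torch.Size([2, 5, 8, 32, 32])
```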
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.