Flow Guided Transformable Bottleneck Networks for Motion Retargeting
- URL: http://arxiv.org/abs/2106.07771v1
- Date: Mon, 14 Jun 2021 21:58:30 GMT
- Title: Flow Guided Transformable Bottleneck Networks for Motion Retargeting
- Authors: Jian Ren, Menglei Chai, Oliver J. Woodford, Kyle Olszewski, Sergey
Tulyakov
- Abstract summary: Existing efforts leverage a long training video from each target person to train a subject-specific motion transfer model.
Few-shot motion transfer techniques, which only require one or a few images from a target, have recently drawn considerable attention.
Inspired by the Transformable Bottleneck Network, we propose an approach based on an implicit volumetric representation of the image content.
- Score: 29.16125343915916
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human motion retargeting aims to transfer the motion of one person in a
"driving" video or set of images to another person. Existing efforts leverage a
long training video from each target person to train a subject-specific motion
transfer model. However, the scalability of such methods is limited, as each
model can only generate videos for the given target subject, and such training
videos are labor-intensive to acquire and process. Few-shot motion transfer
techniques, which only require one or a few images from a target, have recently
drawn considerable attention. Methods addressing this task generally use either
2D or explicit 3D representations to transfer motion, and in doing so,
sacrifice either accurate geometric modeling or the flexibility of an
end-to-end learned representation. Inspired by the Transformable Bottleneck
Network, which renders novel views and manipulations of rigid objects, we
propose an approach based on an implicit volumetric representation of the image
content, which can then be spatially manipulated using volumetric flow fields.
We address the challenging question of how to aggregate information across
different body poses, learning flow fields that allow for combining content
from the appropriate regions of input images of highly non-rigid human subjects
performing complex motions into a single implicit volumetric representation.
This allows us to learn our 3D representation solely from videos of moving
people. Armed with both 3D object understanding and end-to-end learned
rendering, this categorically novel representation delivers state-of-the-art
image generation quality, as shown by our quantitative and qualitative
evaluations.
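To make the mechanism described in the abstract concrete, the sketch below shows, in PyTorch, how an implicit feature volume could be warped by a dense 3D flow field and how warped volumes from several source images could be fused before decoding. Everything here is an illustrative assumption (the tensor shapes, the hypothetical warp_volume and retarget helpers, the plain averaging used for aggregation, and the encoder/decoder callables); it is a minimal sketch, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def warp_volume(volume: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Trilinearly resample a feature volume along a dense 3D flow field.

    volume: (N, C, D, H, W) implicit feature volume produced by an encoder.
    flow:   (N, 3, D, H, W) per-voxel displacement in normalized [-1, 1] coords.
    """
    _, _, d, h, w = volume.shape
    # Base sampling grid in normalized coordinates; for 5-D inputs grid_sample
    # expects the last dimension ordered as (x, y, z).
    zs, ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, d),
        torch.linspace(-1.0, 1.0, h),
        torch.linspace(-1.0, 1.0, w),
        indexing="ij",
    )
    base = torch.stack((xs, ys, zs), dim=-1).to(volume)      # (D, H, W, 3)
    grid = base.unsqueeze(0) + flow.permute(0, 2, 3, 4, 1)   # (N, D, H, W, 3)
    return F.grid_sample(volume, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)


def retarget(images, flows, encoder, decoder):
    """Fuse several source images into one volume posed by the driving flows."""
    warped = [warp_volume(encoder(img), flw) for img, flw in zip(images, flows)]
    fused = torch.stack(warped).mean(dim=0)  # simple average as the aggregation
    return decoder(fused)                    # rendered image in the driving pose
```

The averaging step stands in for whatever learned aggregation the method actually uses; the point of the sketch is only that trilinear resampling of the bottleneck volume is what makes the representation spatially manipulable by volumetric flow fields.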
Related papers
- MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild [32.6521941706907]
We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos.
We first define a layered neural representation for the entire scene, composited by individual human and background models.
We learn the layered neural representation from videos via our layer-wise differentiable volume rendering.
arXiv Detail & Related papers (2024-06-03T17:59:57Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- MotionBERT: A Unified Perspective on Learning Human Motion Representations [46.67364057245364]
We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale and heterogeneous data resources.
We propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy partial 2D observations.
We implement the motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network.
arXiv Detail & Related papers (2022-10-12T19:46:25Z)
- Self-Supervised 3D Human Pose Estimation in Static Video Via Neural Rendering [5.568218439349004]
Inferring 3D human pose from 2D images is a challenging and long-standing problem in the field of computer vision.
We present preliminary results for a method to estimate 3D pose from 2D video containing a single person.
arXiv Detail & Related papers (2022-10-10T09:24:07Z)
- Neural Novel Actor: Learning a Generalized Animatable Neural Representation for Human Actors [98.24047528960406]
We propose a new method for learning a generalized animatable neural representation from a sparse set of multi-view imagery of multiple persons.
The learned representation can be used to synthesize novel view images of an arbitrary person from a sparse set of cameras, and further animate them with the user's pose control.
arXiv Detail & Related papers (2022-08-25T07:36:46Z)
- Action2video: Generating Videos of Human 3D Actions [31.665831044217363]
We aim to tackle the interesting yet challenging problem of generating videos of diverse and natural human motions from prescribed action categories.
The key issue lies in the ability to synthesize multiple distinct motion sequences that are realistic in their visual appearances.
Action2motion generates plausible 3D pose sequences of a prescribed action category, which are then processed and rendered by motion2video to form 2D videos.
arXiv Detail & Related papers (2021-11-12T20:20:37Z)
- NeuralDiff: Segmenting 3D objects that move in egocentric videos [92.95176458079047]
We study the problem of decomposing the observed 3D scene into a static background and a dynamic foreground.
This task is reminiscent of the classic background subtraction problem, but is significantly harder because all parts of the scene, static and dynamic, generate a large apparent motion.
In particular, we consider egocentric videos and further separate the dynamic component into objects and the actor that observes and moves them.
arXiv Detail & Related papers (2021-10-19T12:51:35Z)
- On Development and Evaluation of Retargeting Human Motion and Appearance in Monocular Videos [2.870762512009438]
Transferring human motion and appearance between videos of human actors remains one of the key challenges in Computer Vision.
We propose a novel, high-performance approach based on a hybrid image-based rendering technique that exhibits competitive visual quality.
We also present a new video benchmark dataset composed of different videos with annotated human motions to evaluate the task of synthesizing people's videos.
arXiv Detail & Related papers (2021-03-29T13:17:41Z)
- Self-Supervised Representation Learning from Flow Equivariance [97.13056332559526]
We present a new self-supervised representation learning framework that can be directly deployed on a video stream of complex scenes.
Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images.
arXiv Detail & Related papers (2021-01-16T23:44:09Z)
- Single-Shot Freestyle Dance Reenactment [89.91619150027265]
The task of motion transfer between a source dancer and a target person is a special case of the pose transfer problem.
We propose a novel method that can reanimate a single image by arbitrary video sequences, unseen during training.
arXiv Detail & Related papers (2020-12-02T12:57:43Z)
- Chained Representation Cycling: Learning to Estimate 3D Human Pose and Shape by Cycling Between Representations [73.11883464562895]
We propose a new architecture that facilitates unsupervised, or lightly supervised, learning.
We demonstrate the method by learning 3D human pose and shape from unpaired and unannotated images.
While we present results for modeling humans, our formulation is general and can be applied to other vision problems.
arXiv Detail & Related papers (2020-01-06T14:54:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this automatically generated information and is not responsible for any consequences of its use.