ToonTalker: Cross-Domain Face Reenactment
- URL: http://arxiv.org/abs/2308.12866v1
- Date: Thu, 24 Aug 2023 15:43:14 GMT
- Title: ToonTalker: Cross-Domain Face Reenactment
- Authors: Yuan Gong, Yong Zhang, Xiaodong Cun, Fei Yin, Yanbo Fan, Xuan Wang,
Baoyuan Wu, Yujiu Yang
- Abstract summary: Cross-domain face reenactment involves driving a cartoon image with the video of a real person and vice versa.
Recently, many works have focused on one-shot talking face generation to drive a portrait with a real video.
We propose a transformer-based framework to align the motions from different domains into a common latent space.
- Score: 80.52472147553333
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We target cross-domain face reenactment in this paper, i.e., driving a
cartoon image with the video of a real person and vice versa. Recently, many
works have focused on one-shot talking face generation to drive a portrait with
a real video, i.e., within-domain reenactment. Straightforwardly applying those
methods to cross-domain animation causes inaccurate expression transfer,
blurring, and even obvious artifacts due to the domain shift between cartoon
and real faces. Only a few works attempt to address cross-domain face
reenactment. The most closely related work, AnimeCeleb, requires constructing a
dataset of pose-vector and cartoon-image pairs by animating 3D characters,
making it inapplicable when no paired data is available. In this paper, we
propose a novel method for cross-domain reenactment without paired data.
Specifically, we propose a transformer-based framework to align the motions
from different domains into a common latent space where motion transfer is
conducted via latent code addition. Two domain-specific motion encoders and two
learnable motion base memories are used to capture domain properties. A source
query transformer and a driving query transformer project the domain-specific
motions into the canonical space, and the edited motion is projected back to
the source domain with another transformer. Moreover, since no paired data is
available, we propose a novel cross-domain training scheme that uses data from
both domains with a designed analogy constraint. In addition, we contribute a
Disney-style cartoon dataset. Extensive evaluations demonstrate the superiority of
our method over competing methods.
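Below is a minimal PyTorch-style sketch of the alignment scheme described in the abstract, under stated assumptions: the module names, dimensions, and interfaces (DomainMotionEncoder, QueryTransformer, CrossDomainReenactor) are hypothetical and not the authors' released code. It illustrates the two domain-specific motion encoders, learnable motion base memories inside the source and driving query transformers that project motions into a shared canonical space, motion transfer via latent code addition, and a back-projection transformer.
```python
# Hypothetical sketch of the cross-domain motion alignment described in the
# abstract. Module names, dimensions, and the combination rule are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class DomainMotionEncoder(nn.Module):
    """Domain-specific motion encoder (one for cartoon faces, one for real faces)."""

    def __init__(self, in_dim: int = 512, motion_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, motion_dim),
            nn.ReLU(),
            nn.Linear(motion_dim, motion_dim),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.net(feat)  # (B, motion_dim)


class QueryTransformer(nn.Module):
    """Projects a domain-specific motion code into the shared canonical space by
    attending over a learnable motion base memory."""

    def __init__(self, motion_dim: int = 256, num_bases: int = 20, num_layers: int = 2):
        super().__init__()
        # Learnable motion base memory capturing domain properties.
        self.bases = nn.Parameter(torch.randn(num_bases, motion_dim))
        layer = nn.TransformerDecoderLayer(d_model=motion_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, motion: torch.Tensor) -> torch.Tensor:
        query = motion.unsqueeze(1)                                       # (B, 1, D)
        memory = self.bases.unsqueeze(0).expand(motion.size(0), -1, -1)   # (B, K, D)
        return self.decoder(query, memory).squeeze(1)                     # canonical code


class CrossDomainReenactor(nn.Module):
    """Aligns motions from two domains and transfers them via latent code addition."""

    def __init__(self, feat_dim: int = 512, motion_dim: int = 256):
        super().__init__()
        self.enc_cartoon = DomainMotionEncoder(feat_dim, motion_dim)
        self.enc_real = DomainMotionEncoder(feat_dim, motion_dim)
        self.source_qt = QueryTransformer(motion_dim)   # source query transformer
        self.driving_qt = QueryTransformer(motion_dim)  # driving query transformer
        self.back_proj = QueryTransformer(motion_dim)   # back-projection to source domain

    def forward(self, src_feat, drv_feat, src_is_cartoon: bool = True):
        src_enc = self.enc_cartoon if src_is_cartoon else self.enc_real
        drv_enc = self.enc_real if src_is_cartoon else self.enc_cartoon
        src_canonical = self.source_qt(src_enc(src_feat))
        drv_canonical = self.driving_qt(drv_enc(drv_feat))
        # Motion transfer by latent code addition in the canonical space.
        edited = src_canonical + drv_canonical
        # Project the edited motion back to the source domain; a generator
        # (not sketched here) would render the reenacted frame from this code.
        return self.back_proj(edited)


# Example: drive a cartoon source with real-face driving features.
model = CrossDomainReenactor()
motion_code = model(torch.randn(2, 512), torch.randn(2, 512))  # -> (2, 256)
```
Training on unpaired data from the two domains would add the analogy constraint mentioned in the abstract as an extra loss term; it is omitted from this sketch.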
Related papers
- Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion [9.134743677331517]
We propose using a pre-trained image-to-video model to disentangle appearance from motion.
Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input.
By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity.
Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks.
arXiv Detail & Related papers (2024-08-01T10:55:20Z) - Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z) - AnimateZoo: Zero-shot Video Generation of Cross-Species Animation via Subject Alignment [64.02822911038848]
We present AnimateZoo, a zero-shot diffusion-based video generator to produce animal animations.
The key technique used in our AnimateZoo is subject alignment, which includes two steps.
Our model is capable of generating videos characterized by accurate movements, consistent appearance, and high-fidelity frames.
arXiv Detail & Related papers (2024-04-07T12:57:41Z) - Pose-to-Motion: Cross-Domain Motion Retargeting with Pose Prior [48.104051952928465]
Current learning-based motion synthesis methods depend on extensive motion datasets.
In contrast, pose data is more accessible, since posed characters are easier to create and can even be extracted from images.
Our method generates plausible motions for characters that have only pose data by transferring motion from an existing motion capture dataset of another character.
arXiv Detail & Related papers (2023-10-31T08:13:00Z) - Expression Domain Translation Network for Cross-domain Head Reenactment [35.42539568449744]
Cross-domain head reenactment aims to transfer human motions to domains outside the human, including cartoon characters.
Previous work introduced a large-scale anime dataset called AnimeCeleb and a cross-domain head reenactment model.
We introduce a novel expression domain translation network that transforms human expressions into anime expressions.
arXiv Detail & Related papers (2023-10-16T05:14:54Z) - Motion Transformer for Unsupervised Image Animation [37.35527776043379]
Image animation aims to animate a source image by using motion learned from a driving video.
Current state-of-the-art methods typically use convolutional neural networks (CNNs) to predict motion information.
We propose a new method, the motion transformer, which is the first attempt to build a motion estimator based on a vision transformer.
arXiv Detail & Related papers (2022-09-28T12:04:58Z) - JOKR: Joint Keypoint Representation for Unsupervised Cross-Domain Motion
Retargeting [53.28477676794658]
Unsupervised motion retargeting in videos has seen substantial advancements through the use of deep neural networks.
We introduce JOKR - a JOint Keypoint Representation that handles both the source and target videos, without requiring any object prior or data collection.
We evaluate our method both qualitatively and quantitatively, and demonstrate that our method handles various cross-domain scenarios, such as different animals, different flowers, and humans.
arXiv Detail & Related papers (2021-06-17T17:32:32Z) - Realistic Face Reenactment via Self-Supervised Disentangling of Identity
and Pose [23.211318473026243]
We propose a self-supervised hybrid model (DAE-GAN) that learns how to reenact faces naturally given large amounts of unlabeled videos.
Our approach combines two deforming autoencoders with the latest advances in conditional generation.
Experiment results demonstrate the superior quality of reenacted images and the flexibility of transferring facial movements between identities.
arXiv Detail & Related papers (2020-03-29T06:45:17Z) - First Order Motion Model for Image Animation [90.712718329677]
Image animation consists of generating a video sequence so that an object in a source image is animated according to the motion of a driving video.
Our framework addresses this problem without using any annotation or prior information about the specific object to animate.
arXiv Detail & Related papers (2020-02-29T07:08:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.