TALK-Act: Enhance Textural-Awareness for 2D Speaking Avatar Reenactment with Diffusion Model
- URL: http://arxiv.org/abs/2410.10696v1
- Date: Mon, 14 Oct 2024 16:38:10 GMT
- Title: TALK-Act: Enhance Textural-Awareness for 2D Speaking Avatar Reenactment with Diffusion Model
- Authors: Jiazhi Guan, Quanwei Yang, Kaisiyuan Wang, Hang Zhou, Shengyi He, Zhiliang Xu, Haocheng Feng, Errui Ding, Jingdong Wang, Hongtao Xie, Youjian Zhao, Ziwei Liu
- Abstract summary: We propose the Motion-Enhanced Textural-Aware ModeLing for SpeaKing Avatar Reenactment (TALK-Act) framework.
Our key idea is to enhance the textural awareness with explicit motion guidance in diffusion modeling.
Our model can achieve high-fidelity 2D avatar reenactment with only 30 seconds of person-specific data.
- Score: 100.35665852159785
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, 2D speaking avatars have become increasingly common in everyday scenarios thanks to the rapid development of facial animation techniques. However, most existing works neglect explicit control of the human body. In this paper, we propose to drive not only the face but also the torso and gestures of a speaking figure. Inspired by recent advances in diffusion models, we propose the Motion-Enhanced Textural-Aware ModeLing for SpeaKing Avatar Reenactment (TALK-Act) framework, which enables high-fidelity avatar reenactment from only a short monocular video. Our key idea is to enhance textural awareness with explicit motion guidance in diffusion modeling. Specifically, we carefully construct 2D and 3D structural information as intermediate guidance. While recent diffusion models adopt a side network to inject control information, they fail to synthesize temporally stable results even with person-specific fine-tuning. We propose a Motion-Enhanced Textural Alignment module to strengthen the correspondence between the driving and target signals. Moreover, we build a Memory-based Hand-Recovering module to address the difficulty of preserving hand shapes. After pre-training, our model can achieve high-fidelity 2D avatar reenactment with only 30 seconds of person-specific data. Extensive experiments demonstrate the effectiveness and superiority of our proposed framework. Resources can be found at https://guanjz20.github.io/projects/TALK-Act.
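The abstract describes the mechanism only at a high level. As a rough illustration of the core idea (an encoder for explicit 2D/3D structural guidance plus an alignment step that ties driving-signal features to the target's appearance features before injection into the denoising UNet), here is a minimal PyTorch sketch; all module names, shapes, and the cross-attention formulation are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a ControlNet-style side branch that
# encodes explicit structural guidance (e.g. rendered 2D/3D pose maps) and a
# cross-attention block that aligns driving features with reference
# (target-appearance) features before they are injected into a denoising UNet.
import torch
import torch.nn as nn


class GuidanceEncoder(nn.Module):
    """Encodes stacked 2D/3D structural maps (e.g. skeleton + mesh renders)."""

    def __init__(self, in_channels: int = 6, feat_channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, stride=2, padding=1), nn.SiLU(),
        )

    def forward(self, guidance: torch.Tensor) -> torch.Tensor:
        return self.net(guidance)


class TexturalAlignment(nn.Module):
    """Cross-attention from driving-guidance features to reference-appearance
    features, so the injected control carries texture cues of the target."""

    def __init__(self, channels: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, drive_feat: torch.Tensor, ref_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = drive_feat.shape
        q = drive_feat.flatten(2).transpose(1, 2)   # (B, H*W, C)
        kv = ref_feat.flatten(2).transpose(1, 2)    # (B, H*W, C)
        aligned, _ = self.attn(self.norm(q), kv, kv)
        out = q + aligned                           # residual connection
        return out.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    guidance_maps = torch.randn(1, 6, 64, 64)   # e.g. 2D skeleton + 3D mesh render
    reference = torch.randn(1, 64, 32, 32)      # appearance features of the target person
    drive = GuidanceEncoder()(guidance_maps)    # (1, 64, 32, 32)
    control = TexturalAlignment()(drive, reference)
    print(control.shape)  # residual to add into the denoising UNet blocks
```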
Related papers
- AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation [60.5897687447003]
AvatarGO is a novel framework designed to generate realistic 4D HOI scenes from textual inputs.
Our framework not only generates coherent compositional motions, but also exhibits greater robustness in handling issues.
As the first attempt to synthesize 4D avatars with object interactions, we hope AvatarGO could open new doors for human-centric 4D content creation.
arXiv Detail & Related papers (2024-10-09T17:58:56Z)
- Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework [33.46782517803435]
Make-Your-Anchor is a system requiring only a one-minute video clip of an individual for training.
We fine-tune a structure-guided diffusion model on the input video to render 3D mesh conditions into human appearances.
A novel identity-specific face enhancement module is introduced to improve the visual quality of facial regions in the output videos.
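The identity-specific face enhancement module is described only at a high level; the sketch below illustrates one plausible reading (crop the face region, refine it with a small person-specific network, and paste it back), with the box, network, and blending all assumed for illustration rather than taken from the paper.

```python
# Rough sketch (assumptions throughout): a face-region enhancement pass.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FaceEnhancer(nn.Module):
    """Tiny residual refiner fine-tuned on the target identity (placeholder)."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, face_crop: torch.Tensor) -> torch.Tensor:
        return (face_crop + self.net(face_crop)).clamp(0, 1)


def enhance_face(frame: torch.Tensor, box: tuple[int, int, int, int],
                 enhancer: FaceEnhancer) -> torch.Tensor:
    """frame: (1, 3, H, W) in [0, 1]; box: (top, left, height, width)."""
    t, l, h, w = box
    crop = frame[:, :, t:t + h, l:l + w]
    refined = enhancer(F.interpolate(crop, size=(256, 256), mode="bilinear",
                                     align_corners=False))
    refined = F.interpolate(refined, size=(h, w), mode="bilinear", align_corners=False)
    out = frame.clone()
    out[:, :, t:t + h, l:l + w] = refined  # a soft-edged mask would avoid seams
    return out


if __name__ == "__main__":
    frame = torch.rand(1, 3, 512, 512)
    result = enhance_face(frame, (100, 180, 160, 160), FaceEnhancer())
    print(result.shape)
```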
arXiv Detail & Related papers (2024-03-25T07:54:18Z)
- Synthesizing Moving People with 3D Control [88.68284137105654]
We present a diffusion model-based framework for animating people from a single image for a given target 3D motion sequence.
For the first part, we learn an in-filling diffusion model to hallucinate unseen parts of a person given a single image.
Second, we develop a diffusion-based rendering pipeline, which is controlled by 3D human poses.
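A minimal sketch of this two-stage data flow, under heavy assumptions: both classes below are placeholders standing in for the in-filling diffusion model and the pose-conditioned rendering diffusion, and only the orchestration is meant to be illustrative.

```python
# Two-stage sketch: stage 1 completes unseen regions of a single source image,
# stage 2 renders frames conditioned on a target 3D pose sequence.
import torch


class InfillModel:
    """Placeholder for the in-filling diffusion model (stage 1)."""

    def complete(self, image: torch.Tensor, visibility_mask: torch.Tensor) -> torch.Tensor:
        # Real system: diffusion sampling that hallucinates occluded texture.
        return image * visibility_mask + 0.5 * (1 - visibility_mask)


class PoseRenderer:
    """Placeholder for the pose-conditioned rendering diffusion (stage 2)."""

    def render(self, appearance: torch.Tensor, pose_map: torch.Tensor) -> torch.Tensor:
        # Real system: denoising conditioned on appearance + rendered 3D pose.
        return appearance  # identity mapping keeps the sketch runnable


def animate(image: torch.Tensor, visibility_mask: torch.Tensor,
            pose_sequence: list[torch.Tensor]) -> list[torch.Tensor]:
    appearance = InfillModel().complete(image, visibility_mask)               # stage 1
    return [PoseRenderer().render(appearance, p) for p in pose_sequence]      # stage 2


if __name__ == "__main__":
    img = torch.rand(1, 3, 256, 256)
    mask = (torch.rand(1, 1, 256, 256) > 0.2).float()
    poses = [torch.rand(1, 3, 256, 256) for _ in range(4)]
    frames = animate(img, mask, poses)
    print(len(frames), frames[0].shape)
```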
arXiv Detail & Related papers (2024-01-19T18:59:11Z)
- AnimateZero: Video Diffusion Models are Zero-Shot Image Animators [63.938509879469024]
We propose AnimateZero to unveil the pre-trained text-to-video diffusion model, i.e., AnimateDiff.
For appearance control, we borrow intermediate latents and their features from the text-to-image (T2I) generation.
For temporal control, we replace the global temporal attention of the original T2V model with our proposed positional-corrected window attention.
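A rough sketch of the temporal-control idea, assuming a simple interpretation in which each frame attends only to a local window of preceding frames instead of to all frames; the positional-correction detail is omitted, and this is not the AnimateZero implementation.

```python
# Windowed temporal attention: each frame's tokens attend only to a local
# window of preceding frames (contrast with global temporal attention).
import torch
import torch.nn as nn


class WindowedTemporalAttention(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4, window: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * spatial locations, frames, dim) -- temporal tokens
        outs = []
        for t in range(x.shape[1]):
            start = max(0, t - self.window + 1)
            q = x[:, t:t + 1]          # current frame attends ...
            kv = x[:, start:t + 1]     # ... only to its local temporal window
            out, _ = self.attn(q, kv, kv)
            outs.append(out)
        return torch.cat(outs, dim=1)


if __name__ == "__main__":
    tokens = torch.randn(8, 16, 64)    # 8 spatial locations, 16 frames, dim 64
    print(WindowedTemporalAttention()(tokens).shape)
```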
arXiv Detail & Related papers (2023-12-06T13:39:35Z)
- FAAC: Facial Animation Generation with Anchor Frame and Conditional Control for Superior Fidelity and Editability [14.896554342627551]
We introduce a facial animation generation method that enhances both face identity fidelity and editing capabilities.
This approach incorporates the concept of an anchor frame to counteract the degradation of generative ability in original text-to-image models.
Our method's efficacy has been validated on multiple representative DreamBooth and LoRA models.
arXiv Detail & Related papers (2023-12-06T02:55:35Z)
- Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model [57.855362366674264]
We propose Dancing Avatar, designed to fabricate human motion videos driven by poses and textual cues.
Our approach employs a pretrained T2I diffusion model to generate each video frame in an autoregressive fashion.
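A short sketch of such an autoregressive loop, with the generator interface assumed for illustration; the actual system uses a pretrained text-to-image diffusion model conditioned on pose and text.

```python
# Autoregressive frame generation: the previous frame is fed back into the
# (placeholder) generator to keep appearance consistent across frames.
import torch


def generate_video(generator, prompt: str, pose_maps: list[torch.Tensor]) -> list[torch.Tensor]:
    frames: list[torch.Tensor] = []
    previous = None
    for pose in pose_maps:
        frame = generator(prompt=prompt, pose=pose, previous_frame=previous)
        frames.append(frame)
        previous = frame          # autoregressive feedback for temporal coherence
    return frames


if __name__ == "__main__":
    def dummy_generator(prompt, pose, previous_frame):
        base = pose if previous_frame is None else 0.5 * (pose + previous_frame)
        return base.clamp(0, 1)

    poses = [torch.rand(1, 3, 256, 256) for _ in range(5)]
    video = generate_video(dummy_generator, "a person dancing", poses)
    print(len(video), video[0].shape)
```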
arXiv Detail & Related papers (2023-08-15T13:00:42Z)
- AvatarFusion: Zero-shot Generation of Clothing-Decoupled 3D Avatars Using 2D Diffusion [34.609403685504944]
We present AvatarFusion, a framework for zero-shot text-to-avatar generation.
We use a latent diffusion model to provide pixel-level guidance for generating human-realistic avatars.
We also introduce a novel optimization method, called Pixel-Semantics Difference-Sampling (PS-DS), which semantically separates the generation of body and clothes.
arXiv Detail & Related papers (2023-07-13T02:19:56Z)
- AvatarBooth: High-Quality and Customizable 3D Human Avatar Generation [14.062402203105712]
AvatarBooth is a novel method for generating high-quality 3D avatars using text prompts or specific images.
Our key contribution is the precise avatar generation control by using dual fine-tuned diffusion models.
We present a multi-resolution rendering strategy that facilitates coarse-to-fine supervision of 3D avatar generation.
arXiv Detail & Related papers (2023-06-16T14:18:51Z)
- AvatarStudio: Text-driven Editing of 3D Dynamic Human Head Avatars [84.85009267371218]
We propose AvatarStudio, a text-based method for editing the appearance of a dynamic full head avatar.
Our approach builds on existing work that captures dynamic performances of human heads with a neural radiance field (NeRF), and edits this representation with a text-to-image diffusion model.
Our method edits the full head in a canonical space, and then propagates these edits to remaining time steps via a pretrained deformation network.
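The propagation step can be pictured as warping an edited canonical image with per-frame deformations; the sketch below uses a plain bilinear warp as a stand-in for the pretrained deformation network, which the abstract does not detail.

```python
# Canonical-space edit propagation: apply the edit once, then warp it to each
# time step with a per-frame sampling grid (stand-in for a deformation network).
import torch
import torch.nn.functional as F


def propagate_edit(edited_canonical: torch.Tensor,
                   deformation_grids: list[torch.Tensor]) -> list[torch.Tensor]:
    """edited_canonical: (1, 3, H, W); each grid: (1, H, W, 2) in [-1, 1]."""
    return [
        F.grid_sample(edited_canonical, grid, mode="bilinear",
                      padding_mode="border", align_corners=False)
        for grid in deformation_grids
    ]


if __name__ == "__main__":
    canonical = torch.rand(1, 3, 128, 128)
    # Identity grids here; a real system would predict one grid per time step.
    base = F.affine_grid(torch.eye(2, 3).unsqueeze(0), size=(1, 3, 128, 128),
                         align_corners=False)
    frames = propagate_edit(canonical, [base.clone() for _ in range(3)])
    print(len(frames), frames[0].shape)
```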
arXiv Detail & Related papers (2023-06-01T11:06:01Z)