Playmate: Flexible Control of Portrait Animation via 3D-Implicit Space Guided Diffusion
- URL: http://arxiv.org/abs/2502.07203v1
- Date: Tue, 11 Feb 2025 02:53:48 GMT
- Title: Playmate: Flexible Control of Portrait Animation via 3D-Implicit Space Guided Diffusion
- Authors: Xingpei Ma, Jiaran Cai, Yuansheng Guan, Shenneng Huang, Qiang Zhang, Shunsi Zhang
- Abstract summary: Playmate, a two-stage training framework, is proposed to generate more lifelike facial expressions and talking faces.
In the first stage, we introduce a decoupled implicit 3D representation to facilitate more accurate attribute disentanglement.
In the second stage, we introduce an emotion-control module to encode emotion control information into the latent space.
- Score: 6.677873152109559
- Abstract: Recent diffusion-based talking face generation models have demonstrated impressive potential in synthesizing videos that accurately match a speech audio clip with a given reference identity. However, existing approaches still encounter significant challenges due to uncontrollable factors, such as inaccurate lip-sync, inappropriate head posture, and the lack of fine-grained control over facial expressions. In order to introduce more face-guided conditions beyond speech audio clips, a novel two-stage training framework, Playmate, is proposed to generate more lifelike facial expressions and talking faces. In the first stage, we introduce a decoupled implicit 3D representation along with a meticulously designed motion-decoupled module to facilitate more accurate attribute disentanglement and generate expressive talking videos directly from audio cues. Then, in the second stage, we introduce an emotion-control module to encode emotion control information into the latent space, enabling fine-grained control over emotions and thereby achieving the ability to generate talking videos with the desired emotion. Extensive experiments demonstrate that Playmate outperforms existing state-of-the-art methods in terms of video quality and lip-synchronization, and improves flexibility in controlling emotion and head pose. The code will be available at https://playmate111.github.io.
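To make the two-stage design concrete, below is a minimal sketch of how the inference flow could look. The module names (MotionDecoupledEncoder, EmotionControlModule), dimensions, and interfaces are assumptions for illustration only; the abstract does not specify the actual architecture, so this is not the authors' implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of Playmate's two-stage flow, inferred from the abstract.
# Module names, tensor shapes, and interfaces are assumptions, not the paper's code.

class MotionDecoupledEncoder(nn.Module):
    """Stage 1 (assumed): map per-frame audio features to a disentangled
    implicit-3D motion latent (expression/pose separated from identity)."""
    def __init__(self, audio_dim=768, motion_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, 512), nn.GELU(), nn.Linear(512, motion_dim)
        )

    def forward(self, audio_feats):      # (B, T, audio_dim)
        return self.net(audio_feats)     # (B, T, motion_dim)

class EmotionControlModule(nn.Module):
    """Stage 2 (assumed): encode an emotion label into the latent space and
    inject it into the motion latent for fine-grained emotion control."""
    def __init__(self, motion_dim=256, num_emotions=8):
        super().__init__()
        self.emotion_emb = nn.Embedding(num_emotions, motion_dim)

    def forward(self, motion_latent, emotion_id, intensity=1.0):
        emo = self.emotion_emb(emotion_id)                  # (B, motion_dim)
        return motion_latent + intensity * emo.unsqueeze(1)

# Toy usage: audio features in, emotion-conditioned motion latents out.
audio_feats = torch.randn(1, 100, 768)                      # e.g. 100 frames of audio features
motion = MotionDecoupledEncoder()(audio_feats)
motion = EmotionControlModule()(motion, torch.tensor([3]), intensity=0.8)
# A diffusion-based generator would then synthesize video frames from `motion`
# and the reference identity; that stage is not sketched here.
```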
Related papers
- EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion [49.55774551366049]
Diffusion models have revolutionized the field of talking head generation, yet still face challenges in expressiveness, controllability, and stability in long-time generation.
We propose an EmotiveTalk framework to address these issues.
Experimental results show that EmotiveTalk can generate expressive talking head videos, ensuring the promised controllability of emotions and stability during long-time generation.
arXiv Detail & Related papers (2024-11-23T04:38:51Z)
- EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion [5.954758598327494]
EMOdiffhead is a novel method for emotional talking head video generation.
It enables fine-grained control of emotion categories and intensities.
It achieves state-of-the-art performance compared to other emotion portrait animation methods.
arXiv Detail & Related papers (2024-09-11T13:23:22Z)
- DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation [14.07086606183356]
Speech-driven 3D facial animation has garnered significant attention thanks to its broad range of applications.
Current methods fail to capture the nuanced emotional undertones conveyed through speech and produce monotonous facial motion.
We introduce DEEPTalk, a novel approach that generates diverse and emotionally rich 3D facial expressions directly from speech inputs.
arXiv Detail & Related papers (2024-08-12T08:56:49Z)
- EmoFace: Audio-driven Emotional 3D Face Animation [3.573880705052592]
EmoFace is a novel audio-driven methodology for creating facial animations with vivid emotional dynamics.
Our approach can generate facial expressions with multiple emotions, and has the ability to generate random yet natural blinks and eye movements.
Our proposed methodology can be applied to produce dialogue animations for non-playable characters in video games and to drive avatars in virtual reality environments.
arXiv Detail & Related papers (2024-07-17T11:32:16Z)
- SPEAK: Speech-Driven Pose and Emotion-Adjustable Talking Head Generation [13.459396544300137]
We propose a novel one-shot Talking Head Generation framework (SPEAK) that distinguishes itself from general Talking Face Generation methods.
We introduce an Inter-Reconstructed Feature Disentanglement (IRFD) module to decouple facial features into three latent spaces.
We then design a face editing module that maps speech content and facial latent codes into a single latent space.
arXiv Detail & Related papers (2024-05-12T11:41:44Z)
- DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation [75.90730434449874]
We introduce DREAM-Talk, a two-stage diffusion-based audio-driven framework, tailored for generating diverse expressions and accurate lip-sync concurrently.
Given the strong correlation between lip motion and audio, we then refine the dynamics with enhanced lip-sync accuracy using audio features and emotion style.
Both quantitatively and qualitatively, DREAM-Talk outperforms state-of-the-art methods in terms of expressiveness, lip-sync accuracy and perceptual quality.
arXiv Detail & Related papers (2023-12-21T05:03:18Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- Emotional Speech-Driven Animation with Content-Emotion Disentanglement [51.34635009347183]
We propose EMOTE, which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion.
EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data.
arXiv Detail & Related papers (2023-06-15T09:31:31Z)
- Learning to Dub Movies via Hierarchical Prosody Models [167.6465354313349]
Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker voice as reference.
We propose a novel movie dubbing architecture to tackle these problems via hierarchical prosody modelling, which bridges the visual information to corresponding speech prosody from three aspects: lip, face, and scene.
arXiv Detail & Related papers (2022-12-08T03:29:04Z)
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)