Multimodal-driven Talking Face Generation via a Unified Diffusion-based
Generator
- URL: http://arxiv.org/abs/2305.02594v2
- Date: Tue, 9 May 2023 12:01:14 GMT
- Title: Multimodal-driven Talking Face Generation via a Unified Diffusion-based
Generator
- Authors: Chao Xu, Shaoting Zhu, Junwei Zhu, Tianxin Huang, Jiangning Zhang,
Ying Tai, Yong Liu
- Abstract summary: Multimodal-driven talking face generation refers to animating a portrait with a given pose, expression, and gaze transferred from a driving image or video, or estimated from text and audio.
Existing methods ignore the potential of the text modality, and their generators mainly follow a source-oriented feature rearrangement paradigm coupled with unstable GAN frameworks.
We derive a novel paradigm free of unstable seesaw-style optimization, resulting in simple, stable, and effective training and inference schemes.
- Score: 29.58245990622227
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal-driven talking face generation refers to animating a
portrait with a given pose, expression, and gaze transferred from a driving
image or video, or estimated from text and audio. However, existing methods
ignore the potential of the text modality, and their generators mainly follow
a source-oriented feature rearrangement paradigm coupled with unstable GAN
frameworks. In this work, we first represent the emotion with a text prompt,
which inherits rich semantics from CLIP and allows flexible and generalized
emotion control. We further reformulate these tasks as target-oriented
texture transfer and adopt Diffusion Models. More specifically, given a
textured face as the source and the rendered face projected from the desired
3DMM coefficients as the target, our proposed Texture-Geometry-aware
Diffusion Model (TGDM) decomposes the complex transfer problem into a
multi-conditional denoising process, where a Texture Attention-based module
accurately models the correspondences between the appearance and geometry
cues contained in the source and target conditions and incorporates extra
implicit information for high-fidelity talking face generation. Additionally,
TGDM can be gracefully tailored for face swapping. We derive a novel paradigm
free of unstable seesaw-style optimization, resulting in simple, stable, and
effective training and inference schemes. Extensive experiments demonstrate
the superiority of our method.
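To make the multi-conditional denoising idea concrete, here is a minimal, hypothetical PyTorch sketch (not the authors' implementation): a toy noise predictor takes the noisy target, the face rendered from the desired 3DMM coefficients as the geometry condition, and the source portrait as the texture condition, and a cross-attention block injects source texture into the target features. All module and argument names are illustrative.

```python
# Hypothetical sketch of a multi-conditional denoising step in the spirit of TGDM.
import torch
import torch.nn as nn


class TextureCrossAttention(nn.Module):
    """Cross-attention: queries from the target/geometry branch, keys and values from the source texture."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, geo_tokens: torch.Tensor, tex_tokens: torch.Tensor) -> torch.Tensor:
        # geo_tokens, tex_tokens: (B, num_tokens, dim) flattened spatial features
        out, _ = self.attn(query=geo_tokens, key=tex_tokens, value=tex_tokens)
        return geo_tokens + out  # residual update: texture attended onto the target geometry


class TinyConditionalDenoiser(nn.Module):
    """Toy noise predictor conditioned on a rendered-geometry target and a source texture image."""

    def __init__(self, channels: int = 3, dim: int = 64):
        super().__init__()
        # The noisy target and the rendered geometry condition are concatenated channel-wise.
        self.encode_target = nn.Conv2d(2 * channels, dim, 3, padding=1)
        self.encode_source = nn.Conv2d(channels, dim, 3, padding=1)
        self.texture_attention = TextureCrossAttention(dim)
        self.decode = nn.Conv2d(dim, channels, 3, padding=1)

    def forward(self, noisy_target, rendered_geometry, source_texture, timestep):
        del timestep  # a real denoiser would embed the diffusion timestep; omitted in this sketch
        g = self.encode_target(torch.cat([noisy_target, rendered_geometry], dim=1))
        s = self.encode_source(source_texture)
        b, d, h, w = g.shape
        g_tokens = g.flatten(2).transpose(1, 2)   # (B, H*W, dim)
        s_tokens = s.flatten(2).transpose(1, 2)
        fused = self.texture_attention(g_tokens, s_tokens)
        fused = fused.transpose(1, 2).reshape(b, d, h, w)
        return self.decode(fused)                 # predicted noise for this denoising step


model = TinyConditionalDenoiser()
eps = model(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64), timestep=10)
print(eps.shape)  # torch.Size([1, 3, 64, 64])
```

In the actual model the Texture Attention-based module sits inside a full diffusion U-Net with timestep embeddings and a noise schedule; the sketch only isolates how the texture and geometry conditions can be fused.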
Related papers
- Large Body Language Models [1.9797215742507548]
We introduce Large Body Language Models (LBLMs) and present LBLM-AVA, a novel LBLM architecture that combines a Transformer-XL large language model with a parallelized diffusion model to generate human-like gestures from multimodal inputs (text, audio, and video).
LBLM-AVA achieves state-of-the-art performance in generating lifelike and contextually appropriate gestures, with a 30% reduction in Fréchet Gesture Distance (FGD) and a 25% improvement in Fréchet Inception Distance (FID) compared to existing approaches.
arXiv Detail & Related papers (2024-10-21T21:48:24Z)
- MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection [64.29452783056253]
The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia.
Although existing approaches mainly capture face forgery patterns using the image modality, other modalities such as fine-grained noise and text are not fully explored.
We propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities.
arXiv Detail & Related papers (2024-09-15T13:08:59Z)
- Controllable Talking Face Generation by Implicit Facial Keypoints Editing [6.036277153327655]
We present ControlTalk, a talking face generation method that controls facial expression deformation based on the driving audio.
Our experiments show that our method surpasses state-of-the-art performance on widely used benchmarks, including HDTF and MEAD.
arXiv Detail & Related papers (2024-06-05T02:54:46Z)
- Controllable Face Synthesis with Semantic Latent Diffusion Models [6.438244172631555]
We propose a Semantic Image Synthesis (SIS) framework based on a novel Latent Diffusion Model architecture for human face generation and editing.
The proposed system utilizes both SPADE normalization and cross-attention layers to merge shape and style information and, by doing so, allows precise control over each semantic part of the human face.
arXiv Detail & Related papers (2024-03-19T14:02:13Z)
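As a rough illustration of the SPADE normalization mentioned in the entry above, the sketch below shows the usual spatially-adaptive modulation: activations are normalized with parameter-free statistics, then scaled and shifted by per-pixel gamma and beta maps predicted from the semantic layout. This is a generic SPADE-style block, not the cited paper's code; the face-parsing layout and channel sizes are assumptions.

```python
# Minimal SPADE-style block: per-pixel scale/shift predicted from a semantic layout.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SPADEBlock(nn.Module):
    def __init__(self, channels: int, label_channels: int, hidden: int = 64):
        super().__init__()
        self.param_free_norm = nn.BatchNorm2d(channels, affine=False)
        self.shared = nn.Sequential(nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x: torch.Tensor, segmap: torch.Tensor) -> torch.Tensor:
        # Resize the semantic layout to the feature resolution, then modulate the normalized features.
        seg = F.interpolate(segmap, size=x.shape[2:], mode="nearest")
        h = self.shared(seg)
        return self.param_free_norm(x) * (1 + self.gamma(h)) + self.beta(h)


feats = torch.randn(2, 128, 32, 32)      # image features
layout = torch.randn(2, 19, 256, 256)    # placeholder for a 19-class face parsing map
out = SPADEBlock(128, 19)(feats, layout)
print(out.shape)                         # torch.Size([2, 128, 32, 32])
```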
- DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation [72.85685916829321]
DiffSHEG is a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation of arbitrary length.
By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
arXiv Detail & Related papers (2024-01-09T11:38:18Z)
- GaFET: Learning Geometry-aware Facial Expression Translation from In-The-Wild Images [55.431697263581626]
We introduce a novel Geometry-aware Facial Expression Translation framework, which is based on parametric 3D facial representations and can stably decouple expression.
We achieve higher-quality and more accurate facial expression transfer results compared to state-of-the-art methods, and demonstrate applicability to various poses and complex textures.
arXiv Detail & Related papers (2023-08-07T09:03:35Z)
- Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models [62.603753097900466]
We present a novel energy-based model (EBM) framework for adaptive context control by modeling the posterior of context vectors.
Specifically, we first formulate EBMs of latent image representations and text embeddings in each cross-attention layer of the denoising autoencoder.
Our latent EBMs further allow zero-shot compositional generation as a linear combination of cross-attention outputs from different contexts.
arXiv Detail & Related papers (2023-06-16T14:30:41Z)
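The compositional behaviour described above (combining cross-attention outputs from different contexts) can be illustrated with a small, hypothetical sketch; the energy-based posterior update of the actual method is not reproduced here, only the linear combination of per-context attention outputs:

```python
# Toy illustration: run cross-attention once per text context and blend the outputs.
import torch
import torch.nn as nn

dim, heads = 64, 4
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

image_tokens = torch.randn(1, 256, dim)                   # latent image representation (queries)
contexts = [torch.randn(1, 77, dim) for _ in range(2)]    # two text embeddings, e.g. two concepts
weights = [0.6, 0.4]                                      # mixing coefficients for the composition

# Cross-attention output for each context separately.
outputs = [attn(image_tokens, ctx, ctx)[0] for ctx in contexts]

# Zero-shot composition as a weighted linear combination of the per-context outputs.
composed = sum(w * o for w, o in zip(weights, outputs))
print(composed.shape)  # torch.Size([1, 256, 64])
```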
- Controlling Text-to-Image Diffusion by Orthogonal Finetuning [74.21549380288631]
We introduce a principled finetuning method, Orthogonal Finetuning (OFT), for adapting text-to-image diffusion models to downstream tasks.
Unlike existing methods, OFT can provably preserve the hyperspherical energy that characterizes the pairwise neuron relationships on the unit hypersphere.
We empirically show that our OFT framework outperforms existing methods in generation quality and convergence speed.
arXiv Detail & Related papers (2023-06-12T17:59:23Z)
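The claim that orthogonal finetuning preserves the pairwise neuron relationship can be checked numerically: an orthogonal transform leaves all inner products, and hence all pairwise angles (and the hyperspherical energy defined on them), unchanged. The sketch below uses a Cayley parameterization of the orthogonal matrix; it illustrates the property only and is not the paper's block-diagonal OFT implementation.

```python
# Numerically check that an orthogonal transform preserves pairwise neuron angles.
import torch

torch.manual_seed(0)
out_dim, in_dim = 8, 16
W = torch.randn(out_dim, in_dim, dtype=torch.float64)   # pretrained weight; rows are neurons

# Cayley transform: R = (I + Q)(I - Q)^{-1} is orthogonal whenever Q is skew-symmetric.
A = torch.randn(in_dim, in_dim, dtype=torch.float64)
Q = A - A.T                                             # skew-symmetric
I = torch.eye(in_dim, dtype=torch.float64)
R = (I + Q) @ torch.linalg.inv(I - Q)

W_ft = W @ R                                            # "finetuned" weight: every neuron rotated jointly


def cosine_matrix(M: torch.Tensor) -> torch.Tensor:
    M = M / M.norm(dim=1, keepdim=True)
    return M @ M.T                                      # pairwise cosine similarities between neurons


# The pairwise angles (and hence the hyperspherical energy) are unchanged.
print(torch.allclose(cosine_matrix(W), cosine_matrix(W_ft)))  # True
```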
- One-Shot High-Fidelity Talking-Head Synthesis with Deformable Neural Radiance Field [81.07651217942679]
Talking head generation aims to generate faces that maintain the identity information of the source image and imitate the motion of the driving image.
We propose HiDe-NeRF, which achieves high-fidelity and free-view talking-head synthesis.
arXiv Detail & Related papers (2023-04-11T09:47:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.