Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face
Synthesis
- URL: http://arxiv.org/abs/2111.00203v1
- Date: Sat, 30 Oct 2021 08:15:27 GMT
- Title: Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face
Synthesis
- Authors: Haozhe Wu, Jia Jia, Haoyu Wang, Yishun Dou, Chao Duan, Qingshan Deng
- Abstract summary: We propose to inject style into the talking face synthesis framework by imitating the talking style of an arbitrary reference video.
We devise a latent-style-fusion (LSF) model to synthesize stylized talking faces by imitating talking styles from the style codes.
- Score: 17.650661515807993
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: People talk with diversified styles. For one piece of speech, different
talking styles exhibit significant differences in the facial and head pose
movements. For example, the "excited" style usually talks with the mouth wide
open, while the "solemn" style is more standardized and seldom exhibits
exaggerated motions. Because of such large differences between styles, it
is necessary to incorporate the talking style into the audio-driven talking face
synthesis framework. In this paper, we propose to inject style into the talking
face synthesis framework by imitating the talking style of an arbitrary
reference video. Specifically, we systematically investigate talking
styles with our collected Ted-HD dataset and construct style codes as
several statistics of 3D morphable model (3DMM) parameters. Afterwards, we
devise a latent-style-fusion (LSF) model to synthesize stylized talking faces
by imitating talking styles from the style codes. We emphasize the following
novel characteristics of our framework: (1) It does not require any annotation
of the style; the talking style is learned in an unsupervised manner from
talking videos in the wild. (2) It can imitate arbitrary styles from arbitrary
videos, and the style codes can also be interpolated to generate new styles.
Extensive experiments demonstrate that the proposed framework has the ability
to synthesize more natural and expressive talking styles compared with baseline
methods.
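The abstract describes the style code as several statistics of 3D morphable model (3DMM) parameters extracted from a reference video, and notes that style codes can be interpolated to create new styles. The minimal Python sketch below illustrates that idea under stated assumptions: it uses per-dimension mean and standard deviation as the statistics, and the parameter dimensions and function names are hypothetical placeholders, since the paper's exact choices are not given here.

```python
import numpy as np

def style_code_from_3dmm(expr_params: np.ndarray, pose_params: np.ndarray) -> np.ndarray:
    """Build a style code from per-frame 3DMM parameters of a reference video.

    expr_params: (T, D_expr) expression coefficients per frame.
    pose_params: (T, D_pose) head-pose parameters per frame.

    Assumed statistics: per-dimension mean and standard deviation of the
    concatenated parameters; the paper only states that "several statistics"
    of the 3DMM parameters are used.
    """
    feats = np.concatenate([expr_params, pose_params], axis=1)       # (T, D_expr + D_pose)
    return np.concatenate([feats.mean(axis=0), feats.std(axis=0)])   # (2 * (D_expr + D_pose),)

def interpolate_styles(code_a: np.ndarray, code_b: np.ndarray, alpha: float) -> np.ndarray:
    """Linearly blend two style codes to obtain a new style (alpha in [0, 1])."""
    return (1.0 - alpha) * code_a + alpha * code_b

# Usage with dummy reference clips (150 frames, 64 expression dims, 6 pose dims):
code_a = style_code_from_3dmm(np.random.randn(150, 64), np.random.randn(150, 6))
code_b = style_code_from_3dmm(np.random.randn(150, 64), np.random.randn(150, 6))
blended = interpolate_styles(code_a, code_b, alpha=0.5)
```

In the full framework, such a code would condition the latent-style-fusion (LSF) model together with the driving audio; this sketch covers only the style-code side.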
Related papers
- MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes [74.82911268630463]
Talking face generation (TFG) aims to animate a target identity's face to create realistic talking videos.
MimicTalk exploits the rich knowledge from a NeRF-based person-agnostic generic model for improving the efficiency and robustness of personalized TFG.
Experiments show that our MimicTalk surpasses previous baselines regarding video quality, efficiency, and expressiveness.
arXiv Detail & Related papers (2024-10-09T10:12:37Z) - StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads [46.749597670092484]
Existing one-shot talking head methods fail to produce diverse speaking styles in the final videos.
We propose a one-shot style-controllable talking face generation method that can obtain speaking styles from reference videos.
Our method generates visually authentic talking head videos with diverse speaking styles from only one portrait image and an audio clip.
arXiv Detail & Related papers (2024-09-14T03:49:38Z) - SVP: Style-Enhanced Vivid Portrait Talking Head Diffusion Model [66.34929233269409]
Talking Head Generation (THG) is an important task with broad application prospects in various fields such as digital humans, film production, and virtual reality.
We propose a novel framework named Style-Enhanced Vivid Portrait (SVP) which fully leverages style-related information in THG.
Our model generates diverse, vivid, and high-quality videos with flexible control over intrinsic styles, outperforming existing state-of-the-art methods.
arXiv Detail & Related papers (2024-09-05T06:27:32Z) - Say Anything with Any Style [9.50806457742173]
Say Anything with Any Style queries the discrete style representation via a generative model with a learned style codebook.
Our approach surpasses state-of-the-art methods in terms of both lip-synchronization and stylized expression.
arXiv Detail & Related papers (2024-03-11T01:20:03Z) - Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial
Animation [41.489700112318864]
Speech-driven 3D facial animation aims to synthesize vivid facial animations that accurately synchronize with speech and match the unique speaking style.
We introduce an innovative speaking style disentanglement method, which enables arbitrary-subject speaking style encoding.
We also propose a novel framework called Mimic to learn disentangled representations of the speaking style and content from facial motions.
arXiv Detail & Related papers (2023-12-18T01:49:42Z) - AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation [49.4220768835379]
AdaMesh is a novel adaptive speech-driven facial animation approach.
It learns the personalized talking style from a reference video of about 10 seconds.
It generates vivid facial expressions and head poses.
arXiv Detail & Related papers (2023-10-11T06:56:08Z) - DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models [24.401443462720135]
We propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder.
In particular, our style includes the generation of head poses, thereby enhancing user perception.
We address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset.
arXiv Detail & Related papers (2023-09-30T17:01:18Z) - StyleTalk: One-shot Talking Head Generation with Controllable Speaking
Styles [43.12918949398099]
We propose a one-shot style-controllable talking face generation framework.
We aim to attain a speaking style from an arbitrary reference speaking video.
We then drive the one-shot portrait to speak with the reference speaking style and another piece of audio (a generic sketch of this reference-style pipeline follows this list).
arXiv Detail & Related papers (2023-01-03T13:16:24Z) - GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain
Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z) - Spoken Style Learning with Multi-modal Hierarchical Context Encoding for
Conversational Text-to-Speech Synthesis [59.27994987902646]
The study of learning spoken styles from historical conversations is still in its infancy.
Existing approaches consider only the transcripts of the historical conversations, neglecting the spoken styles in the historical speeches.
We propose a spoken style learning approach with multi-modal hierarchical context encoding.
arXiv Detail & Related papers (2021-06-11T08:33:52Z)
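Several entries above (StyleTalk, StyleTalk++, DiffPoseTalk) share a common pattern: encode a speaking style from a reference clip, then condition an audio-driven generator on that style embedding. The PyTorch sketch below illustrates only this generic data flow; the StyleEncoder and AudioDrivenGenerator modules, their dimensions, and the pooling choice are hypothetical placeholders rather than any specific paper's architecture.

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Hypothetical encoder: pools per-frame motion features into one style embedding."""
    def __init__(self, motion_dim: int = 70, style_dim: int = 128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(motion_dim, style_dim), nn.ReLU(), nn.Linear(style_dim, style_dim)
        )

    def forward(self, ref_motion: torch.Tensor) -> torch.Tensor:
        # ref_motion: (B, T, motion_dim) 3DMM-like parameters from the reference video.
        return self.proj(ref_motion).mean(dim=1)  # (B, style_dim), average-pooled over time

class AudioDrivenGenerator(nn.Module):
    """Hypothetical generator: predicts per-frame facial motion from audio plus style."""
    def __init__(self, audio_dim: int = 80, style_dim: int = 128, motion_dim: int = 70):
        super().__init__()
        self.rnn = nn.GRU(audio_dim + style_dim, 256, batch_first=True)
        self.head = nn.Linear(256, motion_dim)

    def forward(self, audio_feats: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T, audio_dim); style: (B, style_dim), repeated over time.
        style_seq = style.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
        hidden, _ = self.rnn(torch.cat([audio_feats, style_seq], dim=-1))
        return self.head(hidden)  # (B, T, motion_dim) stylized facial motion

# Usage with dummy tensors: one reference clip (150 frames) and one driving audio clip (200 frames).
style = StyleEncoder()(torch.randn(1, 150, 70))
motion = AudioDrivenGenerator()(torch.randn(1, 200, 80), style)
```

A diffusion-based variant such as DiffPoseTalk would replace the recurrent generator with an iterative denoiser, but the style embedding would still enter as a conditioning signal.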