Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis
- URL: http://arxiv.org/abs/2111.00203v1
- Date: Sat, 30 Oct 2021 08:15:27 GMT
- Title: Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis
- Authors: Haozhe Wu, Jia Jia, Haoyu Wang, Yishun Dou, Chao Duan, Qingshan Deng
- Abstract summary: We propose to inject style into the talking face synthesis framework through imitating arbitrary talking style of the particular reference video.
We devise a latent-style-fusion (LSF) model to synthesize stylized talking faces by imitating talking styles from the style codes.
- Score: 17.650661515807993
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: People talk with diversified styles. For one piece of speech, different
talking styles exhibit significant differences in the facial and head pose
movements. For example, the "excited" style usually talks with the mouth wide
open, while the "solemn" style is more standardized and seldom exhibits
exaggerated motions. Due to such large differences between styles, it
is necessary to incorporate the talking style into audio-driven talking face
synthesis framework. In this paper, we propose to inject style into the talking
face synthesis framework through imitating arbitrary talking style of the
particular reference video. Specifically, we systematically investigate talking
styles with our collected Ted-HD dataset and construct style codes as
several statistics of 3D morphable model (3DMM) parameters. Afterwards, we
devise a latent-style-fusion (LSF) model to synthesize stylized talking faces
by imitating talking styles from the style codes. We emphasize the following
novel characteristics of our framework: (1) It does not require any style
annotation; the talking style is learned in an unsupervised manner from
talking videos in the wild. (2) It can imitate arbitrary styles from arbitrary
videos, and the style codes can also be interpolated to generate new styles.
Extensive experiments demonstrate that the proposed framework has the ability
to synthesize more natural and expressive talking styles compared with baseline
methods.
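The abstract describes style codes built from statistics of 3DMM parameters, with interpolation between codes yielding new styles, but does not specify which statistics are used. A minimal illustrative sketch of that idea follows; the choice of per-dimension mean and standard deviation, the function names, and the parameter dimensions are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def style_code(params_3dmm: np.ndarray) -> np.ndarray:
    """Summarize a reference clip's 3DMM parameter sequence
    (frames x dims) as a style code of per-dimension statistics.
    Mean captures the typical pose/expression; std captures how
    exaggerated the motion is (e.g. "excited" vs "solemn")."""
    mean = params_3dmm.mean(axis=0)
    std = params_3dmm.std(axis=0)
    return np.concatenate([mean, std])

def interpolate_styles(code_a: np.ndarray, code_b: np.ndarray,
                       alpha: float = 0.5) -> np.ndarray:
    """Linearly blend two style codes to produce a new style."""
    return (1 - alpha) * code_a + alpha * code_b

# Toy example: two "reference videos" with different motion ranges.
calm = np.random.default_rng(0).normal(0.0, 0.1, size=(100, 64))
excited = np.random.default_rng(1).normal(0.0, 0.5, size=(100, 64))
blended = interpolate_styles(style_code(calm), style_code(excited))
print(blended.shape)  # (128,) -- 64 means followed by 64 stds
```

In the paper's pipeline, such a code would condition the LSF model during synthesis; here it only demonstrates that codes built from summary statistics live in a space where linear interpolation is meaningful.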
Related papers
- Say Anything with Any Style [9.50806457742173]
Say Anything with Any Style queries the discrete style representation via a generative model with a learned style codebook.
Our approach surpasses state-of-the-art methods in terms of both lip synchronization and stylized expression.
arXiv Detail & Related papers (2024-03-11T01:20:03Z) - Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations [65.29513437838457]
Even if two conversation turns contain the same sentence, the appropriate responses can differ when the turns are spoken in different styles.
We propose Spoken-LLM framework that can model the linguistic content and the speaking styles.
We train Spoken-LLM using the StyleTalk dataset and devise a two-stage training pipeline to help the Spoken-LLM better learn the speaking styles.
arXiv Detail & Related papers (2024-02-20T07:51:43Z) - Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation [41.489700112318864]
Speech-driven 3D facial animation aims to synthesize vivid facial animations that accurately synchronize with speech and match the unique speaking style.
We introduce an innovative speaking style disentanglement method, which enables arbitrary-subject speaking style encoding.
We also propose a novel framework called Mimic to learn disentangled representations of the speaking style and content from facial motions.
arXiv Detail & Related papers (2023-12-18T01:49:42Z) - AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation [49.4220768835379]
AdaMesh is a novel adaptive speech-driven facial animation approach.
It learns the personalized talking style from a reference video of about 10 seconds.
It generates vivid facial expressions and head poses.
arXiv Detail & Related papers (2023-10-11T06:56:08Z) - DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models [24.401443462720135]
We propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder.
In particular, our style includes the generation of head poses, thereby enhancing user perception.
We address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset.
arXiv Detail & Related papers (2023-09-30T17:01:18Z) - Conversation Style Transfer using Few-Shot Learning [56.43383396058639]
In this paper, we introduce conversation style transfer as a few-shot learning problem.
We propose a novel in-context learning approach to solve the task with style-free dialogues as a pivot.
We show that conversation style transfer can also benefit downstream tasks.
arXiv Detail & Related papers (2023-02-16T15:27:00Z) - StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles [43.12918949398099]
We propose a one-shot style-controllable talking face generation framework.
We aim to attain a speaking style from an arbitrary reference speaking video.
We then drive the one-shot portrait to speak with the reference speaking style and another piece of audio.
arXiv Detail & Related papers (2023-01-03T13:16:24Z) - Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS [7.384726530165295]
Style control of synthetic speech is often restricted to discrete emotion categories.
We propose a text-based interface for emotional style control and cross-speaker style transfer in multi-speaker TTS.
arXiv Detail & Related papers (2022-07-13T07:05:44Z) - GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z) - Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis [59.27994987902646]
Learning spoken styles from historical conversations is still in its infancy.
Prior work considers only the transcripts of historical conversations, neglecting the spoken styles of the historical speech itself.
We propose a spoken style learning approach with multi-modal hierarchical context encoding.
arXiv Detail & Related papers (2021-06-11T08:33:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.