Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis
- URL: http://arxiv.org/abs/2111.00203v1
- Date: Sat, 30 Oct 2021 08:15:27 GMT
- Title: Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis
- Authors: Haozhe Wu, Jia Jia, Haoyu Wang, Yishun Dou, Chao Duan, Qingshan Deng
- Abstract summary: We propose to inject style into the talking face synthesis framework through imitating arbitrary talking style of the particular reference video.
We devise a latent-style-fusion (LSF) model to synthesize stylized talking faces by imitating talking styles from the style codes.
- Score: 17.650661515807993
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: People talk with diversified styles. For one piece of speech, different
talking styles exhibit significant differences in the facial and head pose
movements. For example, the "excited" style usually talks with the mouth wide
open, while the "solemn" style is more standardized and seldom exhibits
exaggerated motions. Due to such large differences between styles, it
is necessary to incorporate the talking style into audio-driven talking face
synthesis framework. In this paper, we propose to inject style into the talking
face synthesis framework through imitating arbitrary talking style of the
particular reference video. Specifically, we systematically investigate talking
styles with our collected Ted-HD dataset and construct style codes as
several statistics of 3D morphable model (3DMM) parameters. Afterwards, we
devise a latent-style-fusion (LSF) model to synthesize stylized talking faces
by imitating talking styles from the style codes. We emphasize the following
novel characteristics of our framework: (1) It does not require any style
annotation; the talking style is learned in an unsupervised manner from
talking videos in the wild. (2) It can imitate arbitrary styles from arbitrary
videos, and the style codes can also be interpolated to generate new styles.
Extensive experiments demonstrate that the proposed framework has the ability
to synthesize more natural and expressive talking styles compared with baseline
methods.
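The abstract describes style codes built from statistics of 3DMM parameters, with interpolation between codes yielding new styles, but does not specify which statistics are used. A minimal illustrative sketch of that idea follows; the choice of per-dimension mean and standard deviation, the function names, and the parameter dimensions are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def style_code(params_3dmm: np.ndarray) -> np.ndarray:
    """Summarize a reference clip's 3DMM parameter sequence
    (frames x dims) as a style code of per-dimension statistics.
    Mean captures the typical pose/expression; std captures how
    exaggerated the motion is (e.g. "excited" vs "solemn")."""
    mean = params_3dmm.mean(axis=0)
    std = params_3dmm.std(axis=0)
    return np.concatenate([mean, std])

def interpolate_styles(code_a: np.ndarray, code_b: np.ndarray,
                       alpha: float = 0.5) -> np.ndarray:
    """Linearly blend two style codes to produce a new style."""
    return (1 - alpha) * code_a + alpha * code_b

# Toy example: two "reference videos" with different motion ranges.
calm = np.random.default_rng(0).normal(0.0, 0.1, size=(100, 64))
excited = np.random.default_rng(1).normal(0.0, 0.5, size=(100, 64))
blended = interpolate_styles(style_code(calm), style_code(excited))
print(blended.shape)  # (128,) -- 64 means followed by 64 stds
```

In the paper's pipeline, such a code would condition the LSF model during synthesis; here it only demonstrates that codes built from summary statistics live in a space where linear interpolation is meaningful.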
Related papers
- Say Anything with Any Style [9.50806457742173]
Say Anything with Any Style queries the discrete style representation via a generative model with a learned style codebook.
Our approach surpasses state-of-the-art methods in terms of both lip synchronization and stylized expression.
arXiv Detail & Related papers (2024-03-11T01:20:03Z) - Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations [65.29513437838457]
Even if two conversation turns contain the same sentence, the appropriate responses can differ when the turns are spoken in different styles.
We propose Spoken-LLM framework that can model the linguistic content and the speaking styles.
We train Spoken-LLM using the StyleTalk dataset and devise a two-stage training pipeline to help the Spoken-LLM better learn the speaking styles.
arXiv Detail & Related papers (2024-02-20T07:51:43Z) - Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation [41.489700112318864]
Speech-driven 3D facial animation aims to synthesize vivid facial animations that accurately synchronize with speech and match the unique speaking style.
We introduce an innovative speaking style disentanglement method, which enables arbitrary-subject speaking style encoding.
We also propose a novel framework called Mimic to learn disentangled representations of the speaking style and content from facial motions.
arXiv Detail & Related papers (2023-12-18T01:49:42Z) - AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation [49.4220768835379]
AdaMesh is a novel adaptive speech-driven facial animation approach.
It learns the personalized talking style from a reference video of about 10 seconds.
It generates vivid facial expressions and head poses.
arXiv Detail & Related papers (2023-10-11T06:56:08Z) - DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models [24.401443462720135]
We propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder.
In particular, our style includes the generation of head poses, thereby enhancing user perception.
We address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset.
arXiv Detail & Related papers (2023-09-30T17:01:18Z) - Conversation Style Transfer using Few-Shot Learning [56.43383396058639]
In this paper, we introduce conversation style transfer as a few-shot learning problem.
We propose a novel in-context learning approach to solve the task with style-free dialogues as a pivot.
We show that conversation style transfer can also benefit downstream tasks.
arXiv Detail & Related papers (2023-02-16T15:27:00Z) - StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles [43.12918949398099]
We propose a one-shot style-controllable talking face generation framework.
We aim to attain a speaking style from an arbitrary reference speaking video.
We then drive the one-shot portrait to speak with the reference speaking style and another piece of audio.
arXiv Detail & Related papers (2023-01-03T13:16:24Z) - Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS [7.384726530165295]
Style control of synthetic speech is often restricted to discrete emotion categories.
We propose a text-based interface for emotional style control and cross-speaker style transfer in multi-speaker TTS.
arXiv Detail & Related papers (2022-07-13T07:05:44Z) - GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z) - Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis [59.27994987902646]
Learning spoken styles from historical conversations is still in its infancy.
Prior work considers only the transcripts of historical conversations, neglecting the spoken styles of the historical speech itself.
We propose a spoken style learning approach with multi-modal hierarchical context encoding.
arXiv Detail & Related papers (2021-06-11T08:33:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.