StyleTalk: One-shot Talking Head Generation with Controllable Speaking
Styles
- URL: http://arxiv.org/abs/2301.01081v2
- Date: Sat, 10 Jun 2023 14:37:49 GMT
- Title: StyleTalk: One-shot Talking Head Generation with Controllable Speaking
Styles
- Authors: Yifeng Ma, Suzhen Wang, Zhipeng Hu, Changjie Fan, Tangjie Lv, Yu Ding,
Zhidong Deng and Xin Yu
- Abstract summary: We propose a one-shot style-controllable talking face generation framework.
We aim to attain a speaking style from an arbitrary reference speaking video.
We then drive the one-shot portrait to speak with the reference speaking style and another piece of audio.
- Score: 43.12918949398099
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Different people speak with diverse personalized speaking styles. Although
existing one-shot talking head methods have made significant progress in lip
sync, natural facial expressions, and stable head motions, they still cannot
generate diverse speaking styles in the final talking head videos. To tackle
this problem, we propose a one-shot style-controllable talking face generation
framework. In a nutshell, we aim to attain a speaking style from an arbitrary
reference speaking video and then drive the one-shot portrait to speak with the
reference speaking style and another piece of audio. Specifically, we first
develop a style encoder to extract dynamic facial motion patterns of a style
reference video and then encode them into a style code. Afterward, we introduce
a style-controllable decoder to synthesize stylized facial animations from the
speech content and style code. In order to integrate the reference speaking
style into generated videos, we design a style-aware adaptive transformer,
which enables the encoded style code to adjust the weights of the feed-forward
layers accordingly. Thanks to the style-aware adaptation mechanism, the
reference speaking style can be better embedded into synthesized videos during
decoding. Extensive experiments demonstrate that our method is capable of
generating talking head videos with diverse speaking styles from only one
portrait image and an audio clip while achieving authentic visual effects.
Project Page: https://github.com/FuxiVirtualHuman/styletalk.
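The key mechanism described in the abstract is the style-aware adaptive transformer, in which the extracted style code adjusts the weights of the decoder's feed-forward layers. The sketch below shows one possible way such style-conditioned feed-forward weights could look in PyTorch; it is not the authors' released implementation, and the class name StyleAwareFeedForward, the weight-bank formulation, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StyleAwareFeedForward(nn.Module):
    """Hypothetical feed-forward block whose weights are modulated by a style code.

    Illustrative sketch only: the style code predicts soft mixing coefficients
    over a small bank of candidate weight matrices, so different speaking styles
    select different effective feed-forward parameters.
    """

    def __init__(self, d_model=256, d_hidden=1024, d_style=256, n_banks=8):
        super().__init__()
        # Bank of candidate feed-forward weights (n_banks parallel FFNs).
        self.w1 = nn.Parameter(torch.randn(n_banks, d_model, d_hidden) * 0.02)
        self.b1 = nn.Parameter(torch.zeros(n_banks, d_hidden))
        self.w2 = nn.Parameter(torch.randn(n_banks, d_hidden, d_model) * 0.02)
        self.b2 = nn.Parameter(torch.zeros(n_banks, d_model))
        # Maps the style code to mixing coefficients over the bank.
        self.to_coeff = nn.Linear(d_style, n_banks)

    def forward(self, x, style_code):
        # x: (batch, seq_len, d_model), style_code: (batch, d_style)
        coeff = F.softmax(self.to_coeff(style_code), dim=-1)   # (batch, n_banks)
        # Style-specific weights as convex combinations of the bank.
        w1 = torch.einsum("bk,kio->bio", coeff, self.w1)       # (batch, d_model, d_hidden)
        b1 = coeff @ self.b1                                   # (batch, d_hidden)
        w2 = torch.einsum("bk,kio->bio", coeff, self.w2)       # (batch, d_hidden, d_model)
        b2 = coeff @ self.b2                                   # (batch, d_model)
        h = F.relu(torch.einsum("bsi,bio->bso", x, w1) + b1.unsqueeze(1))
        return torch.einsum("bsi,bio->bso", h, w2) + b2.unsqueeze(1)


if __name__ == "__main__":
    ffn = StyleAwareFeedForward()
    audio_features = torch.randn(2, 64, 256)  # dummy per-frame speech content features
    style_code = torch.randn(2, 256)          # dummy style code from a reference clip
    out = ffn(audio_features, style_code)     # (2, 64, 256)
```

In this reading, each clip's style code selects a convex combination of candidate feed-forward weights, so inputs driven by different reference styles pass through effectively different decoder parameters while sharing the same backbone.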
Related papers
- StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads [46.749597670092484]
Existing one-shot talking head methods fail to produce diverse speaking styles in the final videos.
We propose a one-shot style-controllable talking face generation method that can obtain speaking styles from reference videos.
Our method generates visually authentic talking head videos with diverse speaking styles from only one portrait image and an audio clip.
arXiv Detail & Related papers (2024-09-14T03:49:38Z)
- Style-Preserving Lip Sync via Audio-Aware Style Reference [88.02195932723744]
Individuals exhibit distinct lip shapes when speaking the same utterance, owing to their unique speaking styles.
We develop an advanced Transformer-based model adept at predicting lip motion corresponding to the input audio, augmented by the style information aggregated through cross-attention layers from style reference video.
Experiments validate the efficacy of the proposed approach in achieving precise lip sync, preserving speaking styles, and generating high-fidelity, realistic talking face videos.
arXiv Detail & Related papers (2024-08-10T02:46:11Z)
- Say Anything with Any Style [9.50806457742173]
Say Anything with Any Style queries the discrete style representation via a generative model with a learned style codebook.
Our approach surpasses state-of-the-art methods in terms of both lip synchronization and stylized expression.
arXiv Detail & Related papers (2024-03-11T01:20:03Z)
- Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation [41.489700112318864]
Speech-driven 3D facial animation aims to synthesize vivid facial animations that accurately synchronize with speech and match the unique speaking style.
We introduce an innovative speaking style disentanglement method, which enables arbitrary-subject speaking style encoding.
We also propose a novel framework called Mimic to learn disentangled representations of the speaking style and content from facial motions.
arXiv Detail & Related papers (2023-12-18T01:49:42Z)
- Personalized Speech-driven Expressive 3D Facial Animation Synthesis with Style Control [1.8540152959438578]
A realistic facial animation system should consider such identity-specific speaking styles and facial idiosyncrasies to achieve a high degree of naturalness and plausibility.
We present a speech-driven expressive 3D facial animation synthesis framework that models identity-specific facial motion as latent representations (called styles).
Our framework is trained in an end-to-end fashion and has a non-autoregressive encoder-decoder architecture with three main components.
arXiv Detail & Related papers (2023-10-25T21:22:28Z)
- AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation [49.4220768835379]
AdaMesh is a novel adaptive speech-driven facial animation approach.
It learns the personalized talking style from a reference video of about 10 seconds.
It generates vivid facial expressions and head poses.
arXiv Detail & Related papers (2023-10-11T06:56:08Z)
- Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences [49.66987347397398]
Few-Shot Stylized Visual Captioning aims to generate captions in any desired style, using only a few examples as guidance during inference.
We propose a framework called FS-StyleCap for this task, which utilizes a conditional encoder-decoder language model and a visual projection module.
arXiv Detail & Related papers (2023-07-31T04:26:01Z)
- StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation [47.06075725469252]
StyleTalker is an audio-driven talking head generation model.
It can synthesize a video of a talking person from a single reference image.
Our model is able to synthesize talking head videos with impressive perceptual quality.
arXiv Detail & Related papers (2022-08-23T12:49:01Z)
- Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis [17.650661515807993]
We propose to inject style into the talking face synthesis framework through imitating arbitrary talking style of the particular reference video.
We devise a latent-style-fusion (LSF) model to synthesize stylized talking faces by imitating talking styles from the style codes.
arXiv Detail & Related papers (2021-10-30T08:15:27Z)