Related papers: StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads

URL: http://arxiv.org/abs/2409.09292v1
Date: Sat, 14 Sep 2024 03:49:38 GMT
Title: StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads
Authors: Suzhen Wang, Yifeng Ma, Yu Ding, Zhipeng Hu, Changjie Fan, Tangjie Lv, Zhidong Deng, Xin Yu
Abstract summary: Existing one-shot talking head methods fail to produce diverse speaking styles in the final videos. We propose a one-shot style-controllable talking face generation method that can obtain speaking styles from reference videos. Our method generates visually authentic talking head videos with diverse speaking styles from only one portrait image and an audio clip.
Score: 46.749597670092484
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Individuals have unique facial expression and head pose styles that reflect their personalized speaking styles. Existing one-shot talking head methods cannot capture such personalized characteristics and therefore fail to produce diverse speaking styles in the final videos. To address this challenge, we propose a one-shot style-controllable talking face generation method that can obtain speaking styles from reference speaking videos and drive the one-shot portrait to speak with the reference speaking styles and another piece of audio. Our method aims to synthesize the style-controllable coefficients of a 3D Morphable Model (3DMM), including facial expressions and head movements, in a unified framework. Specifically, the proposed framework first leverages a style encoder to extract the desired speaking styles from the reference videos and transform them into style codes. Then, the framework uses a style-aware decoder to synthesize the coefficients of 3DMM from the audio input and style codes. During decoding, our framework adopts a two-branch architecture, which generates the stylized facial expression coefficients and stylized head movement coefficients, respectively. After obtaining the coefficients of 3DMM, an image renderer renders the expression coefficients into a specific person's talking-head video. Extensive experiments demonstrate that our method generates visually authentic talking head videos with diverse speaking styles from only one portrait image and an audio clip.
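
The data flow described in the abstract (style encoder, then a two-branch style-aware decoder) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the dimensions, the random untrained weights, the choice to feed the style encoder with 3DMM coefficients from the reference video, and the function names `encode_style` and `decode` are all assumptions made for illustration, and the image renderer is omitted.

```python
import numpy as np

# All dimensions are hypothetical, chosen only to make the shapes concrete.
EXP_DIM, POSE_DIM, AUDIO_DIM, STYLE_DIM = 64, 6, 80, 16

rng = np.random.default_rng(0)


def linear(in_dim, out_dim):
    """Random (untrained) weight matrix standing in for a learned layer."""
    return rng.standard_normal((in_dim, out_dim)) * 0.1


W_style = linear(EXP_DIM + POSE_DIM, STYLE_DIM)  # style encoder (assumed form)
W_exp = linear(AUDIO_DIM + STYLE_DIM, EXP_DIM)    # expression branch
W_pose = linear(AUDIO_DIM + STYLE_DIM, POSE_DIM)  # head-movement branch


def encode_style(ref_coeffs):
    """Pool per-frame reference coefficients (T_ref, EXP_DIM + POSE_DIM) into one style code."""
    return np.tanh(ref_coeffs @ W_style).mean(axis=0)


def decode(audio_feats, style_code):
    """Two-branch decoder: per-frame expression and head-pose coefficients."""
    num_frames = audio_feats.shape[0]
    # Condition every audio frame on the same style code.
    x = np.concatenate([audio_feats, np.tile(style_code, (num_frames, 1))], axis=1)
    return x @ W_exp, x @ W_pose


# Placeholder inputs: 50 reference frames and 120 audio frames of random data.
ref = rng.standard_normal((50, EXP_DIM + POSE_DIM))
audio = rng.standard_normal((120, AUDIO_DIM))

style = encode_style(ref)
expression, head_pose = decode(audio, style)
print(style.shape, expression.shape, head_pose.shape)
```

In this sketch, one style code is extracted per reference clip and shared across all output frames, and the two branches differ only in their output dimensionality. A real system would replace the single matrix products with trained networks and pass the resulting coefficients to a renderer.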

Related papers

DiTalker: A Unified DiT-based Framework for High-Quality and Speaking Styles Controllable Portrait Animation [13.089363781114477]
DiTalker is a unified DiT-based framework for speaking style-controllable portrait animation. We introduce an Audio-Style Fusion Module that decouples audio and speaking styles via two parallel cross-attention layers. Experiments demonstrate the superiority of DiTalker in terms of lip synchronization and speaking style controllability.
arXiv Detail & Related papers (2025-07-29T08:23:56Z)
GaussianSpeech: Audio-Driven Gaussian Avatars [76.10163891172192]
We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio. We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details.
arXiv Detail & Related papers (2024-11-27T18:54:08Z)
MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes [74.82911268630463]
Talking face generation (TFG) aims to animate a target identity's face to create realistic talking videos. MimicTalk exploits the rich knowledge from a NeRF-based person-agnostic generic model for improving the efficiency and robustness of personalized TFG. Experiments show that our MimicTalk surpasses previous baselines regarding video quality, efficiency, and expressiveness.
arXiv Detail & Related papers (2024-10-09T10:12:37Z)
Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation [41.489700112318864]
Speech-driven 3D facial animation aims to synthesize vivid facial animations that accurately synchronize with speech and match the unique speaking style. We introduce an innovative speaking style disentanglement method, which enables arbitrary-subject speaking style encoding. We also propose a novel framework called Mimic to learn disentangled representations of the speaking style and content from facial motions.
arXiv Detail & Related papers (2023-12-18T01:49:42Z)
Personalized Speech-driven Expressive 3D Facial Animation Synthesis with Style Control [1.8540152959438578]
A realistic facial animation system should consider such identity-specific speaking styles and facial idiosyncrasies to achieve a high degree of naturalness and plausibility. We present a speech-driven expressive 3D facial animation synthesis framework that models identity-specific facial motion as latent representations (called styles). Our framework is trained in an end-to-end fashion and has a non-autoregressive encoder-decoder architecture with three main components.
arXiv Detail & Related papers (2023-10-25T21:22:28Z)
AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation [49.4220768835379]
AdaMesh is a novel adaptive speech-driven facial animation approach. It learns the personalized talking style from a reference video of about 10 seconds. It generates vivid facial expressions and head poses.
arXiv Detail & Related papers (2023-10-11T06:56:08Z)
DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models [24.401443462720135]
We propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder. In particular, our style includes the generation of head poses, thereby enhancing user perception. We address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset.
arXiv Detail & Related papers (2023-09-30T17:01:18Z)
Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences [49.66987347397398]
Few-Shot Stylized Visual Captioning aims to generate captions in any desired style, using only a few examples as guidance during inference. We propose a framework called FS-StyleCap for this task, which utilizes a conditional encoder-decoder language model and a visual projection module.
arXiv Detail & Related papers (2023-07-31T04:26:01Z)
StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles [43.12918949398099]
We propose a one-shot style-controllable talking face generation framework. We aim to attain a speaking style from an arbitrary reference speaking video. We then drive the one-shot portrait to speak with the reference speaking style and another piece of audio.
arXiv Detail & Related papers (2023-01-03T13:16:24Z)
Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis [17.650661515807993]
We propose to inject style into the talking face synthesis framework by imitating the arbitrary talking style of a particular reference video. We devise a latent-style-fusion (LSF) model to synthesize stylized talking faces by imitating talking styles from the style codes.
arXiv Detail & Related papers (2021-10-30T08:15:27Z)
MakeItTalk: Speaker-Aware Talking-Head Animation [49.77977246535329]
We present a method that generates expressive talking heads from a single facial image with audio as the only input. Based on this intermediate representation, our method is able to synthesize photorealistic videos of entire talking heads with full range of motion.
arXiv Detail & Related papers (2020-04-27T17:56:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.