SVP: Style-Enhanced Vivid Portrait Talking Head Diffusion Model
- URL: http://arxiv.org/abs/2409.03270v1
- Date: Thu, 5 Sep 2024 06:27:32 GMT
- Title: SVP: Style-Enhanced Vivid Portrait Talking Head Diffusion Model
- Authors: Weipeng Tan, Chuming Lin, Chengming Xu, Xiaozhong Ji, Junwei Zhu, Chengjie Wang, Yanwei Fu
- Abstract summary: Talking Head Generation (THG) is an important task with broad application prospects in various fields such as digital humans, film production, and virtual reality.
We propose a novel framework named Style-Enhanced Vivid Portrait (SVP), which fully leverages style-related information in THG.
Our model generates diverse, vivid, and high-quality videos with flexible control over intrinsic styles, outperforming existing state-of-the-art methods.
- Score: 66.34929233269409
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Talking Head Generation (THG), typically driven by audio, is an important and challenging task with broad application prospects in various fields such as digital humans, film production, and virtual reality. While diffusion model-based THG methods present high quality and stable content generation, they often overlook the intrinsic style, which encompasses personalized features such as the speaking habits and facial expressions of a video. As a consequence, the generated video content lacks diversity and vividness and is thus limited in real-life scenarios. To address these issues, we propose a novel framework named Style-Enhanced Vivid Portrait (SVP), which fully leverages style-related information in THG. Specifically, we first introduce a novel probabilistic style prior learning scheme to model the intrinsic style as a Gaussian distribution using facial expressions and audio embeddings. The distribution is learned through a bespoke contrastive objective, effectively capturing the dynamic style information in each video. Then we finetune a pretrained Stable Diffusion (SD) model to inject the learned intrinsic style as a controlling signal via cross attention. Experiments show that our model generates diverse, vivid, and high-quality videos with flexible control over intrinsic styles, outperforming existing state-of-the-art methods.
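The abstract names two mechanisms: a probabilistic style prior learned contrastively from facial-expression and audio features, and injection of the sampled style into a diffusion U-Net via cross attention. The sketch below illustrates both ideas in PyTorch; it is not the authors' implementation, and every module name, dimension, and the exact (InfoNCE-style) form of the contrastive loss are assumptions made for illustration.

```python
# Minimal sketch of (1) a Gaussian style prior over expression + audio features
# trained with a contrastive objective and (2) cross-attention style injection.
# All shapes, names, and the loss form are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StylePrior(nn.Module):
    """Maps per-clip expression and audio features to a Gaussian style prior."""

    def __init__(self, expr_dim=64, audio_dim=128, style_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(expr_dim + audio_dim, 512), nn.ReLU(),
            nn.Linear(512, 2 * style_dim),  # predicts mean and log-variance
        )

    def forward(self, expr_feat, audio_feat):
        h = self.encoder(torch.cat([expr_feat, audio_feat], dim=-1))
        mu, logvar = h.chunk(2, dim=-1)
        style = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return style, mu, logvar


def contrastive_style_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style objective: styles of clips from the same video are pulled
    together, styles from other videos are pushed apart (an assumed stand-in
    for the paper's bespoke contrastive objective)."""
    pos = F.cosine_similarity(anchor, positive, dim=-1) / temperature                # (B,)
    neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1) / temperature  # (B, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)                               # positive at index 0
    return F.cross_entropy(logits, torch.zeros(anchor.size(0), dtype=torch.long))


class StyleCrossAttention(nn.Module):
    """Injects the style vector into U-Net feature tokens as an extra key/value."""

    def __init__(self, feat_dim=320, style_dim=256, heads=8):
        super().__init__()
        self.to_kv = nn.Linear(style_dim, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)

    def forward(self, unet_tokens, style):
        kv = self.to_kv(style).unsqueeze(1)      # (B, 1, feat_dim)
        out, _ = self.attn(unet_tokens, kv, kv)  # queries come from the U-Net features
        return unet_tokens + out                 # residual injection


if __name__ == "__main__":
    B, K = 4, 7
    prior = StylePrior()
    style_a, _, _ = prior(torch.randn(B, 64), torch.randn(B, 128))  # clip A of each video
    style_b, _, _ = prior(torch.randn(B, 64), torch.randn(B, 128))  # another clip, same video
    loss = contrastive_style_loss(style_a, style_b, torch.randn(B, K, 256))
    tokens = StyleCrossAttention()(torch.randn(B, 77, 320), style_a)
    print(loss.item(), tokens.shape)
```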
Related papers
- MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes [74.82911268630463]
Talking face generation (TFG) aims to animate a target identity's face to create realistic talking videos.
MimicTalk exploits the rich knowledge from a NeRF-based person-agnostic generic model for improving the efficiency and robustness of personalized TFG.
Experiments show that our MimicTalk surpasses previous baselines regarding video quality, efficiency, and expressiveness.
arXiv Detail & Related papers (2024-10-09T10:12:37Z)
- MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence [62.72540590546812]
MovieDreamer is a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering.
We present experiments across various movie genres, demonstrating that our approach achieves superior visual and narrative quality.
arXiv Detail & Related papers (2024-07-23T17:17:05Z)
- WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z)
- Say Anything with Any Style [9.50806457742173]
Say Anything with Any Style queries the discrete style representation via a generative model with a learned style codebook.
Our approach surpasses state-of-the-art methods in terms of both lip-synchronization and stylized expression.
arXiv Detail & Related papers (2024-03-11T01:20:03Z)
- HiCAST: Highly Customized Arbitrary Style Transfer with Adapter Enhanced Diffusion Models [84.12784265734238]
The goal of Arbitrary Style Transfer (AST) is to inject the artistic features of a style reference into a given image/video.
We propose HiCAST, which is capable of explicitly customizing the stylization results according to various sources of semantic clues.
A novel learning objective is leveraged for video diffusion model training, which significantly improves cross-frame temporal consistency.
arXiv Detail & Related papers (2024-01-11T12:26:23Z)
- One Style is All you Need to Generate a Video [0.9558392439655012]
We introduce a novel temporal generator based on a set of learned sinusoidal bases.
Our method learns dynamic representations of various actions that are independent of image content and can be transferred between different actors.
arXiv Detail & Related papers (2023-10-27T01:17:48Z)
- DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models [24.401443462720135]
We propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder.
In particular, our style includes the generation of head poses, thereby enhancing user perception.
We address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset.
arXiv Detail & Related papers (2023-09-30T17:01:18Z)
- Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer [13.098901971644656]
This paper proposes a zero-shot video stylization method named Style-A-Video.
It uses a generative pre-trained transformer with an image latent diffusion model to achieve concise, text-controlled video stylization.
Tests show that we can attain superior content preservation and stylistic performance while incurring lower computational cost than previous solutions.
arXiv Detail & Related papers (2023-05-09T14:03:27Z)
- Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation [54.68893964373141]
Talking face generation has historically struggled to produce head movements and natural facial expressions without guidance from additional reference videos.
Recent developments in diffusion-based generative models allow for more realistic and stable data synthesis.
We present an autoregressive diffusion model that requires only one identity image and audio sequence to generate a video of a realistic talking human head.
arXiv Detail & Related papers (2023-01-06T14:16:54Z)
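The Diffused Heads entry above describes an autoregressive rollout: each frame is produced by a reverse diffusion pass conditioned on a single identity image, the audio window for that frame, and previously generated frames. The sketch below is a generic, assumed illustration of such a loop, using deterministic DDIM-style updates and a placeholder denoiser signature; it is not the paper's released code.

```python
# Hypothetical autoregressive talking-head sampling loop, for illustration only.
import torch


def sample_video(denoiser, identity_img, audio_windows, n_motion_frames=2, timesteps=50):
    """identity_img: (1, 3, H, W) single reference image; audio_windows: (T, A)."""
    C, H, W = identity_img.shape[1:]
    frames = [identity_img.clone() for _ in range(n_motion_frames)]  # bootstrap the motion context
    betas = torch.linspace(1e-4, 0.02, timesteps)
    alphas_cum = torch.cumprod(1.0 - betas, dim=0)

    for audio in audio_windows:
        x = torch.randn(1, C, H, W)                            # each frame starts from pure noise
        motion = torch.cat(frames[-n_motion_frames:], dim=1)   # previous frames as conditioning
        for t in reversed(range(timesteps)):                   # deterministic DDIM-style reverse pass
            eps = denoiser(x, identity_img, motion, audio.unsqueeze(0), torch.tensor([t]))
            a_t = alphas_cum[t]
            a_prev = alphas_cum[t - 1] if t > 0 else torch.tensor(1.0)
            x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()     # predicted clean frame
            x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
        frames.append(x)

    return torch.stack(frames[n_motion_frames:], dim=1)        # (1, T, C, H, W)


if __name__ == "__main__":
    # Dummy denoiser (ignores its conditioning) just to show the call pattern.
    dummy = lambda x, identity, motion, audio, t: torch.zeros_like(x)
    video = sample_video(dummy, torch.randn(1, 3, 64, 64), torch.randn(8, 80))
    print(video.shape)  # torch.Size([1, 8, 3, 64, 64])
```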