Related papers: Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation

Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation

URL: http://arxiv.org/abs/2301.03396v2
Date: Sat, 29 Jul 2023 19:45:54 GMT
Title: Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation
Authors: Micha{\l} Stypu{\l}kowski, Konstantinos Vougioukas, Sen He, Maciej Zi\k{e}ba, Stavros Petridis, Maja Pantic
Abstract summary: Talking face generation has historically struggled to produce head movements and natural facial expressions without guidance from additional reference videos. Recent developments in diffusion-based generative models allow for more realistic and stable data synthesis. We present an autoregressive diffusion model that requires only one identity image and audio sequence to generate a video of a realistic talking human head.
Score: 54.68893964373141
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Talking face generation has historically struggled to produce head movements and natural facial expressions without guidance from additional reference videos. Recent developments in diffusion-based generative models allow for more realistic and stable data synthesis and their performance on image and video generation has surpassed that of other generative models. In this work, we present an autoregressive diffusion model that requires only one identity image and audio sequence to generate a video of a realistic talking human head. Our solution is capable of hallucinating head movements, facial expressions, such as blinks, and preserving a given background. We evaluate our model on two different datasets, achieving state-of-the-art results on both of them.

Related papers

ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model [41.35209566957009]
Speech-driven 3D facial animation aims to generate realistic lip movements and facial expressions for 3D head models from arbitrary audio clips. We introduce a novel autoregressive model that achieves real-time generation of highly synchronized lip movements and realistic head poses and eye blinks.
arXiv Detail & Related papers (2025-02-27T17:49:01Z)
DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation [50.66658181705527]
We present DAWN, a framework that enables all-at-once generation of dynamic-length video sequences. DAWN consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation. Our method generates authentic and vivid videos with precise lip motions, and natural pose/blink movements.
arXiv Detail & Related papers (2024-10-17T16:32:36Z)
SVP: Style-Enhanced Vivid Portrait Talking Head Diffusion Model [66.34929233269409]
Talking Head Generation (THG) is an important task with broad application prospects in various fields such as digital humans, film production, and virtual reality. We propose a novel framework named Style-Enhanced Vivid Portrait (SVP) which fully leverages style-related information in THG. Our model generates diverse, vivid, and high-quality videos with flexible control over intrinsic styles, outperforming existing state-of-the-art methods.
arXiv Detail & Related papers (2024-09-05T06:27:32Z)
FD2Talk: Towards Generalized Talking Head Generation with Facial Decoupled Diffusion Model [17.011391077181344]
We propose a Facial Decoupled Diffusion model for Talking head generation called FD2Talk. In the initial phase, we design the Diffusion Transformer to accurately predict motion coefficients from raw audio. In the second phase, we encode the reference image to capture appearance textures.
arXiv Detail & Related papers (2024-08-18T07:03:53Z)
Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation [22.159117464397806]
We introduce a two-stage diffusion-based model for talking head generation. The first stage involves generating synchronized facial landmarks based on the given speech. In the second stage, these generated landmarks serve as a condition in the denoising process, aiming to optimize mouth jitter issues and generate high-fidelity, well-synchronized, and temporally coherent talking head videos.
arXiv Detail & Related papers (2024-08-03T10:19:38Z)
AnimateMe: 4D Facial Expressions via Diffusion Models [72.63383191654357]
Recent advances in diffusion models have enhanced the capabilities of generative models in 2D animation. We employ Graph Neural Networks (GNNs) as denoising diffusion models in a novel approach, formulating the diffusion process directly on the mesh space. This facilitates the generation of facial deformations through a mesh-diffusion-based model.
arXiv Detail & Related papers (2024-03-25T21:40:44Z)
FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from audio signal. To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z)
High-Fidelity and Freely Controllable Talking Head Video Generation [31.08828907637289]
We propose a novel model that produces high-fidelity talking head videos with free control over head pose and expression. We introduce a novel motion-aware multi-scale feature alignment module to effectively transfer the motion without face distortion. We evaluate our model on challenging datasets and demonstrate its state-of-the-art performance.
arXiv Detail & Related papers (2023-04-20T09:02:41Z)
DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder [55.58582254514431]
We propose DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech. We also introduce pose modelling in speech2latent for pose controllability. Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness.
arXiv Detail & Related papers (2023-03-30T17:18:31Z)
Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion [34.406907667904996]
We propose an audio-driven talking-head method to generate photo-realistic talking-head videos from a single reference image. We first design a head pose predictor by modeling rigid 6D head movements with a motion-aware recurrent neural network (RNN) Then, we develop a motion field generator to produce the dense motion fields from input audio, head poses, and a reference image.
arXiv Detail & Related papers (2021-07-20T07:22:42Z)
Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose [67.31838207805573]
We propose a deep neural network model that takes an audio signal A of a source person and a short video V of a target person as input. We outputs a synthesized high-quality talking face video with personalized head pose. Our method can generate high-quality talking face videos with more distinguishing head movement effects than state-of-the-art methods.
arXiv Detail & Related papers (2020-02-24T10:02:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.