DiT-Head: High-Resolution Talking Head Synthesis using Diffusion
Transformers
- URL: http://arxiv.org/abs/2312.06400v1
- Date: Mon, 11 Dec 2023 14:09:56 GMT
- Title: DiT-Head: High-Resolution Talking Head Synthesis using Diffusion
Transformers
- Authors: Aaron Mir, Eduardo Alonso and Esther Mondragón
- Abstract summary: "DiT-Head" is based on diffusion transformers and uses audio as a condition to drive the denoising process of a diffusion model.
We train and evaluate our proposed approach and compare it against existing methods of talking head synthesis.
- Score: 2.1408617023874443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel talking head synthesis pipeline called "DiT-Head", which
is based on diffusion transformers and uses audio as a condition to drive the
denoising process of a diffusion model. Our method is scalable and can
generalise to multiple identities while producing high-quality results. We
train and evaluate our proposed approach and compare it against existing
methods of talking head synthesis. We show that our model can compete with
these methods in terms of visual quality and lip-sync accuracy. Our results
highlight the potential of our proposed approach to be used for a wide range of
applications, including virtual assistants, entertainment, and education. For a
video demonstration of the results and our user study, please refer to our
supplementary material.
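As a rough illustration of the pipeline the abstract describes (a transformer-based denoiser whose denoising process is driven by audio), the PyTorch sketch below shows one way such audio-conditioned denoising could be wired up. This is not the authors' implementation: the cross-attention conditioning scheme, layer sizes, and tensor shapes are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class AudioConditionedBlock(nn.Module):
    """Transformer block with cross-attention to audio features (illustrative)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, audio):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, audio, audio)[0]  # audio features condition the denoising
        return x + self.mlp(self.norm3(x))

class AudioConditionedDenoiser(nn.Module):
    """Predicts the noise in a sequence of latent image patches, given audio
    features and a diffusion timestep. All dimensions are assumptions."""
    def __init__(self, patch_dim=64, audio_dim=80, dim=256, depth=4):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, dim)
        self.audio_embed = nn.Linear(audio_dim, dim)
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.blocks = nn.ModuleList(AudioConditionedBlock(dim) for _ in range(depth))
        self.out = nn.Linear(dim, patch_dim)

    def forward(self, noisy_patches, audio_feats, t):
        # noisy_patches: (B, N, patch_dim); audio_feats: (B, T, audio_dim); t: (B,)
        x = self.patch_embed(noisy_patches) + self.time_embed(t.float()[:, None])[:, None, :]
        a = self.audio_embed(audio_feats)
        for blk in self.blocks:
            x = blk(x, a)
        return self.out(x)  # predicted noise for each latent patch

# Toy forward pass: batch of 2, 16 latent patches per frame, 20 audio feature frames.
model = AudioConditionedDenoiser()
eps_hat = model(torch.randn(2, 16, 64), torch.randn(2, 20, 80), torch.tensor([10, 500]))
print(eps_hat.shape)  # torch.Size([2, 16, 64])
```

In a full diffusion pipeline, a denoiser of this kind would be trained to predict the added noise at each timestep and then run iteratively at inference, with the audio features kept fixed across denoising steps.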
Related papers
- LaDTalk: Latent Denoising for Synthesizing Talking Head Videos with High Frequency Details [14.22392871407274]
We present an effective post-processing approach to synthesize photo-realistic talking head videos.
Specifically, we employ a pretrained Wav2Lip model as our foundation model, leveraging its robust audio-lip alignment capabilities.
Results indicate that our method, LaDTalk, achieves new state-of-the-art video quality and out-of-domain lip synchronization performance.
arXiv Detail & Related papers (2024-10-01T18:32:02Z)
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from an audio signal.
To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z)
- Parametric Implicit Face Representation for Audio-Driven Facial Reenactment [52.33618333954383]
We propose a novel audio-driven facial reenactment framework that is both controllable and can generate high-quality talking heads.
Specifically, our parametric implicit representation parameterizes the implicit representation with interpretable parameters of 3D face models.
Our method can generate more realistic results than previous methods with greater fidelity to the identities and talking styles of speakers.
arXiv Detail & Related papers (2023-06-13T07:08:22Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation that achieves both fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis [62.297513028116576]
GeneFace is a general and high-fidelity NeRF-based talking face generation method.
A head-aware torso-NeRF is proposed to eliminate the head-torso separation problem.
arXiv Detail & Related papers (2023-01-31T05:56:06Z)
- DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation [78.08004432704826]
We model talking head generation as an audio-driven, temporally coherent denoising process (DiffTalk).
In this paper, we investigate the control mechanism of the talking face, and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis.
Our DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost.
arXiv Detail & Related papers (2023-01-10T05:11:25Z)
- Speech Enhancement and Dereverberation with Diffusion-based Generative Models [14.734454356396157]
We present a detailed overview of the diffusion process that is based on a differential equation.
We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates.
In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models.
arXiv Detail & Related papers (2022-08-11T13:55:12Z)
- Towards Realistic Visual Dubbing with Heterogeneous Sources [22.250010330418398]
Few-shot visual dubbing involves synchronizing the lip movements with arbitrary speech input for any talking head.
We propose a simple yet efficient two-stage framework with a higher flexibility of mining heterogeneous data.
Our method makes it possible to independently utilize the training corpus for two-stage sub-networks.
arXiv Detail & Related papers (2022-01-17T07:57:24Z)
- Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme [4.053320933149689]
The most challenging setting, often referred to as one-shot many-to-many voice conversion, consists of copying the target voice from only a single reference utterance in the most general case, where neither the source nor the target speaker belongs to the training dataset.
We present a scalable high-quality solution based on diffusion probabilistic modeling and demonstrate its superior quality compared to state-of-the-art one-shot voice conversion approaches.
arXiv Detail & Related papers (2021-09-28T15:48:22Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)