Talking Head from Speech Audio using a Pre-trained Image Generator
- URL: http://arxiv.org/abs/2209.04252v1
- Date: Fri, 9 Sep 2022 11:20:37 GMT
- Title: Talking Head from Speech Audio using a Pre-trained Image Generator
- Authors: Mohammed M. Alghamdi, He Wang, Andrew J. Bulpitt, David C. Hogg
- Abstract summary: We propose a novel method for generating high-resolution videos of talking-heads from speech audio and a single 'identity' image.
We model each frame as a point in the latent space of StyleGAN so that a video corresponds to a trajectory through the latent space.
We train a recurrent neural network to map from speech utterances to displacements in the latent space of the image generator.
- Score: 5.659018934205065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel method for generating high-resolution videos of
talking-heads from speech audio and a single 'identity' image. Our method is
based on a convolutional neural network model that incorporates a pre-trained
StyleGAN generator. We model each frame as a point in the latent space of
StyleGAN so that a video corresponds to a trajectory through the latent space.
The network is trained in two stages. The first stage is to model trajectories
in the latent space conditioned on speech utterances. To do this, we use an
existing encoder to invert the generator, mapping from each video frame into
the latent space. We train a recurrent neural network to map from speech
utterances to displacements in the latent space of the image generator. These
displacements are relative to the back-projection into the latent space of an
identity image chosen from the individuals depicted in the training dataset. In
the second stage, we improve the visual quality of the generated videos by
tuning the image generator on a single image or a short video of any chosen
identity. We evaluate our model on standard measures (PSNR, SSIM, FID and LMD)
and show that it significantly outperforms recent state-of-the-art methods on
one of two commonly used datasets and gives comparable performance on the
other. Finally, we report on ablation experiments that validate the components
of the model. The code and videos from experiments can be found at
https://mohammedalghamdi.github.io/talking-heads-acm-mm
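The stage-one mapping described in the abstract lends itself to a compact sketch. The PyTorch code below is a minimal, hypothetical illustration (the module names, feature dimensions, the GRU choice, and the 18-layer W+ latent layout are all assumptions, not details from the paper): a recurrent network turns per-frame speech features into latent displacements, these are added to the inverted latent code of the identity image, and a frozen pre-trained StyleGAN generator decodes the resulting trajectory into frames.
```python
import torch
import torch.nn as nn


class Speech2LatentDisplacement(nn.Module):
    """Hypothetical sketch: map a sequence of speech features to per-frame
    displacements in the latent space of a pre-trained StyleGAN."""

    def __init__(self, audio_dim=80, hidden_dim=512, num_layers=2,
                 num_ws=18, w_dim=512):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden_dim, num_layers, batch_first=True)
        self.to_delta = nn.Linear(hidden_dim, num_ws * w_dim)
        self.num_ws, self.w_dim = num_ws, w_dim

    def forward(self, audio_feats, w_identity):
        # audio_feats: (B, T, audio_dim) per-frame speech features (assumed mels)
        # w_identity:  (B, num_ws, w_dim) inversion of the identity image
        h, _ = self.rnn(audio_feats)                        # (B, T, hidden_dim)
        delta = self.to_delta(h)                            # (B, T, num_ws * w_dim)
        delta = delta.view(delta.shape[0], delta.shape[1], self.num_ws, self.w_dim)
        # Each frame's latent is the identity code plus a speech-driven displacement.
        return w_identity.unsqueeze(1) + delta              # (B, T, num_ws, w_dim)


def render_frames(generator, w_trajectory):
    """Decode a latent trajectory into frames with a frozen generator; `generator`
    stands in for a pre-trained StyleGAN synthesis network taking W+ codes."""
    frames = []
    with torch.no_grad():
        for t in range(w_trajectory.shape[1]):
            frames.append(generator(w_trajectory[:, t]))    # (B, 3, H, W)
    return torch.stack(frames, dim=1)                       # (B, T, 3, H, W)
```
In training, the predicted trajectory would be compared against latents obtained by inverting the ground-truth frames with the existing encoder; the second stage (fine-tuning the generator on a single image or short video of the chosen identity) is not shown.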
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
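The "distribute features evenly across a limited number of learnable clusters" step is usually realised with a Sinkhorn-Knopp normalisation; the snippet below is a generic, hedged illustration of that balancing step (the function name, iteration count, and temperature are assumptions, not SIGMA's implementation).
```python
import torch


def sinkhorn_assignments(scores, n_iters=3, eps=0.05):
    """Generic Sinkhorn-Knopp sketch: turn tube-to-cluster similarity scores
    (N tubes x K clusters) into soft assignments whose cluster columns carry
    (approximately) equal total mass, i.e. a balanced use of all clusters."""
    Q = torch.exp(scores / eps)                  # (N, K), strictly positive
    Q = Q / Q.sum()                              # normalise total mass to 1
    N, K = Q.shape
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=0, keepdim=True) / K   # each cluster gets mass 1/K
        Q = Q / Q.sum(dim=1, keepdim=True) / N   # each tube gets mass 1/N
    return Q * N                                 # rows sum to 1: soft assignments


# Assumed usage: scores = tube_features @ cluster_prototypes.T (both L2-normalised)
```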
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models [24.401443462720135]
We propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder.
In particular, our style includes the generation of head poses, thereby enhancing user perception.
We address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset.
arXiv Detail & Related papers (2023-09-30T17:01:18Z)
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
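To make the two-stage structure concrete, here is a hedged skeleton of the interfaces it implies: an audio-to-landmark predictor followed by a landmark-to-video renderer that also takes a reference appearance frame. All module internals below are toy placeholders, not the paper's architecture.
```python
import torch
import torch.nn as nn


class AudioToLandmarks(nn.Module):
    """Placeholder stage 1: predict a sequence of 2D facial landmarks from audio."""

    def __init__(self, audio_dim=80, hidden=256, num_landmarks=68):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_landmarks * 2)

    def forward(self, audio_feats):                        # (B, T, audio_dim)
        h, _ = self.rnn(audio_feats)                       # (B, T, hidden)
        return self.head(h).view(h.shape[0], h.shape[1], -1, 2)   # (B, T, 68, 2)


class LandmarksToVideo(nn.Module):
    """Placeholder stage 2: render a frame from a rasterised landmark map plus a
    reference appearance frame (the 'appearance prior')."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, kernel_size=3, padding=1))

    def forward(self, landmark_map, reference_frame):
        # landmark_map: (B, 1, H, W); reference_frame: (B, 3, H, W)
        return self.net(torch.cat([reference_frame, landmark_map], dim=1))
```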
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
- DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder [20.814063371439904]
We propose DAE-Talker to synthesize full video frames and produce natural head movements that align with the content of speech.
We also introduce pose modelling in speech2latent for pose controllability.
Our experiments show that DAE-Talker outperforms existing popular methods in lip-sync, video fidelity, and pose naturalness.
arXiv Detail & Related papers (2023-03-30T17:18:31Z)
- HQ3DAvatar: High Quality Controllable 3D Head Avatar [65.70885416855782]
This paper presents a novel approach to building highly photorealistic digital head avatars.
Our method learns a canonical space via an implicit function parameterized by a neural network.
At test time, our method is driven by a monocular RGB video.
arXiv Detail & Related papers (2023-03-25T13:56:33Z)
- ViewFormer: NeRF-free Neural Rendering from Few Images Using Transformers [34.4824364161812]
Novel view synthesis is a problem where we are given only a few context views sparsely covering a scene or an object.
The goal is to predict novel viewpoints in the scene, which requires learning priors.
We propose a 2D-only method that maps multiple context views and a query pose to a new image in a single pass of a neural network.
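As a hedged sketch of that single-pass formulation (not ViewFormer's actual architecture; the patch embedding, token layout, and output resolution are toy assumptions), a transformer can attend from a query-pose token over tokens of the context views and decode the pose token into the new image:
```python
import torch
import torch.nn as nn


class FewShotViewSynthesis(nn.Module):
    """Hypothetical single-pass, NeRF-free sketch: tokens from a few context views
    plus a query-pose token pass through a transformer encoder, and the pose
    token's output is decoded into a (toy, patch-sized) image."""

    def __init__(self, patch_dim=3 * 16 * 16, dim=256, pose_dim=12):
        super().__init__()
        self.embed_patch = nn.Linear(patch_dim, dim)
        self.embed_pose = nn.Linear(pose_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decode = nn.Linear(dim, patch_dim)

    def forward(self, context_patches, query_pose):
        # context_patches: (B, N, patch_dim) flattened patches from the context views
        # query_pose:      (B, pose_dim)     flattened camera parameters
        tokens = torch.cat([self.embed_pose(query_pose).unsqueeze(1),
                            self.embed_patch(context_patches)], dim=1)
        out = self.encoder(tokens)                 # (B, 1 + N, dim)
        return self.decode(out[:, 0])              # toy output for the query view
```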
arXiv Detail & Related papers (2022-03-18T21:08:23Z)
- Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation [61.8546794105462]
We propose Semantic-aware Speaking Portrait NeRF (SSP-NeRF), which creates delicate audio-driven portraits using one unified set of NeRF.
We first propose a Semantic-Aware Dynamic Ray Sampling module with an additional parsing branch that facilitates audio-driven volume rendering.
To enable portrait rendering in one unified neural radiance field, a Torso Deformation module is designed to stabilize the large-scale non-rigid torso motions.
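The semantic-aware ray sampling idea can be illustrated with a small, hedged sketch: sample more camera rays at pixels the parsing branch marks as important. The weighting rule and function below are illustrative assumptions, not SSP-NeRF's exact module.
```python
import torch


def parsing_weighted_ray_sampling(parsing_weights, num_rays):
    """Hedged sketch of semantic-aware ray sampling: draw more camera rays at
    pixels whose semantic parsing weight is high (e.g. lips, eyes)."""
    H, W = parsing_weights.shape                     # per-pixel importance weights
    probs = parsing_weights.flatten().clamp(min=1e-8)
    probs = probs / probs.sum()
    idx = torch.multinomial(probs, num_rays, replacement=False)
    return torch.stack([idx // W, idx % W], dim=-1)  # (num_rays, 2) pixel coords


# Assumed usage: pixels = parsing_weighted_ray_sampling(torch.rand(256, 256), 1024)
```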
arXiv Detail & Related papers (2022-01-19T18:54:41Z)
- Learned Spatial Representations for Few-shot Talking-Head Synthesis [68.3787368024951]
We propose a novel approach for few-shot talking-head synthesis.
We show that this disentangled representation leads to a significant improvement over previous methods.
arXiv Detail & Related papers (2021-04-29T17:59:42Z)
- Robust One Shot Audio to Video Generation [10.957973845883162]
OneShotA2V is a novel approach to synthesize a talking person video of arbitrary length using an audio signal and a single unseen image of a person as input.
OneShotA2V leverages curriculum learning to learn movements of expressive facial components and hence generates a high-quality talking-head video of the given person.
arXiv Detail & Related papers (2020-12-14T10:50:05Z)
- Everybody's Talkin': Let Me Talk as You Want [134.65914135774605]
We present a method to edit target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video.
It does not assume a person-specific rendering network, yet it is capable of translating arbitrary source audio into arbitrary video output.
arXiv Detail & Related papers (2020-01-15T09:54:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.