Audiovisual Speech Synthesis using Tacotron2
- URL: http://arxiv.org/abs/2008.00620v2
- Date: Mon, 30 Aug 2021 02:54:46 GMT
- Title: Audiovisual Speech Synthesis using Tacotron2
- Authors: Ahmed Hussen Abdelaziz, Anushree Prasanna Kumar, Chloe Seivwright,
Gabriele Fanelli, Justin Binder, Yannis Stylianou, Sachin Kajarekar
- Abstract summary: We propose and compare two audiovisual speech synthesis systems for 3D face models.
AVTacotron2 is an end-to-end text-to-audiovisual speech synthesizer based on the Tacotron2 architecture.
The second audiovisual speech synthesis system is modular, where acoustic speech is synthesized from text using the traditional Tacotron2.
- Score: 14.206988023567828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audiovisual speech synthesis is the problem of synthesizing a talking face
while maximizing the coherency of the acoustic and visual speech. In this
paper, we propose and compare two audiovisual speech synthesis systems for 3D
face models. The first system is the AVTacotron2, which is an end-to-end
text-to-audiovisual speech synthesizer based on the Tacotron2 architecture.
AVTacotron2 converts a sequence of phonemes representing the sentence to
synthesize into a sequence of acoustic features and the corresponding
controllers of a face model. The output acoustic features are used to condition
a WaveRNN to reconstruct the speech waveform, and the output facial controllers
are used to generate the corresponding video of the talking face. The second
audiovisual speech synthesis system is modular, where acoustic speech is
synthesized from text using the traditional Tacotron2. The reconstructed
acoustic speech signal is then used to drive the facial controls of the face
model using an independently trained audio-to-facial-animation neural network.
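The abstract gives no implementation details, so the following is only an illustrative sketch of the end-to-end idea: a Tacotron2-style decoder state is projected jointly to acoustic features and to face-model controllers, which keeps the two streams frame-aligned. All module names and dimensions (e.g. 80 mel bins, 51 controllers) are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class AVDecoderHead(nn.Module):
    """Sketch of a joint audiovisual output head (assumed sizes, not the paper's code).

    Each decoder hidden state is projected to both a mel-spectrogram frame and a
    vector of 3D face-model controllers, keeping the two streams frame-synchronous.
    """

    def __init__(self, hidden_dim=1024, n_mels=80, n_face_controllers=51):
        super().__init__()
        self.to_mel = nn.Linear(hidden_dim, n_mels)                # acoustic frame
        self.to_face = nn.Linear(hidden_dim, n_face_controllers)   # face-model controls
        self.to_stop = nn.Linear(hidden_dim, 1)                    # end-of-utterance gate

    def forward(self, decoder_state):
        # decoder_state: (batch, time, hidden_dim) from a Tacotron2-style decoder
        mel = self.to_mel(decoder_state)
        face = self.to_face(decoder_state)
        stop = torch.sigmoid(self.to_stop(decoder_state))
        return mel, face, stop


# Usage: one forward pass on dummy decoder states.
head = AVDecoderHead()
states = torch.randn(2, 100, 1024)        # 2 utterances, 100 decoder frames
mel, face, stop = head(states)
print(mel.shape, face.shape, stop.shape)  # (2, 100, 80) (2, 100, 51) (2, 100, 1)
```

In the sketch, the mel frames would condition the WaveRNN vocoder while the controller frames drive the 3D face model, which is the coupling the end-to-end system exploits.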
We further condition both the end-to-end and modular approaches on emotion
embeddings that encode the required prosody to generate emotional audiovisual
speech. We analyze the performance of the two systems and compare them to the
ground truth videos using subjective evaluation tests. The end-to-end and
modular systems are able to synthesize close to human-like audiovisual speech
with mean opinion scores (MOS) of 4.1 and 3.9, respectively, compared to a MOS
of 4.1 for the ground truth generated from professionally recorded videos.
While the end-to-end system gives better overall quality, the modular
approach is more flexible, and the qualities of acoustic and visual speech
synthesis are almost independent of each other.
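For the modular route and the emotion conditioning described above, a rough sketch (again with assumed architecture, names, and sizes rather than the paper's code) is a small sequence model that maps mel frames from a separately trained TTS to face controllers, with a global emotion embedding broadcast over time:

```python
import torch
import torch.nn as nn

class AudioToFaceAnimation(nn.Module):
    """Sketch of an audio-to-facial-animation network (assumed architecture/sizes).

    Maps mel frames produced by a separately trained TTS to face-model controllers,
    conditioned on a global emotion embedding that encodes the desired prosody/expression.
    """

    def __init__(self, n_mels=80, emotion_dim=16, hidden_dim=256, n_face_controllers=51):
        super().__init__()
        self.rnn = nn.GRU(n_mels + emotion_dim, hidden_dim,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, n_face_controllers)

    def forward(self, mel, emotion):
        # mel: (batch, time, n_mels); emotion: (batch, emotion_dim)
        emotion = emotion.unsqueeze(1).expand(-1, mel.size(1), -1)  # broadcast over time
        features, _ = self.rnn(torch.cat([mel, emotion], dim=-1))
        return self.out(features)                                    # (batch, time, n_controllers)


# Usage: drive the face model from ~3 seconds of mel frames and an emotion embedding.
net = AudioToFaceAnimation()
mel = torch.randn(1, 300, 80)
emotion = torch.randn(1, 16)
controllers = net(mel, emotion)
print(controllers.shape)  # torch.Size([1, 300, 51])
```

Because the animation network only sees acoustic features, the TTS and the face-animation stages can be trained and swapped independently, which matches the flexibility claim made for the modular system.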
Related papers
- CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models [74.80386066714229]
We present an improved streaming speech synthesis model, CosyVoice 2.
Specifically, we introduce finite-scalar quantization to improve codebook utilization of speech tokens.
We develop a chunk-aware causal flow matching model to support various synthesis scenarios.
arXiv Detail & Related papers (2024-12-13T12:59:39Z)
- Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism [26.180371869137257]
The state of the art in talking face generation focuses mainly on lip-syncing and is conditioned on audio clips.
NEUral Text to ARticulate Talk (NEUTART) is a talking face generator that uses a joint audiovisual feature space.
The model produces photorealistic talking face videos with human-like articulation and well-synchronized audiovisual streams.
arXiv Detail & Related papers (2023-12-11T18:41:55Z)
- Audio-visual video-to-speech synthesis with synthesized input audio [64.86087257004883]
We investigate the effect of using video and audio inputs for video-to-speech synthesis during both training and inference.
In particular, we use pre-trained video-to-speech models to synthesize the missing speech signals and then train an audio-visual-to-speech synthesis model, using both the silent video and the synthesized speech as inputs, to predict the final reconstructed speech.
arXiv Detail & Related papers (2023-07-31T11:39:05Z)
- Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596]
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
arXiv Detail & Related papers (2023-06-28T08:22:53Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement [40.29155338515071]
ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis.
It achieves superior performance on all LRS3 audio-visual enhancement tasks with a single model.
arXiv Detail & Related papers (2022-12-21T21:36:52Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.