AudioVisual Speech Synthesis: A brief literature review
- URL: http://arxiv.org/abs/2103.03927v1
- Date: Thu, 18 Feb 2021 19:13:48 GMT
- Title: AudioVisual Speech Synthesis: A brief literature review
- Authors: Efthymios Georgiou, Athanasios Katsamanis
- Abstract summary: We study the problem of audiovisual speech synthesis, which is the problem of generating an animated talking head given a text as input.
For TTS, we present models that are used to map text to intermediate acoustic representations.
For the talking-head animation problem, we categorize approaches based on whether they produce human faces or anthropomorphic figures.
- Score: 4.148192541851448
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This brief literature review studies the problem of audiovisual speech
synthesis, which is the problem of generating an animated talking head given a
text as input. Due to the high complexity of this problem, we approach it as
the composition of two problems. Specifically, that of Text-to-Speech (TTS)
synthesis as well as the voice-driven talking head animation. For TTS, we
present models that are used to map text to intermediate acoustic
representations, e.g. mel-spectrograms, as well as models that generate voice
signals conditioned on these intermediate representations, i.e., vocoders. For
the talking-head animation problem, we categorize approaches based on whether
they produce human faces or anthropomorphic figures. An attempt is also made to
discuss the importance of the choice of facial models in the second case.
Throughout the review, we briefly describe the most important work in
audiovisual speech synthesis, trying to highlight the advantages and
disadvantages of the various approaches.
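To make the two-stage decomposition above concrete, the following is a minimal, purely illustrative PyTorch sketch of the pipeline (text tokens -> acoustic model -> mel-spectrogram -> vocoder -> waveform). All module names, layer sizes, and hyperparameters are placeholder assumptions, not any specific model surveyed in the review.

```python
# Minimal sketch of the two-stage TTS pipeline discussed in the review:
# text -> acoustic model -> mel-spectrogram -> vocoder -> waveform.
# All module names, sizes, and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn


class ToyAcousticModel(nn.Module):
    """Maps a sequence of token ids to a mel-spectrogram (T x n_mels)."""

    def __init__(self, vocab_size=64, hidden=128, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, n_mels)

    def forward(self, tokens):                  # tokens: (batch, T_text)
        h, _ = self.encoder(self.embed(tokens))
        return self.to_mel(h)                   # (batch, T_text, n_mels)


class ToyVocoder(nn.Module):
    """Upsamples mel frames to a waveform; stands in for a neural vocoder."""

    def __init__(self, n_mels=80, upsample=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, upsample),        # one waveform chunk per mel frame
            nn.Tanh(),
        )

    def forward(self, mel):                     # mel: (batch, T, n_mels)
        chunks = self.net(mel)                  # (batch, T, upsample)
        return chunks.reshape(mel.size(0), -1)  # (batch, T * upsample) samples


if __name__ == "__main__":
    tokens = torch.randint(0, 64, (1, 20))      # fake "text" of 20 token ids
    mel = ToyAcousticModel()(tokens)
    wav = ToyVocoder()(mel)
    print(mel.shape, wav.shape)                 # (1, 20, 80) and (1, 5120)
```

In practice the acoustic model and vocoder are trained separately (or jointly in end-to-end systems), but the interface between them, a frame-level mel-spectrogram, is the same as in this toy example.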
Related papers
- Neural Text to Articulate Talk: Deep Text to Audiovisual Speech
Synthesis achieving both Auditory and Photo-realism [26.180371869137257]
The state of the art in talking face generation focuses mainly on lip-syncing and is conditioned on audio clips.
NEUral Text to ARticulate Talk (NEUTART) is a talking face generator that uses a joint audiovisual feature space.
The model produces photorealistic talking face videos with human-like articulation and well-synchronized audiovisual streams.
arXiv Detail & Related papers (2023-12-11T18:41:55Z) - Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596]
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
arXiv Detail & Related papers (2023-06-28T08:22:53Z) - Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
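As a rough illustration of what such a fusion baseline could look like (not the VA-TTS model itself), the sketch below cross-attends phoneme embeddings over listener video features and decodes mel frames; every module choice and dimension is an assumption made for this example.

```python
# Illustrative sketch (not the VA-TTS baseline): fuse phoneme embeddings with
# per-frame listener visual features via cross-attention, then decode mels.
# All shapes and module choices are assumptions made for this example.
import torch
import torch.nn as nn


class PhonemeVisualFusion(nn.Module):
    def __init__(self, n_phonemes=100, d=128, d_visual=512, n_mels=80):
        super().__init__()
        self.phoneme_embed = nn.Embedding(n_phonemes, d)
        self.visual_proj = nn.Linear(d_visual, d)       # project listener features
        self.cross_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.to_mel = nn.Linear(d, n_mels)

    def forward(self, phonemes, listener_feats):
        # phonemes: (B, T_ph) ids; listener_feats: (B, T_vid, d_visual)
        q = self.phoneme_embed(phonemes)                # queries from the text side
        kv = self.visual_proj(listener_feats)           # keys/values from listener video
        fused, _ = self.cross_attn(q, kv, kv)           # attend over listener frames
        return self.to_mel(fused + q)                   # residual fusion -> mel frames


if __name__ == "__main__":
    phonemes = torch.randint(0, 100, (2, 30))
    listener = torch.randn(2, 75, 512)                  # e.g. 3 s of video at 25 fps
    print(PhonemeVisualFusion()(phonemes, listener).shape)  # (2, 30, 80)
```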
arXiv Detail & Related papers (2023-06-21T05:11:39Z) - Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis [66.43223397997559]
We aim to synthesize high-quality talking portrait videos corresponding to the input text.
This task has broad application prospects in the digital human industry but has not yet been fully achieved technically.
We introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which incorporates a generic zero-shot multi-speaker Text-to-Speech model.
arXiv Detail & Related papers (2023-06-06T08:50:13Z) - NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot
Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, synthesis robustness, and voice quality in a zero-shot setting.
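The residual vector quantization used by such neural audio codecs can be illustrated generically; the sketch below uses random (untrained) codebooks, so it demonstrates only the encode/decode mechanics, not the actual NaturalSpeech 2 codec.

```python
# Generic residual vector quantization (RVQ), the kind of quantizer a neural
# audio codec uses to turn continuous latent frames into discrete codes.
# Codebooks here are random; in a real codec they are learned, so the
# reconstruction error printed at the end is only illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_quantizers, codebook_size, dim = 4, 256, 64
codebooks = rng.normal(size=(n_quantizers, codebook_size, dim))


def rvq_encode(latents):
    """latents: (T, dim) frames -> codes: (T, n_quantizers) integer indices."""
    residual = latents.copy()
    codes = np.zeros((latents.shape[0], n_quantizers), dtype=np.int64)
    for q in range(n_quantizers):
        # nearest codebook entry for each frame of the current residual
        dists = ((residual[:, None, :] - codebooks[q][None, :, :]) ** 2).sum(-1)
        codes[:, q] = dists.argmin(axis=1)
        residual -= codebooks[q][codes[:, q]]   # quantize what is left over
    return codes


def rvq_decode(codes):
    """Sum the selected entries of each codebook to reconstruct latent frames."""
    return sum(codebooks[q][codes[:, q]] for q in range(codebooks.shape[0]))


frames = rng.normal(size=(10, dim))
codes = rvq_encode(frames)
recon = rvq_decode(codes)
print(codes.shape, np.mean((frames - recon) ** 2))  # (10, 4) and the reconstruction MSE
```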
arXiv Detail & Related papers (2023-04-18T16:31:59Z) - Learning to Dub Movies via Hierarchical Prosody Models [167.6465354313349]
Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker's voice as reference.
We propose a novel movie dubbing architecture to tackle these problems via hierarchical prosody modelling, which bridges the visual information to corresponding speech prosody from three aspects: lip, face, and scene.
arXiv Detail & Related papers (2022-12-08T03:29:04Z) - Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement
by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
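A simplified reading of this idea is to treat code generation as per-frame classification over a codec codebook from fused audio-visual features; the sketch below assumes a single codebook and arbitrary feature dimensions and is not the paper's architecture.

```python
# Sketch of audio-visual code prediction: fuse noisy-audio and lip-video features
# and classify, per frame, which entry of a speech codec's codebook to emit.
# The single codebook and all dimensions are simplifying assumptions.
import torch
import torch.nn as nn


class AVCodePredictor(nn.Module):
    def __init__(self, d_audio=80, d_video=256, d=128, codebook_size=1024):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d)
        self.video_proj = nn.Linear(d_video, d)
        self.fuse = nn.GRU(2 * d, d, batch_first=True)
        self.classify = nn.Linear(d, codebook_size)     # logits over codec entries

    def forward(self, noisy_audio, lip_video):
        # noisy_audio: (B, T, d_audio); lip_video: (B, T, d_video), time-aligned
        x = torch.cat([self.audio_proj(noisy_audio),
                       self.video_proj(lip_video)], dim=-1)
        h, _ = self.fuse(x)
        return self.classify(h)                         # (B, T, codebook_size)


if __name__ == "__main__":
    logits = AVCodePredictor()(torch.randn(1, 100, 80), torch.randn(1, 100, 256))
    codes = logits.argmax(dim=-1)                       # predicted code indices
    print(codes.shape)                                  # (1, 100)
```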
arXiv Detail & Related papers (2022-03-31T17:57:10Z) - Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation [46.8780140220063]
We present a joint audio-text model to capture contextual information for expressive speech-driven 3D facial animation.
Our hypothesis is that the text features can disambiguate the variations in upper face expressions, which are not strongly correlated with the audio.
We show that the combined acoustic and textual modalities can synthesize realistic facial expressions while maintaining audio-lip synchronization.
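At its simplest, such a joint model regresses per-frame facial parameters from fused acoustic and textual features; the sketch below outputs generic blendshape coefficients and is an assumption-laden illustration rather than the paper's actual design.

```python
# Sketch of a joint audio-text model for speech-driven 3D facial animation:
# fuse per-frame acoustic features with (upsampled) text features and regress
# blendshape coefficients. The output parameterization and shapes are assumptions.
import torch
import torch.nn as nn


class AudioTextFaceAnimator(nn.Module):
    def __init__(self, d_audio=80, d_text=128, d=128, n_blendshapes=52):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d)
        self.text_proj = nn.Linear(d_text, d)
        self.temporal = nn.GRU(2 * d, d, batch_first=True)
        self.to_face = nn.Linear(d, n_blendshapes)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (B, T, d_audio); text_feats: (B, T, d_text), already
        # aligned/upsampled to the same frame rate as the audio features.
        x = torch.cat([self.audio_proj(audio_feats),
                       self.text_proj(text_feats)], dim=-1)
        h, _ = self.temporal(x)
        return torch.sigmoid(self.to_face(h))           # (B, T, n_blendshapes) in [0, 1]


if __name__ == "__main__":
    out = AudioTextFaceAnimator()(torch.randn(2, 120, 80), torch.randn(2, 120, 128))
    print(out.shape)                                    # (2, 120, 52)
```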
arXiv Detail & Related papers (2021-12-04T01:37:22Z)