Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis
- URL: http://arxiv.org/abs/2306.03504v2
- Date: Wed, 2 Aug 2023 09:39:05 GMT
- Title: Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis
- Authors: Zhenhui Ye, Ziyue Jiang, Yi Ren, Jinglin Liu, Chen Zhang, Xiang Yin,
Zejun Ma, Zhou Zhao
- Abstract summary: We aim to synthesize high-quality talking portrait videos corresponding to the input text.
This task has broad application prospects in the digital human industry but has not been technically achieved yet.
We introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which pairs a generic zero-shot multi-speaker Text-to-Speech model with neural rendering for audio-driven talking-face generation.
- Score: 66.43223397997559
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We are interested in a novel task, namely low-resource text-to-talking
avatar. Given only a few-minute-long talking person video with the audio track
as the training data and arbitrary texts as the driving input, we aim to
synthesize high-quality talking portrait videos corresponding to the input
text. This task has broad application prospects in the digital human industry
but has not been technically achieved yet due to two challenges: (1) it is
challenging for a traditional multi-speaker Text-to-Speech (TTS) system to
mimic the timbre of out-of-domain audio, and (2) it is hard to render
high-fidelity, lip-synchronized talking avatars with limited training data. In this paper, we
introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which (1) designs a
generic zero-shot multi-speaker TTS model that effectively disentangles text
content, timbre, and prosody; and (2) embraces recent advances in neural
rendering to achieve realistic audio-driven talking face video generation. With
these designs, our method overcomes the aforementioned two challenges and
generates identity-preserving speech and realistic talking-person videos.
Experiments demonstrate that our method can synthesize realistic,
identity-preserving, and audio-visually synchronized talking avatar videos.
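The abstract describes a two-stage design: a zero-shot multi-speaker TTS model clones the target timbre from the short reference audio, and an audio-driven neural renderer fitted on the few-minute video produces the lip-synchronized portrait. Below is a minimal sketch of how these two stages compose; every class and method name is a hypothetical placeholder, since the paper does not expose a concrete API.

```python
# Minimal sketch of the two-stage text-to-talking-avatar pipeline described
# above. All names here are hypothetical placeholders, not the authors'
# released API; only the two-stage structure is taken from the abstract.

from dataclasses import dataclass


@dataclass
class AvatarAssets:
    reference_audio: str   # few-minute audio track of the target speaker
    training_video: str    # the same few-minute talking-person video


class ZeroShotTTS:
    """Stage 1: synthesize speech in the target timbre from arbitrary text.

    A zero-shot multi-speaker TTS model is assumed to disentangle text
    content, timbre, and prosody, so a short reference clip is enough to
    mimic an out-of-domain voice without fine-tuning.
    """

    def synthesize(self, text: str, reference_audio: str) -> str:
        # Returns a path to the generated waveform (placeholder).
        raise NotImplementedError("plug in a zero-shot multi-speaker TTS model")


class NeuralTalkingHead:
    """Stage 2: audio-driven neural rendering of the talking portrait.

    A neural renderer is assumed to be fitted on the few-minute training
    video and then driven by the synthesized audio to produce
    lip-synchronized frames.
    """

    def fit(self, training_video: str) -> None:
        raise NotImplementedError("fit the renderer on the short video")

    def render(self, driving_audio: str) -> str:
        # Returns a path to the rendered talking-portrait video (placeholder).
        raise NotImplementedError("drive the renderer with the audio")


def text_to_talking_avatar(text: str, assets: AvatarAssets) -> str:
    """End-to-end flow: text -> identity-preserving speech -> talking video."""
    tts = ZeroShotTTS()
    speech_path = tts.synthesize(text, assets.reference_audio)

    head = NeuralTalkingHead()
    head.fit(assets.training_video)
    return head.render(driving_audio=speech_path)
```

This split mirrors the paper's two challenges: timbre mimicry is handled entirely in stage 1, so stage 2 only needs to lip-sync the portrait to whatever audio it receives.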
Related papers
- Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion [5.483488375189695]
Face-based Voice Conversion (FVC) is a novel task that leverages facial images to generate the target speaker's voice style.
Previous work has two shortcomings: (1) difficulty in obtaining facial embeddings that are well aligned with the speaker's voice identity information, and (2) inadequate decoupling of content and speaker identity information from the audio input.
We present a novel FVC method, Identity-Disentanglement Face-based Voice Conversion (ID-FaceVC), which overcomes the above two limitations.
arXiv Detail & Related papers (2024-09-01T11:51:18Z)
- Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism [26.180371869137257]
The state of the art in talking face generation focuses mainly on lip-syncing conditioned on audio clips.
NEUral Text to ARticulate Talk (NEUTART) is a talking face generator that uses a joint audiovisual feature space.
The model produces photorealistic talking face videos with human-like articulation and well-synchronized audiovisual streams.
arXiv Detail & Related papers (2023-12-11T18:41:55Z)
- TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models [51.529485094900934]
We propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
arXiv Detail & Related papers (2023-08-28T09:06:32Z)
- Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596]
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
arXiv Detail & Related papers (2023-06-28T08:22:53Z)
- Align, Adapt and Inject: Sound-guided Unified Image Generation [50.34667929051005]
We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization.
Our method adapts the input sound into a sound token that, like an ordinary word, can be used plug-and-play with existing Text-to-Image (T2I) models.
Our proposed AAI outperforms other text- and sound-guided state-of-the-art methods.
arXiv Detail & Related papers (2023-06-20T12:50:49Z)
- DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation [54.84137342837465]
Face-to-face conversations account for the vast majority of daily conversations.
Most existing methods focus on single-person talking head generation.
We propose a novel unified framework based on a neural radiance field (NeRF).
arXiv Detail & Related papers (2022-03-15T14:16:49Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- AudioVisual Speech Synthesis: A brief literature review [4.148192541851448]
We study the problem of audiovisual speech synthesis, i.e., generating an animated talking head given text as input.
For TTS, we present models that are used to map text to intermediate acoustic representations.
For the talking-head animation problem, we categorize approaches based on whether they produce human faces or anthropomorphic figures.
arXiv Detail & Related papers (2021-02-18T19:13:48Z)
- Robust One Shot Audio to Video Generation [10.957973845883162]
OneShotA2V is a novel approach to synthesizing a talking-person video of arbitrary length using as input an audio signal and a single unseen image of the person.
OneShotA2V leverages curriculum learning to learn movements of expressive facial components and hence generates a high-quality talking-head video of the given person.
arXiv Detail & Related papers (2020-12-14T10:50:05Z)