Neural Text to Articulate Talk: Deep Text to Audiovisual Speech
Synthesis achieving both Auditory and Photo-realism
- URL: http://arxiv.org/abs/2312.06613v1
- Date: Mon, 11 Dec 2023 18:41:55 GMT
- Title: Neural Text to Articulate Talk: Deep Text to Audiovisual Speech
Synthesis achieving both Auditory and Photo-realism
- Authors: Georgios Milis, Panagiotis P. Filntisis, Anastasios Roussos, Petros
Maragos
- Abstract summary: The state of the art in talking face generation focuses mainly on lip-syncing conditioned on audio clips.
NEUral Text to ARticulate Talk (NEUTART) is a talking face generator that uses a joint audiovisual feature space.
The model produces photorealistic talking face videos with human-like articulation and well-synced audiovisual streams.
- Score: 26.180371869137257
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in deep learning for sequential data have given rise to fast
and powerful models that produce realistic videos of talking humans. The state
of the art in talking face generation focuses mainly on lip-syncing, being
conditioned on audio clips. However, having the ability to synthesize talking
humans from text transcriptions rather than audio is particularly beneficial
for many applications and is expected to receive more and more attention,
following the recent breakthroughs in large language models. To that end, most
methods implement a cascaded two-stage architecture: a text-to-speech module
followed by an audio-driven talking face generator. This, however, ignores the
highly complex interplay between the audio and visual streams that occurs
during speaking.
In this paper, we propose the first, to the best of our knowledge, text-driven
audiovisual speech synthesizer that uses Transformers and does not follow a
cascaded approach. Our method, which we call NEUral Text to ARticulate Talk
(NEUTART), is a talking face generator that uses a joint audiovisual feature
space, as well as speech-informed 3D facial reconstructions and a lip-reading
loss for visual supervision. The proposed model produces photorealistic talking
face videos with human-like articulation and well-synced audiovisual streams.
Our experiments on audiovisual datasets as well as in-the-wild videos reveal
state-of-the-art generation quality both in terms of objective metrics and
human evaluation.
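To make the non-cascaded design concrete, the sketch below shows one way a text-driven synthesizer can decode both modalities from a single joint feature space and supervise the visual stream with a lip-reading loss. It is a minimal illustration under assumed module names, dimensions, and losses, not the published NEUTART implementation; duration modelling, the photorealistic renderer, and the vocoder are omitted.

```python
# Minimal sketch of a non-cascaded text-to-audiovisual synthesizer.
# Module names, dimensions, and the lip-reading setup are illustrative
# assumptions, not the published NEUTART code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToAudiovisual(nn.Module):
    def __init__(self, n_phonemes=100, d=256, n_mels=80, n_face=56):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        # One shared Transformer over a joint audiovisual feature space,
        # instead of a cascaded TTS -> audio-driven-face pipeline.
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.audio_head = nn.Linear(d, n_mels)   # mel-spectrogram frames
        self.visual_head = nn.Linear(d, n_face)  # 3D face model parameters

    def forward(self, phoneme_ids):
        # Assumes phonemes are already expanded to the output frame rate
        # (duration prediction is omitted from this sketch).
        h = self.encoder(self.embed(phoneme_ids))       # (B, T, d)
        return self.audio_head(h), self.visual_head(h)  # synced by construction

def training_loss(model, phoneme_ids, mel_gt, face_gt, lipreader):
    """Reconstruction terms plus a lip-reading loss from a frozen lip-reader."""
    mel, face = model(phoneme_ids)
    rec = F.l1_loss(mel, mel_gt) + F.l1_loss(face, face_gt)
    # A frozen lip-reader should extract the same phonetic content from
    # predicted and ground-truth mouth motion (visual supervision).
    lip = F.mse_loss(lipreader(face), lipreader(face_gt))
    return rec + lip
```

Because both output heads read from the same encoder states, lip motion and acoustics are predicted jointly rather than the visual stream being re-estimated from already-synthesized audio, which is exactly the interplay a cascaded pipeline loses.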
Related papers
- JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation [24.2065254076207] (arXiv 2024-09-18)
We introduce a novel method for joint expression and audio-guided talking face generation.
Our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer.
- Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596] (arXiv 2023-06-28)
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
- Visual-Aware Text-to-Speech [101.89332968344102] (arXiv 2023-06-21)
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
- Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis [66.43223397997559] (arXiv 2023-06-06)
We aim to synthesize high-quality talking portrait videos corresponding to the input text.
This task has broad application prospects in the digital human industry but has not yet been satisfactorily solved.
We introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which incorporates a generic zero-shot multi-speaker text-to-speech model.
- Learning to Dub Movies via Hierarchical Prosody Models [167.6465354313349] (arXiv 2022-12-08)
Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker's voice as reference.
We propose a novel movie dubbing architecture based on hierarchical prosody modelling, which bridges the visual information to the corresponding speech prosody at three levels: lip, face, and scene.
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762] (arXiv 2022-03-31)
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals (a minimal sketch of this code-prediction idea follows this list).
- DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation [54.84137342837465] (arXiv 2022-03-15)
Face-to-face conversations account for the vast majority of daily conversations.
Most existing methods focus on single-person talking head generation.
We propose a novel unified framework based on neural radiance fields (NeRF).
- Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation [46.8780140220063] (arXiv 2021-12-04)
We present a joint audio-text model to capture contextual information for expressive speech-driven 3D facial animation.
Our hypothesis is that the text features can disambiguate the variations in upper face expressions, which are not strongly correlated with the audio.
We show that the combined acoustic and textual modalities can synthesize realistic facial expressions while maintaining audio-lip synchronization.
- Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation [28.157431757281692] (arXiv 2021-04-16)
We propose a text-based talking-head video generation framework that synthesizes high-fidelity facial expressions and head motions.
Our framework consists of a speaker-independent stage and a speaker-specific stage.
Our algorithm produces high-quality photo-realistic talking-head videos with diverse facial expressions and head motions that follow the speech rhythm.
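As noted in the Audio-Visual Speech Codecs entry above, enhancement by re-synthesis predicts the discrete codes of a clean speech codec from noisy audio plus lip video and then decodes them. The sketch below renders that idea under assumed names, feature shapes, and fusion strategy; the paper's actual codec and architecture differ.

```python
# Hypothetical sketch of audio-visual speech enhancement by re-synthesis:
# predict clean codec codes from noisy audio + lip features, then decode
# them with a frozen neural speech codec (decoder not shown here).
import torch
import torch.nn as nn

class AVCodePredictor(nn.Module):
    def __init__(self, d_audio=80, d_video=128, d=256, n_codes=1024):
        super().__init__()
        self.fuse = nn.Linear(d_audio + d_video, d)  # early audio-visual fusion
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.code_head = nn.Linear(d, n_codes)  # logits over codebook entries

    def forward(self, noisy_audio_feats, lip_feats):
        # Both streams are assumed frame-aligned at the same rate.
        x = self.fuse(torch.cat([noisy_audio_feats, lip_feats], dim=-1))
        return self.code_head(self.backbone(x))  # (B, T, n_codes)

# Training would use cross-entropy against the codec codes of the clean
# target speech; at inference, argmax codes go to a frozen codec decoder.
```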
This list is automatically generated from the titles and abstracts of the papers on this site.