Talking Face Generation with Multilingual TTS
- URL: http://arxiv.org/abs/2205.06421v1
- Date: Fri, 13 May 2022 02:08:35 GMT
- Title: Talking Face Generation with Multilingual TTS
- Authors: Hyoung-Kyu Song, Sang Hoon Woo, Junhyeok Lee, Seungmin Yang, Hyunjae
Cho, Youseong Lee, Dongho Choi, Kang-wook Kim
- Abstract summary: We propose a system combining a talking face generation system with a text-to-speech system.
Our system can synthesize natural multilingual speech while maintaining the vocal identity of the speaker.
For our demo, we add a translation API to the preprocessing stage and present it in the form of a neural dubber.
- Score: 0.8229645116651871
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose a joint system combining a talking face generation
system with a text-to-speech system that can generate multilingual talking face
videos from only the text input. Our system can synthesize natural multilingual
speech while maintaining the vocal identity of the speaker, as well as lip
movements synchronized to the synthesized speech. We demonstrate the
generalization capabilities of our system by selecting four languages (Korean,
English, Japanese, and Chinese) each from a different language family. We also
compare the outputs of our talking face generation model to outputs of a prior
work that claims multilingual support. For our demo, we add a translation API
to the preprocessing stage and present it in the form of a neural dubber so
that users can utilize the multilingual property of our system more easily.
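As a rough illustration of the pipeline the abstract describes, the sketch below wires a translation step, a speaker-conditioned multilingual TTS step, and a lip-sync step into one neural-dubbing call. Every function and type here is a hypothetical placeholder for illustration, not the authors' actual code or API.

```python
# Minimal sketch of the neural-dubbing flow described above.
# All names (translate, synthesize_speech, generate_talking_face,
# DubbingRequest) are hypothetical stand-ins, not the paper's API.
from dataclasses import dataclass

@dataclass
class DubbingRequest:
    text: str          # input text in any supported language
    target_lang: str   # "ko", "en", "ja", or "zh" per the paper's demo
    face_image: bytes  # reference frame of the target speaker
    speaker_ref: bytes # reference audio that fixes vocal identity

def translate(text: str, target_lang: str) -> str:
    """Preprocessing stage: external translation API (placeholder)."""
    return text  # identity stand-in; a real system would call an MT service

def synthesize_speech(text: str, lang: str, speaker_ref: bytes) -> bytes:
    """Multilingual TTS conditioned on a speaker reference (placeholder)."""
    return b"wav-bytes"

def generate_talking_face(face_image: bytes, speech: bytes) -> bytes:
    """Lip-sync video generation driven by synthesized speech (placeholder)."""
    return b"mp4-bytes"

def dub(req: DubbingRequest) -> bytes:
    translated = translate(req.text, req.target_lang)        # 1. translate
    speech = synthesize_speech(translated, req.target_lang,  # 2. multilingual TTS
                               req.speaker_ref)
    return generate_talking_face(req.face_image, speech)     # 3. lip sync
```

The design point this sketch highlights is that the speaker reference fixes vocal identity independently of the target language, which is what lets one voice "speak" all four demo languages.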
Related papers
- Seamless: Multilingual Expressive and Streaming Speech Translation [71.12826355107889]
We introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion.
First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model, SeamlessM4T v2.
We bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time.
arXiv Detail & Related papers (2023-12-08T17:18:42Z) - Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
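To make the "speech units as pseudo-text" idea concrete, here is a minimal sketch of one common recipe: quantize self-supervised features against a learned codebook, then collapse repeats so the unit sequence can be modeled like text. The feature shapes, codebook size, and nearest-neighbor assignment below are assumptions for illustration, not UTUT's exact recipe.

```python
# Toy version of speech-unit extraction: continuous features are mapped
# to discrete unit IDs, which downstream models treat like text tokens.
import numpy as np

def quantize_to_units(features: np.ndarray, codebook: np.ndarray) -> list:
    """Assign each frame (T, D) to its nearest codebook entry (K, D)."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1).tolist()  # one unit ID per frame

def dedupe(units: list) -> list:
    """Collapse consecutive repeats, a common step before LM training."""
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 16))      # stand-in for SSL features (e.g. HuBERT-style)
codebook = rng.normal(size=(100, 16))  # stand-in for a learned 100-unit codebook
pseudo_text = dedupe(quantize_to_units(feats, codebook))
```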
arXiv Detail & Related papers (2023-08-03T15:47:04Z) - AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
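One way to read "fuses text-based and speech-based language models" is a single shared token space, sketched below: discrete audio units are offset past the text vocabulary so one embedding table and one decoder cover both modalities. The vocabulary sizes here are illustrative guesses, not AudioPaLM's actual configuration.

```python
# Sketch of a shared text+audio vocabulary, an assumed simplification
# of the fusion idea; sizes and offsets are illustrative only.
TEXT_VOCAB = 32_000   # assumed text tokenizer size
AUDIO_VOCAB = 1_024   # assumed number of discrete audio units

def text_id(token_id: int) -> int:
    """Text tokens keep their original IDs."""
    return token_id

def audio_id(unit_id: int) -> int:
    """Audio units are shifted past the text range into the same space."""
    return TEXT_VOCAB + unit_id

# A mixed-modality task can then be one token stream for one decoder:
sequence = [text_id(5), text_id(17)] + [audio_id(u) for u in (3, 99, 3)]
assert max(sequence) < TEXT_VOCAB + AUDIO_VOCAB  # one embedding table covers all
```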
arXiv Detail & Related papers (2023-06-22T14:37:54Z) - PolyVoice: Language Models for Speech to Speech Translation [50.31000706309143]
PolyVoice is a language model-based framework for speech-to-speech translation (S2ST).
We use discretized speech units, which are generated in a fully unsupervised way.
For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model.
arXiv Detail & Related papers (2023-06-05T15:53:15Z) - MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low
Resource Setting [16.37243395952266]
MParrotTTS is a unified multilingual, multi-speaker text-to-speech (TTS) synthesis model.
It adapts to a new language with minimal supervised data and generalizes to languages not seen while training the self-supervised backbone.
We present extensive results on six languages in terms of speech naturalness and speaker similarity in parallel and cross-lingual synthesis.
arXiv Detail & Related papers (2023-05-19T13:43:36Z) - Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec
Language Modeling [92.55131711064935]
We propose a cross-lingual neural language model, VALL-E X, for cross-lingual speech synthesis.
VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks.
It can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment.
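A hedged sketch of how such prompting could look: acoustic codec tokens from one source-language utterance serve as the voice/emotion prompt, and the model continues the sequence conditioned on target-language phonemes. The prompt layout and the `lm_continue` stub are hypothetical stand-ins, not the published VALL-E X interface.

```python
# Illustrative zero-shot cross-lingual TTS prompting; all names and the
# prompt layout are assumptions, not the real VALL-E X API.
def lm_continue(prompt: list, max_new: int) -> list:
    """Placeholder for the codec language model's continuation step."""
    return [0] * max_new  # a real model would sample acoustic codes here

def cross_lingual_tts(src_acoustic: list, tgt_phonemes: list) -> list:
    prompt = (["<phon>"] + tgt_phonemes        # target-language content
              + ["<acou>"] + src_acoustic)     # source utterance fixes the voice
    return lm_continue(prompt, max_new=75 * 3) # ~3 s at an assumed 75 codec frames/s
```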
arXiv Detail & Related papers (2023-03-07T14:31:55Z) - ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual
Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
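The masking step lends itself to a toy sketch: random positions in both the spectrogram and the phoneme sequence are masked, and a model would be trained to reconstruct them jointly. The mask ratios and shapes below are illustrative assumptions, not the paper's settings.

```python
# Toy joint masking of spectrogram frames and phoneme tokens; ratios,
# shapes, and the zero-fill choice are assumptions for illustration.
import numpy as np

def mask_spans(length: int, ratio: float, rng: np.random.Generator) -> np.ndarray:
    """Boolean mask selecting roughly `ratio` of positions."""
    return rng.random(length) < ratio

rng = np.random.default_rng(0)
spec = rng.normal(size=(120, 80))  # (frames, mel bins) stand-in spectrogram
phonemes = list("talkingface")     # stand-in phoneme sequence

spec_mask = mask_spans(spec.shape[0], 0.3, rng)
phon_mask = mask_spans(len(phonemes), 0.15, rng)

spec[spec_mask] = 0.0                                 # zero out masked frames
masked_phon = ["<mask>" if m else p
               for p, m in zip(phonemes, phon_mask)]  # mask phoneme tokens
```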
arXiv Detail & Related papers (2022-11-07T13:35:16Z) - Towards Automatic Face-to-Face Translation [30.841020484914527]
"Face-to-Face Translation" can translate a video of a person speaking in language A into a target language B with realistic lip synchronization.
We build a working speech-to-speech translation system by bringing together multiple existing modules from speech and language.
We then move towards "Face-to-Face Translation" by incorporating a novel visual module, LipGAN, for generating realistic talking faces from the translated audio.
arXiv Detail & Related papers (2020-03-01T06:42:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.