Towards Automatic Face-to-Face Translation
- URL: http://arxiv.org/abs/2003.00418v1
- Date: Sun, 1 Mar 2020 06:42:43 GMT
- Title: Towards Automatic Face-to-Face Translation
- Authors: Prajwal K R, Rudrabha Mukhopadhyay, Jerin Philip, Abhishek Jha, Vinay
Namboodiri, C.V. Jawahar
- Abstract summary: "Face-to-Face Translation" can translate a video of a person speaking in language A into a target language B with realistic lip synchronization.
We build a working speech-to-speech translation system by bringing together multiple existing modules from speech and language.
We then move towards "Face-to-Face Translation" by incorporating a novel visual module, LipGAN, for generating realistic talking faces from the translated audio.
- Score: 30.841020484914527
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In light of the recent breakthroughs in automatic machine translation
systems, we propose a novel approach that we term as "Face-to-Face
Translation". As today's digital communication becomes increasingly visual, we
argue that there is a need for systems that can automatically translate a video
of a person speaking in language A into a target language B with realistic lip
synchronization. In this work, we create an automatic pipeline for this problem
and demonstrate its impact on multiple real-world applications. First, we build
a working speech-to-speech translation system by bringing together multiple
existing modules from speech and language. We then move towards "Face-to-Face
Translation" by incorporating a novel visual module, LipGAN for generating
realistic talking faces from the translated audio. Quantitative evaluation of
LipGAN on the standard LRW test set shows that it significantly outperforms
existing approaches across all standard metrics. We also subject our
Face-to-Face Translation pipeline to multiple human evaluations and show that
it can significantly improve the overall user experience for consuming and
interacting with multimodal content across languages. Code, models and demo
video are made publicly available.
Demo video: https://www.youtube.com/watch?v=aHG6Oei8jF0
Code and models: https://github.com/Rudrabha/LipGAN
Related papers
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained
3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z) - Seamless: Multilingual Expressive and Streaming Speech Translation [71.12826355107889]
We introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion.
First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model, SeamlessM4T v2.
We bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time.
arXiv Detail & Related papers (2023-12-08T17:18:42Z) - ChatAnything: Facetime Chat with LLM-Enhanced Personas [87.76804680223003]
We propose the mixture of voices (MoV) and the mixture of diffusers (MoD) for diverse voice and appearance generation.
For MoV, we utilize text-to-speech (TTS) algorithms with a variety of pre-defined tones.
For MoD, we combine recent popular text-to-image generation techniques with talking-head algorithms to streamline the process of generating talking objects.
arXiv Detail & Related papers (2023-11-12T08:29:41Z) - TRAVID: An End-to-End Video Translation Framework [1.6131714685439382]
We present an end-to-end video translation system that not only translates spoken language but also synchronizes the translated speech with the lip movements of the speaker.
Our system focuses on translating educational lectures in various Indian languages, and it is designed to be effective even in low-resource settings.
arXiv Detail & Related papers (2023-09-20T14:13:05Z) - SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z) - Emotionally Enhanced Talking Face Generation [52.07451348895041]
We build a talking face generation framework conditioned on a categorical emotion to generate videos with appropriate expressions.
We show that our model can adapt to arbitrary identities, emotions, and languages.
Our proposed framework is equipped with a user-friendly web interface with a real-time experience for talking face generation with emotions.
arXiv Detail & Related papers (2023-03-21T02:33:27Z) - Talking Face Generation with Multilingual TTS [0.8229645116651871]
We propose a system combining a talking face generation system with a text-to-speech system.
Our system can synthesize natural multilingual speech while maintaining the vocal identity of the speaker.
For our demo, we add a translation API to the preprocessing stage and present it in the form of a neural dubber.
arXiv Detail & Related papers (2022-05-13T02:08:35Z) - MeetDot: Videoconferencing with Live Translation Captions [18.60812558978417]
We present MeetDot, a videoconferencing system with live translation captions overlaid on screen.
Our system supports speech and captions in 4 languages and combines automatic speech recognition (ASR) and machine translation (MT) in a cascade.
We implement several features to enhance the user experience and reduce cognitive load, such as smoothly scrolling captions and reduced caption flicker.
arXiv Detail & Related papers (2021-09-20T14:34:14Z) - Vokenization: Improving Language Understanding with Contextualized,
Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)