TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head
Translation
- URL: http://arxiv.org/abs/2312.15197v1
- Date: Sat, 23 Dec 2023 08:45:57 GMT
- Authors: Xize Cheng, Rongjie Huang, Linjun Li, Tao Jin, Zehan Wang, Aoxiong
Yin, Minglei Li, Xinyu Duan, Changpeng Yang, Zhou Zhao
- Abstract summary: Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning.
Existing methods invariably rely on cascading, synthesizing via both audio and text, resulting in delays and cascading errors.
We propose a model for talking head translation, TransFace, which can directly translate audio-visual speech into audio-visual speech in other languages.
- Score: 54.155138561698514
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Direct speech-to-speech translation achieves high-quality results through the
introduction of discrete units obtained from self-supervised learning. This
approach circumvents delays and cascading errors associated with model
cascading. However, talking head translation, converting audio-visual speech
(i.e., talking head video) from one language into another, still confronts
several challenges compared to audio speech: (1) Existing methods invariably
rely on cascading, synthesizing via both audio and text, resulting in delays
and cascading errors. (2) Talking head translation has a limited set of
reference frames. If the generated translation exceeds the length of the
original speech, the video sequence needs to be supplemented by repeating
frames, leading to jarring video transitions. In this work, we propose a model
for talking head translation, \textbf{TransFace}, which can directly translate
audio-visual speech into audio-visual speech in other languages. It consists of
a speech-to-unit translation model to convert audio speech into discrete units
and a unit-based audio-visual speech synthesizer, Unit2Lip, to re-synthesize
synchronized audio-visual speech from discrete units in parallel. Furthermore,
we introduce a Bounded Duration Predictor, ensuring isometric talking head
translation and preventing duplicate reference frames. Experiments demonstrate
that our proposed Unit2Lip model significantly improves synchronization (1.601
and 0.982 on LSE-C for the original and generated audio speech, respectively)
and boosts inference speed by a factor of 4.35 on LRS2. Additionally, TransFace
achieves impressive BLEU scores of 61.93 and 47.55 for Es-En and Fr-En on
LRS3-T and 100% isochronous translations.
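The Bounded Duration Predictor described above keeps the translation isometric: per-unit durations are bounded so the synthesized video never needs more frames than the source clip provides, which avoids padding with repeated reference frames. The paper does not give implementation details, so the following is only a minimal hypothetical sketch of that frame-budget idea; the function name `bounded_durations` and the proportional-rescaling strategy are assumptions for illustration, not the authors' method.

```python
import numpy as np

def bounded_durations(raw_durations, max_total_frames):
    """Clamp per-unit frame durations so their sum never exceeds the
    available reference frames (hypothetical sketch of the frame-budget
    idea behind a bounded duration predictor).

    Assumes len(raw_durations) <= max_total_frames, so every unit can
    keep at least one frame.
    """
    # Each discrete unit must be rendered for at least one video frame.
    d = np.maximum(np.round(raw_durations).astype(int), 1)
    if d.sum() <= max_total_frames:
        return d
    # Over budget: shrink durations proportionally, keeping the minimum of 1.
    scaled = np.maximum((d * max_total_frames / d.sum()).astype(int), 1)
    # Fix any remaining rounding drift by trimming the longest units.
    while scaled.sum() > max_total_frames:
        scaled[np.argmax(scaled)] -= 1
    return scaled
```

With this constraint the total video length is fixed by the source, so a longer translation compresses its per-unit durations instead of appending duplicated frames at the end.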
Related papers
- A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X)
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z)
- Translatotron 3: Speech to Speech Translation with Monolingual Data [23.376969078371282]
Translatotron 3 is a novel approach to unsupervised direct speech-to-speech translation from monolingual speech-text datasets.
Results show that Translatotron 3 outperforms a baseline cascade system.
arXiv Detail & Related papers (2023-05-27T18:30:54Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech significantly improves inference latency, achieving a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)