Translatotron 2: Robust direct speech-to-speech translation
- URL: http://arxiv.org/abs/2107.08661v1
- Date: Mon, 19 Jul 2021 07:43:49 GMT
- Title: Translatotron 2: Robust direct speech-to-speech translation
- Authors: Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, Roi Pomerantz
- Abstract summary: We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end.
Translatotron 2 outperforms the original Translatotron by a large margin in terms of translation quality and predicted speech naturalness.
We propose a new method for retaining the source speaker's voice in the translated speech.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Translatotron 2, a neural direct speech-to-speech translation
model that can be trained end-to-end. Translatotron 2 consists of a speech
encoder, a phoneme decoder, a mel-spectrogram synthesizer, and an attention
module that connects all the previous three components. Experimental results
suggest that Translatotron 2 outperforms the original Translatotron by a large
margin in terms of translation quality and predicted speech naturalness, and
drastically improves the robustness of the predicted speech by mitigating
over-generation, such as babbling or long pauses. We also propose a new method
for retaining the source speaker's voice in the translated speech. The trained model is restricted to retaining the source speaker's voice and, unlike the original Translatotron, cannot generate speech in a different speaker's voice. This makes the model more robust for production deployment by mitigating potential misuse for creating spoofed audio artifacts. When the new
method is used together with a simple concatenation-based data augmentation,
the trained Translatotron 2 model is able to retain each speaker's voice for
input with speaker turns.
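
The abstract names four components but gives no implementation details, so the following is only a minimal sketch of how a speech encoder, a shared attention module, a phoneme decoder, and a mel-spectrogram synthesizer might be wired together. The class name, layer types, and dimensions are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn


class Translatotron2Sketch(nn.Module):
    """Toy wiring of the four components named in the abstract:
    speech encoder -> shared attention -> phoneme decoder -> mel synthesizer.
    Layer choices and sizes are assumptions, not the published model."""

    def __init__(self, n_mels: int = 80, d_model: int = 512, n_phonemes: int = 100):
        super().__init__()
        # Speech encoder: turns source mel-spectrogram frames into hidden states.
        self.speech_encoder = nn.LSTM(n_mels, d_model, num_layers=2,
                                      batch_first=True, bidirectional=True)
        self.enc_proj = nn.Linear(2 * d_model, d_model)
        # Single attention module connecting the encoder to both decoding stages.
        self.attention = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        # Phoneme decoder: predicts target-language phonemes (linguistic content).
        self.phoneme_decoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.phoneme_out = nn.Linear(d_model, n_phonemes)
        # Mel-spectrogram synthesizer: generates target speech conditioned on the
        # phoneme decoder states plus the attention context.
        self.synthesizer = nn.LSTM(2 * d_model, d_model, batch_first=True)
        self.mel_out = nn.Linear(d_model, n_mels)

    def forward(self, src_mels: torch.Tensor, tgt_len: int):
        enc, _ = self.speech_encoder(src_mels)            # (B, T_src, 2*d_model)
        enc = self.enc_proj(enc)                          # (B, T_src, d_model)
        # Autoregressive decoding is omitted; fixed-length zero queries stand in
        # for the decoder time axis to keep the sketch short.
        queries = torch.zeros(src_mels.size(0), tgt_len, enc.size(-1),
                              device=src_mels.device)
        context, _ = self.attention(queries, enc, enc)    # attend over source speech
        dec, _ = self.phoneme_decoder(context)
        phoneme_logits = self.phoneme_out(dec)            # auxiliary phoneme targets
        syn, _ = self.synthesizer(torch.cat([dec, context], dim=-1))
        mels = self.mel_out(syn)                          # predicted target spectrogram
        return phoneme_logits, mels


# Example: a 300-frame source utterance translated to 250 target frames.
model = Translatotron2Sketch()
phonemes, mels = model(torch.randn(1, 300, 80), tgt_len=250)
print(phonemes.shape, mels.shape)  # torch.Size([1, 250, 100]) torch.Size([1, 250, 80])
```

A real system would decode autoregressively and convert the predicted mel-spectrogram to a waveform with a separately trained vocoder; the sketch only shows how the phoneme decoding and spectrogram synthesis can share one attention module over the encoded source speech.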
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation [54.155138561698514]
Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning.
Existing methods invariably rely on cascading, synthesizing via both audio and text, resulting in delays and cascading errors.
We propose a model for talking head translation, TransFace, which can directly translate audio-visual speech into audio-visual speech in other languages.
arXiv Detail & Related papers (2023-12-23T08:45:57Z)
- Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer [53.72998363956454]
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy.
The scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation.
We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and timbre units.
arXiv Detail & Related papers (2023-09-14T09:52:08Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
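
The summary above does not spell out how the speech units are obtained. One common recipe for such "pseudo-text" is to cluster frame-level features from a self-supervised speech encoder with k-means and collapse consecutive repeats; the sketch below illustrates that idea. The random placeholder features, cluster count, and function name are assumptions, not UTUT's actual pipeline.

```python
import numpy as np
from itertools import groupby
from sklearn.cluster import KMeans


def speech_to_units(features: np.ndarray, kmeans: KMeans) -> list[int]:
    """Map frame-level speech features of shape (T, D) to a unit sequence.
    Consecutive duplicate cluster ids are collapsed so the result behaves
    more like text tokens than like fixed-rate frames."""
    frame_units = kmeans.predict(features)           # one cluster id per frame
    return [int(u) for u, _ in groupby(frame_units)]


rng = np.random.default_rng(0)
# Random placeholder features; in practice these would come from a
# self-supervised speech encoder run over a training corpus.
train_features = rng.normal(size=(5000, 256)).astype(np.float32)
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(train_features)

utterance = rng.normal(size=(120, 256)).astype(np.float32)
units = speech_to_units(utterance, kmeans)
print(units[:20])  # pseudo-text token ids a unit-to-unit translation model could consume
```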
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- Translatotron 3: Speech to Speech Translation with Monolingual Data [23.376969078371282]
Translatotron 3 is a novel approach to unsupervised direct speech-to-speech translation from monolingual speech-text datasets.
Results show that Translatotron 3 outperforms a baseline cascade system.
arXiv Detail & Related papers (2023-05-27T18:30:54Z)
- Textless Direct Speech-to-Speech Translation with Discrete Speech Representation [27.182170555234226]
We propose a novel model, Textless Translatotron, for training an end-to-end direct S2ST model without any textual supervision.
When a speech encoder pre-trained with unsupervised speech data is used for both models, the proposed model obtains translation quality nearly on par with Translatotron 2.
arXiv Detail & Related papers (2022-10-31T19:48:38Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
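
As a rough illustration of the mask-and-predict idea mentioned above, the sketch below decodes a unit sequence non-autoregressively by repeatedly re-masking and re-predicting the lowest-confidence positions. The model interface, mask token, and re-masking schedule are assumptions, not TranSpeech's exact procedure.

```python
import torch


def mask_predict_decode(model, tgt_len: int, mask_id: int, n_iters: int = 4):
    """Non-autoregressive decoding by iterative masking and re-prediction.
    `model(tokens)` is assumed to return per-position logits of shape
    (tgt_len, vocab_size); everything here is an illustrative stand-in."""
    tokens = torch.full((tgt_len,), mask_id, dtype=torch.long)  # start fully masked
    confidences = torch.zeros(tgt_len)

    for it in range(n_iters):
        probs = model(tokens).softmax(dim=-1)
        new_conf, new_tokens = probs.max(dim=-1)
        masked = tokens.eq(mask_id)
        # Only currently masked positions are overwritten; kept ones stay fixed.
        tokens = torch.where(masked, new_tokens, tokens)
        confidences = torch.where(masked, new_conf, confidences)

        if it == n_iters - 1:
            break
        # Re-mask a linearly shrinking number of the least confident positions.
        n_mask = int(tgt_len * (n_iters - 1 - it) / n_iters)
        if n_mask > 0:
            tokens[confidences.topk(n_mask, largest=False).indices] = mask_id
    return tokens


# Dummy "model" producing random logits, just to exercise the decoding loop.
dummy = lambda toks: torch.randn(toks.numel(), 1000)
units = mask_predict_decode(dummy, tgt_len=30, mask_id=1000)
print(units.shape)  # torch.Size([30])
```

Because all positions are predicted in parallel and the number of refinement passes is small and fixed, this style of decoding is where the reported latency gains over autoregressive generation come from.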
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)