Translatotron 3: Speech to Speech Translation with Monolingual Data
- URL: http://arxiv.org/abs/2305.17547v3
- Date: Tue, 16 Jan 2024 08:27:38 GMT
- Title: Translatotron 3: Speech to Speech Translation with Monolingual Data
- Authors: Eliya Nachmani, Alon Levkovitch, Yifan Ding, Chulayuth Asawaroengchai,
Heiga Zen, Michelle Tadmor Ramanovich
- Abstract summary: Translatotron 3 is a novel approach to unsupervised direct speech-to-speech translation from monolingual speech-text datasets.
Results show that Translatotron 3 outperforms a baseline cascade system.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents Translatotron 3, a novel approach to unsupervised direct
speech-to-speech translation from monolingual speech-text datasets by combining
masked autoencoder, unsupervised embedding mapping, and back-translation.
Experimental results in speech-to-speech translation tasks between Spanish and
English show that Translatotron 3 outperforms a baseline cascade system,
reporting an 18.14 BLEU-point improvement on the synthesized
Unpaired-Conversational dataset. In contrast to supervised approaches, which
require real paired data or specialized modeling to replicate
para-/non-linguistic information such as pauses, speaking rates, and speaker
identity, Translatotron 3 retains this information naturally. Audio samples
can be found at http://google-research.github.io/lingvo-lab/translatotron3
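The abstract's "unsupervised embedding mapping" component aligns monolingual source- and target-language embedding spaces without any paired examples. As a much-simplified illustration of that alignment idea only (not the paper's actual method, which learns the mapping over speech-text embeddings with no ground-truth correspondence), the sketch below recovers an orthogonal map between two synthetic embedding spaces via the closed-form orthogonal Procrustes solution; all names and dimensions here are illustrative assumptions:

```python
import numpy as np

# Toy sketch of embedding-space alignment (hypothetical simplification:
# we solve an orthogonal Procrustes problem on synthetic embeddings,
# whereas the real system learns the mapping fully unsupervised).
rng = np.random.default_rng(0)

d, n = 8, 100
X = rng.normal(size=(n, d))            # stand-in "source language" embeddings

# A random rotation Q relates the two spaces; it is unknown to the solver.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Y = X @ Q                              # stand-in "target language" embeddings

# Orthogonal Procrustes: W = argmin ||X W - Y||_F  s.t.  W^T W = I,
# solved in closed form from the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

err = np.linalg.norm(X @ W - Y)
print(round(err, 6))                   # prints 0.0 (mapping recovered exactly)
```

In the back-translation stage described in the abstract, such a mapping would let pseudo-parallel pairs be generated from monolingual data alone, with the model then trained to invert its own translations.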
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z) - TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head
Translation [54.155138561698514]
Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning.
Existing methods invariably rely on cascading, synthesizing via both audio and text, resulting in delays and cascading errors.
We propose TransFace, a model for talking head translation that directly translates audio-visual speech into audio-visual speech in other languages.
arXiv Detail & Related papers (2023-12-23T08:45:57Z) - Joint Pre-Training with Speech and Bilingual Text for Direct Speech to
Speech Translation [94.80029087828888]
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST.
Direct S2ST suffers from data scarcity because parallel corpora pairing source-language speech with target-language speech are very rare.
We propose in this paper a Speech2S model, which is jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks.
arXiv Detail & Related papers (2022-10-31T02:55:51Z) - Enhanced Direct Speech-to-Speech Translation Using Self-supervised
Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z) - Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z) - Translatotron 2: Robust direct speech-to-speech translation [6.3470332633611015]
We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end.
Translatotron 2 outperforms the original Translatotron by a large margin in terms of translation quality and predicted speech naturalness.
We propose a new method for retaining the source speaker's voice in the translated speech.
arXiv Detail & Related papers (2021-07-19T07:43:49Z) - Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integrated approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.