Leveraging unsupervised and weakly-supervised data to improve direct
speech-to-speech translation
- URL: http://arxiv.org/abs/2203.13339v1
- Date: Thu, 24 Mar 2022 21:06:15 GMT
- Authors: Ye Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang, Alexis
Conneau, Nobuyuki Morioka
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end speech-to-speech translation (S2ST) without relying on
intermediate text representations is a rapidly emerging frontier of research.
Recent works have demonstrated that the performance of such direct S2ST systems
is approaching that of conventional cascade S2ST when trained on comparable
datasets. However, in practice, the performance of direct S2ST is bounded by
the availability of paired S2ST training data. In this work, we explore
multiple approaches for leveraging much more widely available unsupervised and
weakly-supervised speech and text data to improve the performance of direct
S2ST based on Translatotron 2. With our most effective approaches, the average
translation quality of direct S2ST on 21 language pairs on the CVSS-C corpus is
improved by +13.6 BLEU (or +113% relatively), as compared to the previous
state-of-the-art trained without additional data. The improvements on
low-resource languages are even more significant (+398% relatively on average).
Our comparative studies suggest future research directions for S2ST and speech
representation learning.
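The reported gains can be cross-checked: a +13.6 BLEU absolute improvement that is also +113% relative implies a particular baseline score. A minimal sketch of that arithmetic (the baseline value is inferred from the two reported figures, not stated in the abstract):

```python
# Sanity-check the reported gains: +13.6 BLEU absolute, +113% relative.
# The implied baseline follows from absolute_gain / relative_gain.
abs_gain = 13.6  # reported absolute BLEU improvement
rel_gain = 1.13  # reported relative improvement (+113%)

baseline = abs_gain / rel_gain   # implied prior state-of-the-art average
improved = baseline + abs_gain   # implied average after the proposed approaches

print(f"implied baseline: {baseline:.1f} BLEU")  # ~12.0
print(f"implied improved: {improved:.1f} BLEU")  # ~25.6
```

This puts the previous 21-language average around 12 BLEU and the improved system around 25.6 BLEU, consistent with the "+113% relatively" framing.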
Related papers
- Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? [49.42189569058647]
Two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) synthesis.
In this paper, we introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model.
We also propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data.
arXiv Detail & Related papers (2024-06-11T14:17:12Z)
- Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases [79.07111754406841]
This work proposes using contrastive evaluation to measure the ability of direct S2TT systems to disambiguate utterances where prosody plays a crucial role.
Our results clearly demonstrate the value of direct translation systems over cascade translation models.
arXiv Detail & Related papers (2024-02-01T14:46:35Z)
- Enhancing Speech-to-Speech Translation with Multiple TTS Targets [62.18395387305803]
We analyze the effect of changing synthesized target speech for direct S2ST models.
We propose a multi-task framework that jointly optimizes the S2ST system with multiple targets from different TTS systems.
arXiv Detail & Related papers (2023-04-10T14:33:33Z)
- Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation [94.80029087828888]
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST.
Direct S2ST suffers from data scarcity because parallel corpora pairing source-language speech with target-language speech are very rare.
In this paper, we propose Speech2S, a model jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks.
arXiv Detail & Related papers (2022-10-31T02:55:51Z)
- Improving Speech-to-Speech Translation Through Unlabeled Text [39.28273721043411]
Direct speech-to-speech translation (S2ST) is among the most challenging problems in the translation paradigm.
We propose an effective way to utilize the massive existing unlabeled text from different languages to create a large amount of S2ST data.
arXiv Detail & Related papers (2022-10-26T06:52:19Z)
- Simple and Effective Unsupervised Speech Translation [68.25022245914363]
We study a simple and effective approach to build speech translation systems without labeled data.
We present an unsupervised domain adaptation technique for pre-trained speech models.
Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art.
arXiv Detail & Related papers (2022-10-18T22:26:13Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.