Enhanced Direct Speech-to-Speech Translation Using Self-supervised
Pre-training and Data Augmentation
- URL: http://arxiv.org/abs/2204.02967v1
- Date: Wed, 6 Apr 2022 17:59:22 GMT
- Title: Enhanced Direct Speech-to-Speech Translation Using Self-supervised
Pre-training and Data Augmentation
- Authors: Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino, Yossi Adi,
Jiatao Gu, Wei-Ning Hsu, Ann Lee
- Abstract summary: Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
- Score: 76.13334392868208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Direct speech-to-speech translation (S2ST) models suffer from data scarcity
issues as there exists little parallel S2ST data, compared to the amount of
data available for conventional cascaded systems that consist of automatic
speech recognition (ASR), machine translation (MT), and text-to-speech (TTS)
synthesis. In this work, we explore self-supervised pre-training with unlabeled
speech data and data augmentation to tackle this issue. We take advantage of a
recently proposed speech-to-unit translation (S2UT) framework that encodes
target speech into discrete representations, and transfer pre-training and
efficient partial finetuning techniques that work well for speech-to-text
translation (S2T) to the S2UT domain by studying both speech encoder and
discrete unit decoder pre-training. Our experiments show that self-supervised
pre-training consistently improves model performance compared with multitask
learning with a BLEU gain of 4.3-12.0 under various data setups, and it can be
further combined with data augmentation techniques that apply MT to create
weakly supervised training data. Audio samples are available at:
https://facebookresearch.github.io/speech_translation/enhanced_direct_s2st_units/index.html .
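The speech-to-unit idea in the abstract can be sketched in a few lines. Below is a minimal, hypothetical illustration of turning frame-level speech features into a discrete unit sequence by nearest-centroid quantization (HuBERT-style unit extraction commonly uses k-means over self-supervised features); the random features and codebook are toy stand-ins for a real encoder's output, not the paper's actual pipeline:

```python
import numpy as np

def extract_discrete_units(features, codebook):
    """Assign each frame to the index of its nearest codebook centroid,
    yielding a sequence of discrete units (a k-means-style quantization)."""
    # features: (T, D) frame-level vectors; codebook: (K, D) centroids
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)  # (T,) unit ids

def deduplicate(units):
    """Collapse consecutive repeated units, a common reduction step
    before training a discrete unit decoder."""
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(100, 16))  # 100 hypothetical units, 16-dim features
features = rng.normal(size=(50, 16))   # 50 frames of toy "encoder output"
units = extract_discrete_units(features, codebook)
reduced = deduplicate(units)
print(len(units), len(reduced))
```

A unit decoder is then trained to predict such sequences, which a separately trained vocoder converts back to waveform.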
Related papers
- Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis [30.97784092953007]
This paper investigates the use of unsupervised text-to-speech synthesis (TTS) as a data augmentation method to improve accented speech recognition.
TTS systems are trained with a small amount of accented speech training data and their pseudo-labels rather than manual transcriptions.
This approach enables the use of accented speech data without manual transcriptions to perform data augmentation for accented speech recognition.
arXiv Detail & Related papers (2024-07-04T16:42:24Z)
- Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? [49.42189569058647]
Two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) followed by text-to-speech (TTS) synthesis.
In this paper, we introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model.
We also propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data.
arXiv Detail & Related papers (2024-06-11T14:17:12Z)
- Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation [94.80029087828888]
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST.
Direct S2ST suffers from data scarcity because parallel corpora pairing source-language speech with target-language speech are very rare.
In this paper, we propose Speech2S, a model jointly pre-trained on unpaired speech and bilingual text data for direct speech-to-speech translation.
arXiv Detail & Related papers (2022-10-31T02:55:51Z)
- Simple and Effective Unsupervised Speech Translation [68.25022245914363]
We study a simple and effective approach to build speech translation systems without labeled data.
We present an unsupervised domain adaptation technique for pre-trained speech models.
Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art.
arXiv Detail & Related papers (2022-10-18T22:26:13Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, with speedups of up to 21.4x over the autoregressive technique.
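The mask-and-predict decoding described in this blurb can be sketched as follows. This is a toy illustration of the general mask-predict scheme for non-autoregressive generation, not TranSpeech's actual architecture; the scoring model here is a hypothetical random stand-in for a trained unit predictor:

```python
import numpy as np

def mask_predict_decode(score_fn, length, iterations=4, seed=0):
    """Iterative non-autoregressive decoding: start fully masked,
    predict every masked position, then each round re-mask the
    least-confident positions and predict them again."""
    rng = np.random.default_rng(seed)
    units = np.zeros(length, dtype=int)
    conf = np.zeros(length)
    masked = np.ones(length, dtype=bool)  # all positions masked initially
    for t in range(iterations):
        probs = score_fn(units, masked, rng)  # (length, vocab) probabilities
        units[masked] = probs.argmax(axis=1)[masked]
        conf[masked] = probs.max(axis=1)[masked]
        # linearly shrink the number of re-masked positions each round
        n_mask = length * (iterations - t - 1) // iterations
        masked[:] = False
        if n_mask > 0:
            masked[np.argsort(conf)[:n_mask]] = True
    return units

def toy_scorer(units, masked, rng):
    # hypothetical stand-in for a trained predictor over 10 units
    logits = rng.normal(size=(len(units), 10))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

out = mask_predict_decode(toy_scorer, length=20)
print(out)
```

Because all positions are predicted in parallel within each round, latency scales with the fixed number of refinement iterations rather than with sequence length, which is the source of the speedup over autoregressive decoding.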
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language directly into another.
We tackle the challenge of modeling multi-speaker target speech and train the system on real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.