Textless Direct Speech-to-Speech Translation with Discrete Speech Representation
- URL: http://arxiv.org/abs/2211.00115v1
- Date: Mon, 31 Oct 2022 19:48:38 GMT
- Title: Textless Direct Speech-to-Speech Translation with Discrete Speech Representation
- Authors: Xinjian Li, Ye Jia, Chung-Cheng Chiu
- Abstract summary: We propose a novel model, Textless Translatotron, for training an end-to-end direct S2ST model without any textual supervision.
When a speech encoder pre-trained with unsupervised speech data is used for both models, the proposed model obtains translation quality nearly on par with Translatotron 2.
- Score: 27.182170555234226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Research on speech-to-speech translation (S2ST) has progressed rapidly in
recent years. Many end-to-end systems have been proposed and show advantages
over conventional cascade systems, which are often composed of recognition,
translation and synthesis sub-systems. However, most end-to-end systems
still rely on intermediate textual supervision during training, which makes
them inapplicable to languages without written forms. In this work, we
propose a novel model, Textless Translatotron, which is based on Translatotron
2, for training an end-to-end direct S2ST model without any textual
supervision. Instead of jointly training with an auxiliary task predicting
target phonemes as in Translatotron 2, the proposed model uses an auxiliary
task predicting discrete speech representations which are obtained from learned
or random speech quantizers. When a speech encoder pre-trained with
unsupervised speech data is used for both models, the proposed model obtains
translation quality nearly on par with Translatotron 2 on the multilingual
CVSS-C corpus as well as the bilingual Fisher Spanish-English corpus. On the
latter, it outperforms the prior state-of-the-art textless model by +18.5 BLEU.
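The auxiliary targets need not come from text: even a frozen random projection plus a random codebook suffices to turn continuous encoder features into discrete IDs. Below is a minimal sketch of that idea in the spirit of the "random speech quantizer" mentioned above; it is not the authors' code, and `feature_dim`, `proj_dim`, and `codebook_size` are illustrative assumptions.

```python
import numpy as np

class RandomQuantizer:
    """Frozen random projection + random codebook that turns continuous
    speech features into discrete IDs usable as auxiliary targets.
    Hypothetical sketch; all dimensions and the init are assumptions."""

    def __init__(self, feature_dim=512, proj_dim=16, codebook_size=1024, seed=0):
        rng = np.random.default_rng(seed)
        # Both the projection and the codebook stay fixed (never trained).
        self.proj = rng.normal(size=(feature_dim, proj_dim))
        codebook = rng.normal(size=(codebook_size, proj_dim))
        # L2-normalize codebook entries so nearest-neighbor search is stable.
        self.codebook = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)

    def __call__(self, features):
        """features: (num_frames, feature_dim) -> (num_frames,) int IDs."""
        z = features @ self.proj
        z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)
        # Nearest (cosine) codebook entry per frame = discrete target ID.
        return np.argmax(z @ self.codebook.T, axis=1)

quantizer = RandomQuantizer()
frames = np.random.randn(100, 512)   # stand-in for speech encoder features
target_ids = quantizer(frames)       # auxiliary prediction targets
```

A learned quantizer (e.g., trained with a VQ-VAE-style objective) would replace the frozen random matrices but produce targets of the same discrete form.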
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders that preserve the speaker's voice characteristics and isochrony from the source speech during the translation process (a conceptual sketch follows this entry).
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
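As a conceptual sketch only (the paper's actual architecture is certainly richer), the general pattern of conditioning generation on two separate encoders, one for timbre and one for timing, looks roughly like this; every class, dimension, and layer here is a hypothetical stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, w):                      # toy dense layer
    return np.tanh(x @ w)

class DualConditioningS2ST:
    """Toy pattern: separate voice and isochrony encoders whose outputs
    jointly condition the decoder, so timbre and timing can survive
    translation. Purely illustrative, not the TransVIP architecture."""

    def __init__(self, feat=80, dim=32):
        self.w_voice = rng.normal(size=(feat, dim)) * 0.1
        self.w_iso = rng.normal(size=(feat, dim)) * 0.1

    def encode_voice(self, src):       # utterance-level timbre embedding
        return linear(src, self.w_voice).mean(axis=0)

    def encode_isochrony(self, src):   # frame-level timing/pause features
        return linear(src, self.w_iso)

    def condition(self, src):
        voice = self.encode_voice(src)              # (dim,)
        iso = self.encode_isochrony(src)            # (frames, dim)
        # Broadcast the voice embedding onto every frame of timing features.
        return np.concatenate([iso, np.tile(voice, (len(iso), 1))], axis=1)

model = DualConditioningS2ST()
cond = model.condition(rng.normal(size=(200, 80)))  # (200, 64) conditioning
```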
- Simple and Effective Unsupervised Speech Translation [68.25022245914363]
We study a simple and effective approach to build speech translation systems without labeled data.
We present an unsupervised domain adaptation technique for pre-trained speech models.
Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art.
arXiv Detail & Related papers (2022-10-18T22:26:13Z)
- Revisiting End-to-End Speech-to-Text Translation From Scratch [48.203394370942505]
End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks.
In this paper, we explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved.
arXiv Detail & Related papers (2022-06-09T15:39:19Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique that repeatedly masks and predicts unit choices (a generic mask-predict loop is sketched after this entry).
TranSpeech shows a significant improvement in inference latency, achieving a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
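"Repeatedly masks and predicts" reads like conditional-masked-LM (mask-predict) inference over discrete units. The generic loop is sketched below, with `toy_decoder` standing in for the real non-autoregressive unit decoder; the masking schedule and all names are assumptions, not TranSpeech's actual code.

```python
import numpy as np

MASK = -1  # sentinel ID for masked positions (an assumption of this sketch)

def mask_predict(predict_units, length, num_iters=4):
    """Generic mask-predict loop: start fully masked, then repeatedly
    keep the most confident unit choices and re-predict the rest."""
    units = np.full(length, MASK)
    for it in range(num_iters):
        probs = predict_units(units)            # (length, vocab_size)
        units = probs.argmax(axis=1)
        confidence = probs.max(axis=1)
        # Linear schedule: re-mask fewer positions each iteration.
        num_mask = length * (num_iters - it - 1) // num_iters
        if num_mask == 0:
            break                               # final pass keeps everything
        units[np.argsort(confidence)[:num_mask]] = MASK
    return units

def toy_decoder(units):
    """Stand-in for a real non-autoregressive decoder: returns a random
    probability distribution over 1024 units at each position."""
    rng = np.random.default_rng(abs(int(units.sum())))
    logits = rng.normal(size=(len(units), 1024))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

print(mask_predict(toy_decoder, length=20))
```

Because every position is predicted in parallel and the loop runs a fixed, small number of iterations, inference cost is largely independent of sequence length, which is where the latency gain over autoregressive decoding comes from.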
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
Instead, we propose to predict self-supervised discrete representations learned from an unlabeled speech corpus (see the clustering sketch after this entry).
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
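Self-supervised discrete representations in this line of work are commonly obtained by clustering frame-level features from a pretrained speech encoder (e.g., k-means over HuBERT-style features). The sketch below shows that generic recipe with random arrays standing in for real features; the dimensions and the choice of k are assumptions.

```python
import numpy as np

def kmeans_units(features, k=100, iters=20, seed=0):
    """Minimal Lloyd's k-means over frame-level features, returning one
    discrete unit ID per frame. Real pipelines fit the clusters on a
    large unlabeled corpus of pretrained-encoder features."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    assign = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        # Squared Euclidean distance of every frame to every center.
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = features[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return assign

# 64-dim random frames stand in for real encoder output (often 768-dim).
feats = np.random.randn(500, 64)
units = kmeans_units(feats, k=50)
# Consecutive duplicate units are often collapsed before translation:
dedup = [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
```

The resulting unit sequence plays the role text would play in a conventional pipeline: a compact discrete target the translation model can predict, with no transcripts required.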
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the generated summaries and is not responsible for any consequences arising from their use.