UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units
- URL: http://arxiv.org/abs/2212.08055v2
- Date: Fri, 26 May 2023 16:07:54 GMT
- Title: UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units
- Authors: Hirofumi Inaguma, Sravya Popuri, Ilia Kulikov, Peng-Jen Chen, Changhan
Wang, Yu-An Chung, Yun Tang, Ann Lee, Shinji Watanabe, Juan Pino
- Abstract summary: We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We improve model performance with subword prediction in the first-pass decoder.
We show that the proposed methods boost performance even when the second pass predicts spectrograms.
- Score: 64.61596752343837
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Direct speech-to-speech translation (S2ST), in which all components can be
optimized jointly, is advantageous over cascaded approaches for achieving fast
inference with a simplified pipeline. We present a novel two-pass direct S2ST
architecture, UnitY, which first generates textual representations and then
predicts discrete acoustic units. We improve model performance through subword
prediction in the first-pass decoder, an advanced two-pass decoder architecture
design and search strategy, and better training regularization. To leverage
large amounts of unlabeled text data, we pre-train the first-pass text decoder
on a self-supervised denoising auto-encoding task. Experimental evaluations on
benchmark datasets at various data scales demonstrate that UnitY outperforms a
single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with a 2.83x
decoding speed-up. We show that the proposed methods boost performance even
when the second pass predicts spectrograms; predicting discrete units instead
yields a further 2.51x decoding speed-up over spectrogram prediction.
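To make the two-pass design concrete, here is a minimal sketch, assuming hypothetical module names and toy hyperparameters: a speech encoder feeds a first-pass decoder that predicts target subwords, and the first-pass decoder states, re-encoded by a small bridging encoder, condition a second-pass decoder that predicts discrete acoustic units. Causal masks, beam search, and the unit vocoder are omitted; this is not the authors' implementation.

```python
# Minimal two-pass S2ST sketch (hypothetical, simplified; not the UnitY code).
# Pass 1: speech encoder -> text decoder predicts target subwords.
# Pass 2: a unit decoder attends to re-encoded first-pass decoder states and
# predicts discrete acoustic units (a vocoder would later synthesize waveform).
import torch
import torch.nn as nn

class TwoPassS2ST(nn.Module):
    def __init__(self, d_model=256, nhead=4, n_subwords=8000, n_units=1000):
        super().__init__()
        self.speech_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers=2)
        self.subword_emb = nn.Embedding(n_subwords, d_model)
        self.text_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers=2)
        self.subword_head = nn.Linear(d_model, n_subwords)
        # Small bridging encoder between the two passes.
        self.t2u_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers=1)
        self.unit_emb = nn.Embedding(n_units, d_model)
        self.unit_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers=2)
        self.unit_head = nn.Linear(d_model, n_units)

    def forward(self, speech_feats, subwords_in, units_in):
        # speech_feats: (B, T, d_model) pre-extracted acoustic features.
        enc = self.speech_encoder(speech_feats)
        # Pass 1: subword prediction conditioned on the speech encoding
        # (causal masks omitted for brevity).
        txt_h = self.text_decoder(self.subword_emb(subwords_in), enc)
        # Pass 2: the unit decoder attends to re-encoded pass-1 states.
        unit_h = self.unit_decoder(self.unit_emb(units_in),
                                   self.t2u_encoder(txt_h))
        return self.subword_head(txt_h), self.unit_head(unit_h)

# Smoke test with random features and dummy token sequences.
model = TwoPassS2ST()
sw_logits, unit_logits = model(torch.randn(2, 50, 256),
                               torch.zeros(2, 10, dtype=torch.long),
                               torch.zeros(2, 40, dtype=torch.long))
print(sw_logits.shape, unit_logits.shape)  # (2, 10, 8000) (2, 40, 1000)
```

In the actual two-pass setup, both passes decode autoregressively at inference time (e.g., with beam search); the sketch only shows a teacher-forced forward pass of the kind used during training.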
Related papers
- DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation [36.126810842258706]
Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model.
Due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution.
We propose DASpeech, a non-autoregressive direct S2ST model which realizes both fast and high-quality S2ST.
arXiv Detail & Related papers (2023-10-11T11:39:36Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique that repeatedly masks and predicts unit choices (see the mask-predict sketch after this list).
TranSpeech significantly improves inference latency, achieving up to a 21.4x speedup over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own or can be applied as low-cost second-stage pre-training (see the pseudo-language sketch after this list).
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by a relative 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder [64.55176104620848]
We propose a novel NAR E2E-ST framework, Orthros, in which non-autoregressive (NAR) and autoregressive (AR) decoders are jointly trained on top of a shared speech encoder.
The AR decoder selects the best translation among candidates of various lengths generated by the NAR decoder, which dramatically improves the effectiveness of a large length beam with negligible overhead.
Experiments on four benchmarks show the effectiveness of the proposed method in improving inference speed while maintaining competitive translation quality.
arXiv Detail & Related papers (2020-10-25T06:35:30Z)
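Several entries above (TranSpeech, Orthros, DASpeech) contrast autoregressive decoding with non-autoregressive (NAR) alternatives. As referenced in the TranSpeech entry, the snippet below sketches a generic mask-predict loop: start from a fully masked unit sequence, re-predict every position each iteration, keep the most confident predictions, and re-mask the rest on a decaying schedule. The model interface and schedule are hypothetical simplifications, not code from any of the papers.

```python
# Generic mask-predict decoding loop for discrete units (a sketch with a
# hypothetical model interface; not any of the papers' actual code).
import torch

MASK = 0  # hypothetical id reserved for the mask token

def mask_predict(model, length, n_iters=4):
    """Iteratively unmask a unit sequence, keeping confident predictions."""
    units = torch.full((1, length), MASK, dtype=torch.long)
    for t in range(n_iters):
        logits = model(units)                      # (1, length, vocab)
        probs, preds = logits.softmax(-1).max(-1)  # confidence and argmax per position
        units = preds
        # Re-mask the least confident positions; the masked fraction decays
        # linearly over iterations (a common mask-predict schedule).
        n_mask = length * (n_iters - 1 - t) // n_iters
        if n_mask > 0:
            lowest = probs[0].topk(n_mask, largest=False).indices
            units[0, lowest] = MASK
    return units

# Toy stand-in for a NAR unit decoder: random logits over a 100-unit vocab.
toy_model = lambda u: torch.randn(u.size(0), u.size(1), 100)
print(mask_predict(toy_model, length=12))
```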
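And, as referenced in the Wav2Seq entry, here is a toy illustration of pseudo-language induction: cluster speech features into discrete ids and collapse consecutive repeats to obtain compact pseudo transcripts, which can then serve as targets for a pseudo speech-recognition pre-training task. The feature source, cluster count, and deduplication step are placeholder assumptions rather than the paper's exact recipe.

```python
# Toy pseudo-language induction in the spirit of the Wav2Seq summary above:
# cluster speech features into discrete ids, deduplicate runs, and use the
# resulting sequences as targets for a pseudo speech-recognition task.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 64))   # stand-in for encoder features (T, D)

kmeans = KMeans(n_clusters=25, n_init=10, random_state=0).fit(frames)
ids = kmeans.predict(frames)          # one pseudo token per frame

# Collapse consecutive repeats so the "pseudo transcript" is compact.
pseudo_tokens = [int(t) for i, t in enumerate(ids) if i == 0 or t != ids[i - 1]]
print(pseudo_tokens[:20])
# These sequences would serve as transcripts for pre-training an
# encoder-decoder model with a standard ASR-style cross-entropy objective.
```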
This list is automatically generated from the titles and abstracts of the papers on this site.