CTC-based Non-autoregressive Textless Speech-to-Speech Translation
- URL: http://arxiv.org/abs/2406.07330v1
- Date: Tue, 11 Jun 2024 15:00:33 GMT
- Title: CTC-based Non-autoregressive Textless Speech-to-Speech Translation
- Authors: Qingkai Fang, Zhengrui Ma, Yan Zhou, Min Zhang, Yang Feng
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Direct speech-to-speech translation (S2ST) has achieved impressive translation quality, but it often faces the challenge of slow decoding due to the considerable length of speech sequences. Recently, some research has turned to non-autoregressive (NAR) models to expedite decoding, yet their translation quality typically lags significantly behind that of autoregressive (AR) models. In this paper, we investigate the performance of CTC-based NAR models in S2ST, as these models have shown impressive results in machine translation. Experimental results demonstrate that by combining pretraining, knowledge distillation, and advanced NAR training techniques such as glancing training and non-monotonic latent alignments, CTC-based NAR models achieve translation quality comparable to the AR model while delivering up to a 26.81$\times$ decoding speedup.
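The speedup claim rests on CTC's parallel decoding: the model emits a frame-level distribution for every position in a single forward pass, and a deterministic collapse (merge repeats, drop blanks) turns that into the final discrete-unit sequence, with no left-to-right dependency. Below is a minimal sketch of that collapse step in PyTorch; the function name, tensor shapes, and blank id are illustrative assumptions, not the paper's code.

```python
import torch

BLANK_ID = 0  # assumed id of the CTC blank token

def ctc_greedy_decode(logits: torch.Tensor) -> list[list[int]]:
    """Greedy CTC collapse: merge repeated tokens, then drop blanks.

    logits: (batch, time, vocab), produced by one parallel forward
    pass -- the absence of step-by-step generation is where the NAR
    decoding speedup over AR models comes from.
    """
    preds = logits.argmax(dim=-1)  # (batch, time): all positions at once
    decoded = []
    for seq in preds.tolist():
        out, prev = [], None
        for tok in seq:
            if tok != prev and tok != BLANK_ID:
                out.append(tok)  # keep the first token of each run, skip blanks
            prev = tok
        decoded.append(out)
    return decoded
```

For example, a frame-level prediction `[a, a, <blank>, b, b]` collapses to `[a, b]`.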
Related papers
- DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation
Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model.
Due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution.
We propose DASpeech, a non-autoregressive direct S2ST model that achieves both fast decoding and high translation quality.
arXiv Detail & Related papers (2023-10-11T11:39:36Z)
- CTC-based Non-autoregressive Speech Translation
We investigate the potential of connectionist temporal classification for non-autoregressive speech translation.
We develop a model consisting of two encoders that are guided by CTC to predict the source and target texts.
Experiments on the MuST-C benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a 5.67$\times$ speed-up.
arXiv Detail & Related papers (2023-05-27T03:54:09Z)
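A rough sketch of the two-encoder, CTC-guided layout the NAST entry above describes: speech features pass through a source-side encoder with a CTC head over source text, then a target-side encoder with a CTC head over target text. Module sizes, vocabularies, and the use of PyTorch's nn.CTCLoss are assumptions for illustration, not the authors' implementation.

```python
import torch.nn as nn

class TwoEncoderCTC(nn.Module):
    """Two stacked encoders, each guided by a CTC objective."""

    def __init__(self, d_model=256, src_vocab=1000, tgt_vocab=1000):
        super().__init__()
        def encoder():
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=4)
        self.src_encoder = encoder()   # guided by CTC over source text
        self.tgt_encoder = encoder()   # guided by CTC over target text
        self.src_head = nn.Linear(d_model, src_vocab + 1)  # +1 for blank
        self.tgt_head = nn.Linear(d_model, tgt_vocab + 1)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, feats, src_text, tgt_text, feat_lens, src_lens, tgt_lens):
        h_src = self.src_encoder(feats)   # (batch, frames, d_model)
        h_tgt = self.tgt_encoder(h_src)   # second encoder stacked on the first
        # nn.CTCLoss expects (time, batch, vocab) log-probabilities.
        lp_src = self.src_head(h_src).log_softmax(-1).transpose(0, 1)
        lp_tgt = self.tgt_head(h_tgt).log_softmax(-1).transpose(0, 1)
        return (self.ctc(lp_src, src_text, feat_lens, src_lens)
                + self.ctc(lp_tgt, tgt_text, feat_lens, tgt_lens))
```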
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive baseline.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
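The "repeatedly masks and predicts unit choices" step in the TranSpeech entry above follows the general mask-predict recipe for NAR decoding. Here is a hedged sketch under assumed interfaces; `model`, the mask id, and the linear remasking schedule are all stand-ins, not TranSpeech's actual code.

```python
import torch

MASK_ID = 0  # assumed id of the mask token

@torch.no_grad()
def mask_predict(model, length: int, iterations: int = 4) -> torch.Tensor:
    units = torch.full((1, length), MASK_ID)   # start from a fully masked sequence
    scores = torch.zeros(1, length)
    for it in range(iterations, 0, -1):
        logits = model(units)                  # (1, length, vocab), parallel prediction
        probs, preds = logits.softmax(-1).max(-1)
        masked = units.eq(MASK_ID)
        units[masked] = preds[masked]          # fill every masked slot
        scores[masked] = probs[masked]
        n_mask = int(length * (it - 1) / iterations)  # linearly decaying schedule
        if n_mask > 0:
            worst = scores[0].topk(n_mask, largest=False).indices
            units[0, worst] = MASK_ID          # re-mask the least confident units
            scores[0, worst] = 0.0
    return units
```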
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring
This article describes an efficient end-to-end speech translation (E2E-ST) framework based on non-autoregressive (NAR) models.
We propose a unified NAR E2E-ST framework called Orthros, which has an NAR decoder and an auxiliary shallow AR decoder on top of the shared encoder.
arXiv Detail & Related papers (2021-09-09T16:50:16Z)
- Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder
We propose a novel NAR E2E-ST framework, Orthros, in which both NAR and autoregressive (AR) decoders are jointly trained on the shared speech encoder.
The AR decoder selects the best translation among candidates of various lengths generated by the NAR decoder, which dramatically improves the effectiveness of a large length beam with negligible overhead.
Experiments on four benchmarks show the effectiveness of the proposed method in improving inference speed while maintaining competitive translation quality.
arXiv Detail & Related papers (2020-10-25T06:35:30Z)
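A hedged sketch of the dual-decoder selection step described in the Orthros entry above: the NAR decoder proposes one candidate per length in a length beam, and the auxiliary AR decoder scores each candidate in a single teacher-forced pass (scoring, not step-by-step generation, which is why the overhead stays small). `nar_generate` and `ar_score` are hypothetical stand-ins for the two decoders.

```python
import torch

@torch.no_grad()
def translate_with_rescoring(encoder_out, nar_generate, ar_score,
                             pred_len: int, length_beam: int = 9):
    # Candidate target lengths centered on the model's length prediction.
    half = length_beam // 2
    lengths = range(max(1, pred_len - half), pred_len + half + 1)
    candidates = [nar_generate(encoder_out, n) for n in lengths]  # parallel NAR passes
    # Length-normalized AR log-probabilities, one teacher-forced pass each.
    scores = [ar_score(encoder_out, c) / len(c) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]
```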