Regularizing End-to-End Speech Translation with Triangular Decomposition
Agreement
- URL: http://arxiv.org/abs/2112.10991v1
- Date: Tue, 21 Dec 2021 05:24:01 GMT
- Title: Regularizing End-to-End Speech Translation with Triangular Decomposition
Agreement
- Authors: Yichao Du, Zhirui Zhang, Weizhi Wang, Boxing Chen, Jun Xie, Tong Xu
- Abstract summary: We propose a novel regularization method for model training to improve the agreement of dual-path decomposition within triplet data.
Experiments on the MuST-C benchmark demonstrate that our proposed approach significantly outperforms state-of-the-art E2E-ST baselines.
- Score: 27.87144563354033
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end speech-to-text translation (E2E-ST) is becoming increasingly
popular due to its potential for less error propagation, lower latency, and
fewer parameters. Given the triplet training corpus $\langle speech,
transcription, translation\rangle$, the conventional high-quality E2E-ST system
leverages the $\langle speech, transcription\rangle$ pair to pre-train the
model and then utilizes the $\langle speech, translation\rangle$ pair to
optimize it further. However, this process only involves two-tuple data at each
stage, and this loose coupling fails to fully exploit the association among the
triplet data. In this paper, we attempt to model the joint probability of
transcription and translation based on the speech input to directly leverage
such triplet data. Based on that, we propose a novel regularization method for
model training that improves the agreement between the two decomposition paths
of the triplet data, whose probabilities should be equal in theory. To achieve
this goal, we introduce two Kullback-Leibler divergence regularization terms
into the training objective to reduce the mismatch between the output
probabilities of the two paths. The well-trained model can then be naturally
transformed into an E2E-ST model via a pre-defined early-stop tag. Experiments on the MuST-C
benchmark demonstrate that our proposed approach significantly outperforms
state-of-the-art E2E-ST baselines on all 8 language pairs, while achieving
better performance in the automatic speech recognition task. Our code is
open-sourced at https://github.com/duyichao/E2E-ST-TDA.
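As a rough illustration of the agreement term described above, a symmetric Kullback-Leibler penalty between the output distributions of the two decomposition paths can be sketched as follows. This is a minimal stand-alone sketch under stated assumptions: the function names and plain-list distributions are illustrative, not the paper's actual implementation.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dual_path_agreement_loss(path1_probs, path2_probs):
    """Symmetric KL penalty pushing the two decomposition paths
    (speech -> transcription -> translation vs. speech -> translation ->
    transcription) toward assigning matching output probabilities."""
    return 0.5 * (kl_divergence(path1_probs, path2_probs)
                  + kl_divergence(path2_probs, path1_probs))
```

The penalty vanishes exactly when the two paths agree, matching the theoretical equality of the two joint decompositions.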
Related papers
- Transfer Learning from Pre-trained Language Models Improves End-to-End
Speech Summarization [48.35495352015281]
End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model.
Due to the high cost of collecting speech-summary pairs, an E2E SSum model tends to suffer from training data scarcity and to output unnatural sentences.
We propose for the first time to integrate a pre-trained language model (LM) into the E2E SSum decoder via transfer learning.
arXiv Detail & Related papers (2023-06-07T08:23:58Z) - M3ST: Mix at Three Levels for Speech Translation [66.71994367650461]
We propose Mix at three levels for Speech Translation (M3ST) method to increase the diversity of the augmented training corpus.
In the first stage of fine-tuning, we mix the training corpus at three levels, including word level, sentence level and frame level, and fine-tune the entire model with mixed data.
Experiments on MuST-C speech translation benchmark and analysis show that M3ST outperforms current strong baselines and achieves state-of-the-art results on eight directions with an average BLEU of 29.9.
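The summary above does not spell out the mixing operations; as one hypothetical reading, frame-level mixing might interpolate aligned speech feature sequences in the spirit of standard mixup. The function below is an illustrative guess under that assumption, not M3ST's actual procedure.

```python
def frame_level_mix(frames_a, frames_b, lam=0.5):
    """Convex interpolation of two equal-length speech feature sequences,
    frame by frame -- a mixup-style operation at the frame level."""
    return [[lam * fa + (1 - lam) * fb for fa, fb in zip(row_a, row_b)]
            for row_a, row_b in zip(frames_a, frames_b)]
```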
arXiv Detail & Related papers (2022-12-07T14:22:00Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
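The repeated mask-and-predict procedure mentioned above follows the generic non-autoregressive mask-predict recipe. A minimal sketch is given below, with a caller-supplied `predict_fn` standing in for the model; that interface is an assumption for illustration, not TranSpeech's API.

```python
def mask_predict(predict_fn, length, iterations=3):
    """Generic iterative mask-predict decoding: all positions start masked;
    each round re-predicts every position, then re-masks the least-confident
    ones, with the masked fraction shrinking linearly to zero."""
    tokens = [None] * length  # None marks a masked position
    for t in range(iterations):
        # predict_fn fills in every position and returns per-position confidences
        tokens, confidences = predict_fn(tokens)
        n_mask = int(length * (iterations - 1 - t) / iterations)
        if n_mask == 0:
            break
        # re-mask the n_mask least-confident positions for the next round
        worst = sorted(range(length), key=lambda i: confidences[i])[:n_mask]
        for i in worst:
            tokens[i] = None
    return tokens
```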
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - Non-Parametric Domain Adaptation for End-to-End Speech Translation [72.37869362559212]
End-to-End Speech Translation (E2E-ST) has received increasing attention due to its potential for less error propagation, lower latency, and fewer parameters.
We propose a novel non-parametric method that leverages domain-specific text translation corpus to achieve domain adaptation for the E2E-ST system.
arXiv Detail & Related papers (2022-05-23T11:41:02Z) - Enhanced Direct Speech-to-Speech Translation Using Self-supervised
Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z) - End-to-end Speech Translation via Cross-modal Progressive Training [12.916100727707809]
Cross Speech-Text Network (XSTNet) is an end-to-end model for speech-to-text translation.
XSTNet takes both speech and text as input and outputs both transcription and translation text.
XSTNet achieves state-of-the-art results on all three language directions with an average BLEU of 27.8, outperforming the previous best method by 3.7 BLEU.
arXiv Detail & Related papers (2021-04-21T06:44:31Z) - Decoupling Pronunciation and Language for End-to-end Code-switching
Automatic Speech Recognition [66.47000813920617]
We propose a decoupled transformer model to use monolingual paired data and unpaired text data.
The model is decoupled into two parts: audio-to-phoneme (A2P) network and phoneme-to-text (P2T) network.
By using monolingual data and unpaired text data, the decoupled transformer model reduces the E2E model's heavy dependency on code-switched paired training data.
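The decoupled interface can be sketched with toy lookup tables in place of the two sub-networks. Everything here, from table contents to function names, is illustrative of the composition only, not the paper's implementation.

```python
# Toy stand-ins for the two decoupled networks: in the paper these are
# transformer sub-networks; here simple lookup tables illustrate the interface.
PHONEME_TABLE = {"hello": ["HH", "AH", "L", "OW"]}  # hypothetical A2P output
TEXT_TABLE = {("HH", "AH", "L", "OW"): "hello"}     # hypothetical P2T mapping

def a2p(audio_token):
    """Audio-to-phoneme: trained on monolingual paired (speech, phoneme) data."""
    return PHONEME_TABLE[audio_token]

def p2t(phonemes):
    """Phoneme-to-text: can be trained on unpaired text alone."""
    return TEXT_TABLE[tuple(phonemes)]

def transcribe(audio_token):
    """Decoupled pipeline: the two parts train separately, compose at inference."""
    return p2t(a2p(audio_token))
```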
arXiv Detail & Related papers (2020-10-28T07:46:15Z) - Unsupervised Pretraining for Neural Machine Translation Using Elastic
Weight Consolidation [0.0]
This work presents our ongoing research on unsupervised pretraining in neural machine translation (NMT).
In our method, we initialize the weights of the encoder and decoder with two language models that are trained with monolingual data.
We show that initializing the bidirectional NMT encoder with a left-to-right language model and forcing the model to remember the original left-to-right language modeling task limits the learning capacity of the encoder.
arXiv Detail & Related papers (2020-10-19T11:51:45Z)
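For context, the Elastic Weight Consolidation named in the title above adds a quadratic penalty pulling the weights toward their pretrained values, scaled per parameter by (diagonal) Fisher information. A minimal sketch of that standard penalty, with flat parameter lists as a simplifying assumption:

```python
def ewc_penalty(params, old_params, fisher, lam=1.0):
    """Elastic Weight Consolidation penalty: quadratic pull toward the
    pretrained weights, weighted per-parameter by Fisher information."""
    return lam * sum(f * (p - p0) ** 2
                     for p, p0, f in zip(params, old_params, fisher))
```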
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.