End-to-end Speech Translation via Cross-modal Progressive Training
- URL: http://arxiv.org/abs/2104.10380v1
- Date: Wed, 21 Apr 2021 06:44:31 GMT
- Title: End-to-end Speech Translation via Cross-modal Progressive Training
- Authors: Rong Ye, Mingxuan Wang, Lei Li
- Abstract summary: Cross Speech-Text Network (XSTNet) is an end-to-end model for speech-to-text translation.
XSTNet takes both speech and text as input and outputs both transcription and translation text.
XSTNet achieves state-of-the-art results on all three language directions with an average BLEU of 27.8, outperforming the previous best method by 3.7 BLEU.
- Score: 12.916100727707809
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end speech translation models have become a new trend in research
due to their potential to reduce error propagation. However, these models
still suffer from the challenge of data scarcity. How to effectively make use
of unlabeled data or other parallel corpora from machine translation is promising
but still an open problem. In this paper, we propose Cross Speech-Text Network
(XSTNet), an end-to-end model for speech-to-text translation. XSTNet takes both
speech and text as input and outputs both transcription and translation text.
The model benefits from its three key design aspects: a self-supervised
pre-trained sub-network as the audio encoder, a multi-task training objective
to exploit additional parallel bilingual text, and a progressive training
procedure. We evaluate the performance of XSTNet and baselines on the MuST-C
En-De/Fr/Ru datasets. XSTNet achieves state-of-the-art results on all three
language directions with an average BLEU of 27.8, outperforming the previous
best method by 3.7 BLEU. The code and the models will be released to the
public.
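The multi-task objective described above lends itself to a compact illustration. The following is a minimal, hypothetical PyTorch sketch, not the authors' released code: a shared Transformer encoder-decoder consumes either speech or text input and is updated on the sum of ST, ASR and MT losses, with a single Conv1d standing in for the self-supervised wav2vec-style audio encoder and random tensors standing in for MuST-C batches; the progressive schedule is only indicated in the comments.

```python
# Toy sketch of a cross-modal multi-task objective in the spirit of XSTNet.
# Class names, sizes and the Conv1d front-end are illustrative assumptions,
# not taken from the paper's implementation.
import torch
import torch.nn as nn

VOCAB, D_MODEL, PAD = 1000, 256, 0

class ToyCrossModalST(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for the self-supervised audio encoder (e.g. a wav2vec-style
        # network): downsamples raw audio into D_MODEL-dimensional features.
        self.audio_frontend = nn.Conv1d(1, D_MODEL, kernel_size=10, stride=5)
        self.text_embed = nn.Embedding(VOCAB, D_MODEL, padding_idx=PAD)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True)
        self.out_proj = nn.Linear(D_MODEL, VOCAB)

    def forward(self, src, tgt_in, src_is_audio):
        if src_is_audio:                       # speech input: (B, 1, T_wave)
            src = self.audio_frontend(src).transpose(1, 2)
        else:                                  # text input: (B, T_src) token ids
            src = self.text_embed(src)
        tgt = self.text_embed(tgt_in)
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=mask)
        return self.out_proj(hidden)

def multitask_step(model, loss_fn, audio, transcript, translation):
    """Sum the ST, ASR and MT losses computed with the same shared model."""
    tasks = [
        (audio, True, translation),        # ST: speech -> translation
        (audio, True, transcript),         # ASR: speech -> transcription
        (transcript, False, translation),  # MT: transcript text -> translation
    ]
    total = 0.0
    for src, is_audio, tgt in tasks:
        logits = model(src, tgt[:, :-1], src_is_audio=is_audio)
        total = total + loss_fn(logits.reshape(-1, VOCAB), tgt[:, 1:].reshape(-1))
    return total

model = ToyCrossModalST()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)

# Random stand-ins for a batch of (audio, transcript, translation) triples.
audio = torch.randn(2, 1, 1000)
transcript = torch.randint(1, VOCAB, (2, 12))
translation = torch.randint(1, VOCAB, (2, 14))

# Progressive training, schematically: an earlier stage would run MT-only
# updates on external parallel text before the full multi-task stage below.
loss = multitask_step(model, loss_fn, audio, transcript, translation)
loss.backward()
optimizer.step()
print(float(loss))
```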
Related papers
- Improving speech translation by fusing speech and text [24.31233927318388]
We harness the complementary strengths of speech and text, which are disparate modalities.
We propose Fuse-Speech-Text (FST), a cross-modal model which supports three distinct input modalities for translation.
arXiv Detail & Related papers (2023-05-23T13:13:48Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection to improve the performance of a streaming model commonly used in industry, the Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data and then investigate injecting the generated text into the T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- M3ST: Mix at Three Levels for Speech Translation [66.71994367650461]
We propose the Mix at three levels for Speech Translation (M3ST) method to increase the diversity of the augmented training corpus.
In the first stage of fine-tuning, we mix the training corpus at three levels, including word level, sentence level and frame level, and fine-tune the entire model with mixed data.
Experiments on MuST-C speech translation benchmark and analysis show that M3ST outperforms current strong baselines and achieves state-of-the-art results on eight directions with an average BLEU of 29.9.
arXiv Detail & Related papers (2022-12-07T14:22:00Z)
- Simple and Effective Unsupervised Speech Translation [68.25022245914363]
We study a simple and effective approach to build speech translation systems without labeled data.
We present an unsupervised domain adaptation technique for pre-trained speech models.
Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art.
arXiv Detail & Related papers (2022-10-18T22:26:13Z)
- M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation [66.92823764664206]
We propose M-Adapter, a novel Transformer-based module, to adapt speech representations to text.
While shrinking the speech sequence, M-Adapter produces features desired for speech-to-text translation.
Our experimental results show that our model outperforms a strong baseline by up to 1 BLEU.
arXiv Detail & Related papers (2022-07-03T04:26:53Z)
- TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect [0.0]
We investigate the feasibility of training monolingual Transformer-based language models for under-represented languages.
We show that the use of noisy web-crawled data instead of structured data is more convenient for such a non-standardized language.
Our best performing TunBERT model reaches or improves the state-of-the-art in all three downstream tasks.
arXiv Detail & Related papers (2021-11-25T15:49:50Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
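To make the consecutive-decoding idea from the COSTT entry above concrete, here is a minimal Python sketch of how a single decoder target could be laid out as transcript followed by translation; the special tokens and helper functions are illustrative assumptions, not taken from the COSTT implementation.

```python
# Toy illustration of a consecutive (transcript-then-translation) decoder
# target. Token names and helpers are hypothetical, not from COSTT's code.
SEP, BOS, EOS = "<sep>", "<bos>", "<eos>"

def make_consecutive_target(transcript, translation):
    """Concatenate transcript and translation into one decoder target."""
    return [BOS] + transcript + [SEP] + translation + [EOS]

def split_hypothesis(decoded):
    """Recover (transcript, translation) from a decoded token sequence."""
    body = [t for t in decoded if t not in (BOS, EOS)]
    cut = body.index(SEP) if SEP in body else len(body)
    return body[:cut], body[cut + 1:]

# En-De example in the spirit of MuST-C: the decoder first emits the English
# transcript, then the German translation, in a single pass.
target = make_consecutive_target(["how", "are", "you"],
                                 ["wie", "geht", "es", "dir"])
print(target)
print(split_hypothesis(target))
```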
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.