Curriculum Pre-training for End-to-End Speech Translation
- URL: http://arxiv.org/abs/2004.10093v1
- Date: Tue, 21 Apr 2020 15:12:07 GMT
- Title: Curriculum Pre-training for End-to-End Speech Translation
- Authors: Chengyi Wang, Yu Wu, Shujie Liu, Ming Zhou and Zhenglu Yang
- Abstract summary: We propose a curriculum pre-training method that includes an elementary course for transcription learning and two advanced courses for understanding the utterance and mapping words in two languages.
Experiments show that our curriculum pre-training method leads to significant improvements on En-De and En-Fr speech translation benchmarks.
- Score: 51.53031035374276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end speech translation poses a heavy burden on the encoder, because it
has to transcribe, understand, and learn cross-lingual semantics
simultaneously. To obtain a powerful encoder, traditional methods pre-train it
on ASR data to capture speech features. However, we argue that pre-training the
encoder only through simple speech recognition is not enough and high-level
linguistic knowledge should be considered. Inspired by this, we propose a
curriculum pre-training method that includes an elementary course for
transcription learning and two advanced courses for understanding the utterance
and mapping words in two languages. The difficulty of these courses is
gradually increasing. Experiments show that our curriculum pre-training method
leads to significant improvements on En-De and En-Fr speech translation
benchmarks.
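The course-by-course schedule described in the abstract can be sketched as a staged training loop: each course runs to completion before the next, harder one begins. This is a minimal illustration only; the course names follow the abstract, but the loss functions here are toy stand-ins, not the paper's actual objectives.

```python
# Minimal sketch of curriculum pre-training: run each course in order of
# increasing difficulty. Loss functions are illustrative placeholders.

def curriculum_pretrain(encoder, courses):
    """Run each course to completion before moving to the next one."""
    history = []
    for name, loss_fn, epochs in courses:
        for _ in range(epochs):
            history.append((name, loss_fn(encoder)))
    return history

# Elementary course first, then the two advanced courses.
courses = [
    ("transcription", lambda enc: 1.0, 2),            # elementary: ASR-style
    ("utterance_understanding", lambda enc: 0.6, 2),  # advanced course 1
    ("word_mapping", lambda enc: 0.3, 2),             # advanced course 2
]

log = curriculum_pretrain(encoder=None, courses=courses)
```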
Related papers
- Unveiling the Role of Pretraining in Direct Speech Translation [14.584351239812394]
We compare the training dynamics of a system using a pretrained encoder, the conventional approach, and one trained from scratch.
We observe that, throughout the training, the randomly initialized model struggles to incorporate information from the speech inputs for its predictions.
We propose a subtle change in the decoder cross-attention to integrate source information from earlier steps in training.
arXiv Detail & Related papers (2024-09-26T16:46:46Z) - Gujarati-English Code-Switching Speech Recognition using ensemble prediction of spoken language [29.058108207186816]
We propose two methods of introducing language specific parameters and explainability in the multi-head attention mechanism.
Despite being unable to reduce WER significantly, our method shows promise in predicting the correct language from just spoken data.
arXiv Detail & Related papers (2024-03-12T18:21:20Z) - Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST)
arXiv Detail & Related papers (2023-08-03T15:47:04Z) - MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition [75.12948999653338]
We propose a novel multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR)
We employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
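A multi-task pre-training framework like the one summarized above combines several per-task losses into one objective. The sketch below shows that combination as a weighted sum; the task names and weights are illustrative assumptions, not MMSpeech's actual configuration.

```python
# Hedged sketch of a multi-task objective: each task contributes a
# weighted loss to one total that trains the shared encoder-decoder.
# Task names and weights below are made up for illustration.

def multitask_loss(task_losses, weights):
    """Weighted sum of per-task losses over a shared model."""
    return sum(weights[name] * loss for name, loss in task_losses.items())

# Toy per-task losses (e.g. self-supervised and supervised objectives).
task_losses = {"masked_speech": 2.0, "masked_text": 1.5, "asr": 3.0}
weights = {"masked_speech": 1.0, "masked_text": 0.5, "asr": 1.0}
total = multitask_loss(task_losses, weights)  # 2.0 + 0.75 + 3.0 = 5.75
```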
arXiv Detail & Related papers (2022-11-29T13:16:09Z) - Code-Switching without Switching: Language Agnostic End-to-End Speech Translation [68.8204255655161]
We treat speech recognition and translation as one unified end-to-end speech translation problem.
By training LAST with both input languages, we decode speech into one target language, regardless of the input language.
arXiv Detail & Related papers (2022-10-04T10:34:25Z) - Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
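The idea of inducing a pseudo language as a compact discrete representation can be roughly sketched as clustering speech frames into unit ids and collapsing repeated ids into pseudo-tokens. The cluster count and feature shapes below are invented for illustration; Wav2Seq's actual pipeline (learned features, subword modeling over units) is more involved.

```python
import numpy as np

# Rough sketch of pseudo-language induction: quantize speech frames
# against a codebook, then merge runs of identical ids. Shapes and
# cluster counts are illustrative assumptions only.

def quantize(frames, codebook):
    """Assign each frame the id of its nearest codebook vector."""
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

def collapse_repeats(ids):
    """Merge runs of identical ids into single pseudo-tokens."""
    out = []
    for i in ids:
        if not out or out[-1] != i:
            out.append(int(i))
    return out

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))   # 50 speech frames, 8-dim features
codebook = rng.normal(size=(4, 8))  # 4 discrete "pseudo-phoneme" units
pseudo_tokens = collapse_repeats(quantize(frames, codebook))
```

The resulting pseudo-token sequence can then serve as the target of a self-supervised pseudo speech recognition task.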
arXiv Detail & Related papers (2022-05-02T17:59:02Z) - Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z) - CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning [11.552745999302905]
More than half of the 7,000 languages in the world are in imminent danger of going extinct.
It is relatively easy to obtain textual translations corresponding to speech.
We construct a convolutional neural network audio encoder capable of extracting linguistic representations from speech.
arXiv Detail & Related papers (2020-06-04T12:21:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.