Unveiling the Role of Pretraining in Direct Speech Translation
- URL: http://arxiv.org/abs/2409.18044v1
- Date: Thu, 26 Sep 2024 16:46:46 GMT
- Title: Unveiling the Role of Pretraining in Direct Speech Translation
- Authors: Belen Alastruey, Gerard I. Gállego, Marta R. Costa-jussà
- Abstract summary: We compare the training dynamics of a system using a pretrained encoder, the conventional approach, and one trained from scratch.
We observe that, throughout the training, the randomly initialized model struggles to incorporate information from the speech inputs for its predictions.
We propose a subtle change in the decoder cross-attention to integrate source information from earlier steps in training.
- Score: 14.584351239812394
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Direct speech-to-text translation systems encounter an important drawback in data scarcity. A common solution consists of pretraining the encoder on automatic speech recognition, hence losing efficiency in the training process. In this study, we compare the training dynamics of a system using a pretrained encoder, the conventional approach, and one trained from scratch. We observe that, throughout the training, the randomly initialized model struggles to incorporate information from the speech inputs for its predictions. Hence, we hypothesize that this issue stems from the difficulty of effectively training an encoder for direct speech translation. While a model trained from scratch needs to learn acoustic and semantic modeling simultaneously, a pretrained one can just focus on the latter. Based on these findings, we propose a subtle change in the decoder cross-attention to integrate source information from earlier steps in training. We show that with this change, the model trained from scratch can achieve comparable performance to the pretrained one, while reducing the training time.
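The abstract does not spell out the exact cross-attention modification, so the snippet below is only a minimal sketch of one way a decoder cross-attention sub-layer could be changed to pull in source (encoder) information from the very start of training: a learnable gate down-weights the residual path so the block cannot simply bypass the encoder states early on. The class name, parameter names, and the gating idea itself are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GatedCrossAttention(nn.Module):
    """Decoder cross-attention sub-layer with a learnable gate on the
    residual path (illustrative sketch, not the paper's exact change)."""

    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        # Gate starts at sigmoid(0) = 0.5, down-weighting the residual so the
        # decoder has to read the encoder (source) states early in training.
        self.res_gate = nn.Parameter(torch.tensor(0.0))

    def forward(self, x, enc_out, enc_padding_mask=None):
        # x: (batch, tgt_len, d_model) decoder states
        # enc_out: (batch, src_len, d_model) speech encoder states
        attn_out, _ = self.attn(self.norm(x), enc_out, enc_out,
                                key_padding_mask=enc_padding_mask)
        return torch.sigmoid(self.res_gate) * x + self.dropout(attn_out)


# Hypothetical usage with random tensors in place of real features.
layer = GatedCrossAttention(d_model=512, n_heads=8)
out = layer(torch.randn(2, 10, 512), torch.randn(2, 50, 512))
```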
Related papers
- Inter-connection: Effective Connection between Pre-trained Encoder and
Decoder for Speech Translation [10.103202030679844]
We propose an inter-connection mechanism that aggregates the information from each layer of the speech pre-trained model.
With the speech pre-trained model frozen, this mechanism increased BLEU by approximately 2 points on en-de, en-ja, and en-zh while adding only about 2K parameters (see the sketch after this list).
arXiv Detail & Related papers (2023-05-26T13:01:29Z) - INTapt: Information-Theoretic Adversarial Prompt Tuning for Enhanced
Non-Native Speech Recognition [43.228070238684786]
We propose Information Theoretic Adversarial Prompt Tuning (INTapt) to mitigate representational bias in automatic speech recognition systems.
INTapt is trained with two objectives simultaneously: (1) adversarial training to reduce accent-feature dependence between the original input and the prompt-concatenated input, and (2) CTC-loss minimization to improve ASR performance on the prompt-concatenated input.
Experimental results show that INTapt improves the performance of L2 English and increases feature similarity between L2 and L1 accents.
arXiv Detail & Related papers (2023-05-25T13:06:01Z) - Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
arXiv Detail & Related papers (2022-10-27T08:10:44Z) - Instance Regularization for Discriminative Language Model Pre-training [108.41891836796366]
This work proposes to estimate the complexity of restoring the original sentences from corrupted ones in language model pre-training.
Experimental results on natural language understanding and reading comprehension benchmarks show that our approach improves pre-training efficiency, effectiveness, and robustness.
arXiv Detail & Related papers (2022-10-11T14:16:37Z) - SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder
Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z) - Supervision-Guided Codebooks for Masked Prediction in Speech
Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z) - Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z) - SPIRAL: Self-supervised Perturbation-Invariant Representation Learning
for Speech Pre-Training [25.80559992732508]
SPIRAL works by learning a denoising representation of perturbed data in a teacher-student framework.
We also address the problem of noise robustness, which is critical to real-world speech applications.
arXiv Detail & Related papers (2022-01-25T09:53:36Z) - Audio Captioning using Pre-Trained Large-Scale Language Model Guided by
Audio-based Similar Caption Retrieval [28.57294189207084]
The goal of audio captioning is to translate input audio into its description using natural language.
The proposed method succeeds in using a pre-trained large-scale language model for audio captioning.
The oracle performance of the pre-trained model-based caption generator was clearly better than that of the conventional method trained from scratch.
arXiv Detail & Related papers (2020-12-14T08:27:36Z) - Curriculum Pre-training for End-to-End Speech Translation [51.53031035374276]
We propose a curriculum pre-training method that includes an elementary course for transcription learning and two advanced courses for understanding the utterance and mapping words in two languages.
Experiments show that our curriculum pre-training method leads to significant improvements on En-De and En-Fr speech translation benchmarks.
arXiv Detail & Related papers (2020-04-21T15:12:07Z)
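The Inter-connection paper above aggregates hidden states from every layer of a frozen speech pre-trained model before translation. Its abstract gives no implementation details, so the following is a minimal sketch under the common assumption of a learned softmax-weighted sum over layer outputs, which keeps the added parameter count to roughly one scalar per layer; the module and variable names are hypothetical.

```python
import torch
import torch.nn as nn


class LayerAggregation(nn.Module):
    """Softmax-weighted sum over the hidden states of all layers of a
    (frozen) pre-trained speech encoder -- one plausible inter-connection
    design, not necessarily the paper's."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One scalar per layer: only `num_layers` trainable parameters.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_states):
        # layer_states: list of (batch, seq_len, d_model) tensors, one per layer
        stacked = torch.stack(layer_states, dim=0)            # (L, B, T, D)
        weights = torch.softmax(self.layer_weights, dim=0)    # (L,)
        return torch.einsum("l,lbtd->btd", weights, stacked)


# Hypothetical usage: aggregate 12 layer outputs of a wav2vec-style encoder.
agg = LayerAggregation(num_layers=12)
fused = agg([torch.randn(2, 100, 768) for _ in range(12)])
```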