Inter-connection: Effective Connection between Pre-trained Encoder and
Decoder for Speech Translation
- URL: http://arxiv.org/abs/2305.16897v1
- Date: Fri, 26 May 2023 13:01:29 GMT
- Title: Inter-connection: Effective Connection between Pre-trained Encoder and
Decoder for Speech Translation
- Authors: Yuta Nishikawa, Satoshi Nakamura
- Abstract summary: We propose an inter-connection mechanism that aggregates the information from each layer of the speech pre-trained model.
This mechanism increased BLEU by approximately 2 points in en-de, en-ja, and en-zh while adding only about 2K parameters, with the speech pre-trained model kept frozen.
- Score: 10.103202030679844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In end-to-end speech translation, speech and text pre-trained models improve
translation quality. Recently proposed models simply connect the pre-trained
speech and text models as encoder and decoder, so only the information from the
final layer of the encoder is fed to the decoder. Since each layer of a speech
pre-trained model is known to encode different information, this simple
connection cannot fully utilize what the speech pre-trained model has learned.
In this study, we propose an inter-connection mechanism that aggregates the
information from each layer of the speech pre-trained model with a weighted sum
and feeds it into the decoder. This mechanism increased BLEU by approximately 2
points in en-de, en-ja, and en-zh while adding only about 2K parameters, with
the speech pre-trained model kept frozen. Furthermore, we investigated the
contribution of each layer for each language by visualizing the layer weights
and found that the contributions differed across languages.
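The aggregation described in the abstract can be pictured as an ELMo-style weighted sum over the stacked layer outputs of the frozen speech encoder. The following PyTorch snippet is a minimal sketch under that assumption, not the authors' implementation: the module name, the softmax normalization, and the single global scale are illustrative choices, and the paper's actual parameterization (which accounts for the reported ~2K added parameters) may differ.

    import torch
    import torch.nn as nn

    class LayerAggregation(nn.Module):
        """Illustrative weighted-sum aggregation over encoder layers."""

        def __init__(self, num_layers: int):
            super().__init__()
            # One learnable scalar per encoder layer, plus a global scale.
            self.layer_weights = nn.Parameter(torch.zeros(num_layers))
            self.scale = nn.Parameter(torch.ones(1))

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            # hidden_states: (num_layers, batch, time, dim), the stacked
            # outputs of every layer of the frozen speech pre-trained model.
            weights = torch.softmax(self.layer_weights, dim=0)
            mixed = (weights.view(-1, 1, 1, 1) * hidden_states).sum(dim=0)
            # The mixed representation is what the text decoder attends to.
            return self.scale * mixed

The normalized weights (the softmax over layer_weights) can also be plotted per language pair, which mirrors the layer-weight visualization the abstract uses to compare layer contributions across languages.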
Related papers
- Unveiling the Role of Pretraining in Direct Speech Translation [14.584351239812394]
We compare the training dynamics of a system that uses a pretrained encoder (the conventional approach) with those of one trained from scratch.
We observe that, throughout training, the randomly initialized model struggles to incorporate information from the speech inputs into its predictions.
We propose a subtle change in the decoder cross-attention to integrate source information from earlier steps in training.
arXiv Detail & Related papers (2024-09-26T16:46:46Z)
- CoLLD: Contrastive Layer-to-layer Distillation for Compressing Multilingual Pre-trained Speech Encoders [19.32466171141613]
Large-scale self-supervised pre-trained speech encoders outperform conventional approaches in speech recognition and translation tasks.
However, building new encoders for new tasks and deploying them to on-device applications is infeasible.
We propose Contrastive Layer-to-layer Distillation (CoLLD), a novel knowledge distillation method to compress pre-trained speech encoders.
arXiv Detail & Related papers (2023-09-14T13:38:02Z)
- On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation [66.92823764664206]
We propose M-Adapter, a novel Transformer-based module, to adapt speech representations to text.
While shrinking the speech sequence, M-Adapter produces features desired for speech-to-text translation.
Our experimental results show that our model outperforms a strong baseline by up to 1 BLEU.
arXiv Detail & Related papers (2022-07-03T04:26:53Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by a relative 19.2% compared with the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.