M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation
- URL: http://arxiv.org/abs/2207.00952v1
- Date: Sun, 3 Jul 2022 04:26:53 GMT
- Title: M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation
- Authors: Jinming Zhao, Hao Yang, Ehsan Shareghi, Gholamreza Haffari
- Abstract summary: We propose M-Adapter, a novel Transformer-based module, to adapt speech representations to text.
While shrinking the speech sequence, M-Adapter produces features desired for speech-to-text translation.
Our experimental results show that our model outperforms a strong baseline by up to 1 BLEU.
- Score: 66.92823764664206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end speech-to-text translation models are often initialized with a
pre-trained speech encoder and a pre-trained text decoder. This leads to a
significant training gap between pre-training and fine-tuning, largely due to
the modality differences between speech outputs from the encoder and text
inputs to the decoder. In this work, we aim to bridge the modality gap between
speech and text to improve translation quality. We propose M-Adapter, a novel
Transformer-based module, to adapt speech representations to text. While
shrinking the speech sequence, M-Adapter produces features desired for
speech-to-text translation via modelling global and local dependencies of a
speech sequence. Our experimental results show that our model outperforms a
strong baseline by up to 1 BLEU score on the MuST-C En$\rightarrow$De
dataset.\footnote{Our code is available at
https://github.com/mingzi151/w2v2-st.}
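The abstract describes the module only at a high level: a Transformer-based adapter that shrinks the speech sequence while modelling its global and local dependencies. The sketch below shows one plausible shape for such an adapter in PyTorch; it is not the authors' implementation (that lives in the repository above), and the specific choices, a strided depthwise convolution for local context and downsampling followed by a single self-attention block for global context, are assumptions made for illustration.

```python
# A minimal, illustrative sketch of a length-shrinking speech-to-text adapter
# in the spirit of M-Adapter. NOT the authors' implementation (see
# https://github.com/mingzi151/w2v2-st); kernel size, stride, and the single
# attention block are assumptions for illustration.
import torch
import torch.nn as nn


class ShrinkingAdapter(nn.Module):
    """Shrinks a speech feature sequence and mixes local/global context.

    A strided depthwise convolution captures local dependencies while
    downsampling the sequence; a self-attention block then models global
    dependencies over the shortened sequence.
    """

    def __init__(self, d_model: int = 768, n_heads: int = 8, stride: int = 2):
        super().__init__()
        # Local modelling + temporal downsampling (assumed kernel/stride).
        self.conv = nn.Conv1d(
            d_model, d_model, kernel_size=3, stride=stride,
            padding=1, groups=d_model,
        )
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model) speech-encoder outputs.
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)    # (batch, time/stride, d_model)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # global dependencies
        x = x + self.ffn(self.norm2(x))
        return x


if __name__ == "__main__":
    feats = torch.randn(2, 100, 768)   # e.g. wav2vec 2.0 frame features
    out = ShrinkingAdapter()(feats)
    print(out.shape)                   # torch.Size([2, 50, 768])
```

Here the stride controls the shrinking factor: halving the sequence roughly halves the cost of the decoder's cross-attention, while the attention block preserves long-range context over the shortened sequence.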
Related papers
- Improving End-to-end Speech Translation by Leveraging Auxiliary Speech and Text Data [38.816953592085156]
We present a method for introducing a text encoder into pre-trained end-to-end speech translation systems.
It enhances the system's ability to adapt one modality (i.e., source-language speech) to another (i.e., source-language text).
arXiv Detail & Related papers (2022-12-04T09:27:56Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model, which aims to improve end-to-end performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach to speech-to-text translation.
The key idea is to generate the source transcript and target translation text with a single decoder (a minimal sketch of this scheme follows the list).
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
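As a concrete illustration of the consecutive-decoding idea in the COSTT entry above, the sketch below shows how a single decoder's training target could be assembled as transcript followed by translation, and how a prediction could be split back apart. The separator token and helper names are assumptions for illustration, not the paper's exact scheme.

```python
# A minimal sketch of consecutive-decoding target construction: one decoder
# first emits the source transcript, then the translation. The <sep> token
# and the helper names here are hypothetical, not COSTT's exact scheme.
from typing import List, Tuple

SEP = "<sep>"  # assumed boundary token between transcript and translation


def build_consecutive_target(transcript: List[str], translation: List[str]) -> List[str]:
    """Training target: source transcript first, then the translation."""
    return transcript + [SEP] + translation


def split_prediction(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """At inference time, split the decoder output back into the two parts."""
    if SEP in tokens:
        i = tokens.index(SEP)
        return tokens[:i], tokens[i + 1:]
    return tokens, []  # no separator emitted: treat everything as transcript


# Example: an English utterance translated to German.
print(build_consecutive_target(["hello", "world"], ["hallo", "welt"]))
# ['hello', 'world', '<sep>', 'hallo', 'welt']
```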
This list is automatically generated from the titles and abstracts of the papers in this site.