Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages
- URL: http://arxiv.org/abs/2205.01086v1
- Date: Mon, 2 May 2022 17:59:02 GMT
- Title: Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages
- Authors: Felix Wu, Kwangyoun Kim, Shinji Watanabe, Kyu Han, Ryan McDonald, Kilian Q. Weinberger, Yoav Artzi
- Abstract summary: We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
- Score: 58.43299730989809
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Wav2Seq, the first self-supervised approach to pre-train both
parts of encoder-decoder models for speech data. We induce a pseudo language as
a compact discrete representation, and formulate a self-supervised pseudo
speech recognition task -- transcribing audio inputs into pseudo subword
sequences. This process stands on its own, or can be applied as low-cost
second-stage pre-training. We experiment with automatic speech recognition
(ASR), spoken named entity recognition, and speech-to-text translation. We set
new state-of-the-art results for end-to-end spoken named entity recognition,
and show consistent improvements on 20 language pairs for speech-to-text
translation, even when competing methods use additional text data for training.
Finally, on ASR, our approach enables encoder-decoder methods to benefit from
pre-training for all parts of the network, and shows comparable performance to
highly optimized recent methods.
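To make the pseudo-language construction concrete, here is a minimal sketch of one plausible induction pipeline: quantize self-supervised speech features with k-means, collapse consecutive repeats, and treat the resulting unit strings as text for subword learning. The use of scikit-learn, the feature dimension, the cluster count, and the random stand-in features are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of pseudo-language induction: quantize frame-level
# speech features with k-means, collapse consecutive repeats, and render
# each utterance as a string of unit ids. Dimensions (768), cluster
# count (25), and random stand-in features are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def features_to_units(features: np.ndarray, kmeans: KMeans) -> list[int]:
    """Map frame-level features (T, D) to a deduplicated unit sequence."""
    units = kmeans.predict(features)
    # Collapse runs like 3 17 4 4 4 9 -> 3 17 4 9 for a compact sequence.
    return [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

frames = np.random.randn(10000, 768)    # stand-in for self-supervised features
kmeans = KMeans(n_clusters=25, n_init=4).fit(frames)

utterance = np.random.randn(200, 768)   # one utterance's frame features
unit_string = " ".join(map(str, features_to_units(utterance, kmeans)))
print(unit_string[:60])
# Training a subword model (e.g., BPE) on such unit strings would yield the
# pseudo subword targets for the pseudo speech recognition task.
```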
Related papers
- Understanding Shared Speech-Text Representations [34.45772613231558]
Maestro has developed approaches to train speech models by incorporating text into end-to-end models.
We find that a corpus-specific duration model for speech-text alignment is the most important component for learning a shared speech-text representation.
We find that the shared encoder learns a more compact and overlapping speech-text representation than the uni-modal encoders.
arXiv Detail & Related papers (2023-04-27T20:05:36Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance model performance with subword prediction in the first-pass decoder.
We show that the proposed methods boost performance even when predicting spectrograms in the second pass.
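For intuition, here is a rough schematic of the two-pass idea: the encoder output feeds a first-pass text head, and a second decoder predicts discrete acoustic units. The module choices, shapes, and the use of simple GRUs (instead of attention-based autoregressive decoders) are simplifying assumptions, not the paper's architecture.

```python
# Hypothetical two-pass schematic in the spirit of UnitY.
import torch
import torch.nn as nn

class TwoPassS2ST(nn.Module):
    def __init__(self, d_model=256, text_vocab=1000, unit_vocab=100):
        super().__init__()
        self.encoder = nn.GRU(80, d_model, batch_first=True)    # speech encoder over fbank frames
        self.text_head = nn.Linear(d_model, text_vocab)          # first pass: subword text
        self.second_pass = nn.GRU(d_model, d_model, batch_first=True)
        self.unit_head = nn.Linear(d_model, unit_vocab)          # second pass: acoustic units

    def forward(self, fbank):                     # fbank: (B, T, 80)
        enc, _ = self.encoder(fbank)
        text_logits = self.text_head(enc)         # first-pass textual prediction
        dec, _ = self.second_pass(enc)
        unit_logits = self.unit_head(dec)         # second-pass unit prediction
        return text_logits, unit_logits

model = TwoPassS2ST()
text_logits, unit_logits = model(torch.randn(2, 50, 80))
print(text_logits.shape, unit_logits.shape)       # (2, 50, 1000), (2, 50, 100)
```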
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C achieves a 19.2% relative reduction in word error rate (WER) over the method without decoder pre-training.
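As context for the reported number, a relative WER reduction is measured against the baseline's absolute WER; a small helper with hypothetical numbers (not from the paper):

```python
def relative_wer_reduction(wer_baseline: float, wer_new: float) -> float:
    """Relative WER reduction, in percent, against a baseline."""
    return 100.0 * (wer_baseline - wer_new) / wer_baseline

# Hypothetical illustration: dropping from 10.0% to 8.08% absolute WER
# is a 19.2% relative reduction.
print(round(relative_wer_reduction(10.0, 8.08), 1))  # 19.2
```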
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training [33.02912456062474]
We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech.
We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST2 speech translation.
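A schematic of the joint objective this describes: one shared encoder optimized with a BERT-style masked LM loss on text and a w2v-BERT loss on speech. The stub losses and equal weighting below are placeholders for illustration, not the paper's implementation.

```python
# Stub losses standing in for the real objectives: BERT-style masked LM on
# text, w2v-BERT (contrastive + masked prediction) on speech. Values and
# the weighting alpha are placeholder assumptions.
def mlm_loss(text_batch) -> float:
    return 2.3   # would score masked-token predictions from the shared encoder

def w2v_bert_loss(speech_batch) -> float:
    return 1.7   # would combine contrastive and masked-prediction terms

def joint_loss(text_batch, speech_batch, alpha: float = 1.0) -> float:
    """Sum both self-supervised objectives for one pre-training step."""
    return mlm_loss(text_batch) + alpha * w2v_bert_loss(speech_batch)

print(joint_loss(text_batch=None, speech_batch=None))  # 4.0 with the stubs
```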
arXiv Detail & Related papers (2021-10-20T00:59:36Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language directly into text in another language.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.