SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text
Joint Pre-Training
- URL: http://arxiv.org/abs/2110.10329v1
- Date: Wed, 20 Oct 2021 00:59:36 GMT
- Title: SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text
Joint Pre-Training
- Authors: Ankur Bapna, Yu-an Chung, Nan Wu, Anmol Gulati, Ye Jia, Jonathan H.
Clark, Melvin Johnson, Jason Riesa, Alexis Conneau, Yu Zhang
- Abstract summary: We build a single encoder with the BERT objective on unlabeled text together with the w2v-BERT objective on unlabeled speech.
We demonstrate that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST2 speech translation.
- Score: 33.02912456062474
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unsupervised pre-training is now the predominant approach for both text and
speech understanding. Self-attention models pre-trained on large amounts of
unannotated data have been hugely successful when fine-tuned on downstream
tasks from a variety of domains and languages. This paper takes the
universality of unsupervised language pre-training one step further, by
unifying speech and text pre-training within a single model. We build a single
encoder with the BERT objective on unlabeled text together with the w2v-BERT
objective on unlabeled speech. To further align our model representations
across modalities, we leverage alignment losses, specifically Translation
Language Modeling (TLM) and Speech Text Matching (STM), which make use of
supervised speech-text recognition data. We demonstrate that incorporating both
speech and text data during pre-training can significantly improve downstream
quality on CoVoST 2 speech translation, by around 1 BLEU compared to
single-modality pre-trained models, while retaining close to SotA performance
on LibriSpeech and SpeechStew ASR tasks. On four GLUE tasks and
text-normalization, we observe evidence of capacity limitations and
interference between the two modalities, leading to degraded performance
compared to an equivalent text-only model, while still being competitive with
BERT. Through extensive empirical analysis we also demonstrate the importance
of the choice of objective function for speech pre-training, and the beneficial
effect of adding additional supervised signals on the quality of the learned
representations.
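To make the abstract's training recipe concrete, the sketch below illustrates one plausible way to combine the four objectives (BERT-style MLM on text, w2v-BERT-style masked unit prediction on speech, TLM over paired speech-text, and STM matching) over a single shared encoder. This is a minimal illustration only: the module names, batch keys, loss weights, and masking details are assumptions for the sketch, not the SLAM authors' released implementation.

```python
# Illustrative sketch of joint speech-text pre-training with one shared encoder.
# All names and batch fields here are hypothetical stand-ins, not the paper's code.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Single Transformer encoder consumed by both modalities (hypothetical stand-in)."""
    def __init__(self, dim=256, layers=4, heads=4, vocab=1000):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, dim)      # text token embeddings
        self.speech_proj = nn.Linear(80, dim)         # project log-mel frames to model dim
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.mlm_head = nn.Linear(dim, vocab)         # predicts masked text tokens
        self.unit_head = nn.Linear(dim, 512)          # predicts discretized speech targets (w2v-BERT style)
        self.match_head = nn.Linear(dim, 2)           # speech-text matching classifier (STM)

    def forward(self, text_ids=None, speech_feats=None):
        parts = []
        if text_ids is not None:
            parts.append(self.text_emb(text_ids))
        if speech_feats is not None:
            parts.append(self.speech_proj(speech_feats))
        x = torch.cat(parts, dim=1)                   # concatenate modalities along time for paired inputs
        return self.encoder(x)

def joint_pretrain_step(model, batch, w=(1.0, 1.0, 1.0, 1.0)):
    """Combine the four objectives from the abstract with scalar weights (weights assumed)."""
    ce = nn.CrossEntropyLoss()

    # 1) BERT-style MLM on unlabeled text (simplified: loss over all positions).
    h_text = model(text_ids=batch["masked_text"])
    mlm = ce(model.mlm_head(h_text).transpose(1, 2), batch["text_targets"])

    # 2) w2v-BERT-style masked prediction of discrete units on unlabeled speech.
    h_speech = model(speech_feats=batch["masked_speech"])
    w2v = ce(model.unit_head(h_speech).transpose(1, 2), batch["unit_targets"])

    # 3) TLM: feed a paired (speech, masked transcript) example and predict the text tokens.
    h_pair = model(text_ids=batch["paired_masked_text"], speech_feats=batch["paired_speech"])
    n_text = batch["paired_masked_text"].size(1)
    tlm = ce(model.mlm_head(h_pair[:, :n_text]).transpose(1, 2), batch["paired_text_targets"])

    # 4) STM: classify whether the speech and text halves of the pair actually match.
    stm = ce(model.match_head(h_pair.mean(dim=1)), batch["match_labels"])

    return w[0] * mlm + w[1] * w2v + w[2] * tlm + w[3] * stm
```

The point of the sketch is only the shape of a multi-objective step over one encoder; in practice the relative weighting of the terms, the masking strategy, and the mixing of modality batches are the design choices the paper's empirical analysis is concerned with.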
Related papers
- Scaling Speech-Text Pre-training with Synthetic Interleaved Data [31.77653849518526]
Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction.
Traditional approaches for developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data.
We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora.
arXiv Detail & Related papers (2024-11-26T17:19:09Z)
- Few-Shot Spoken Language Understanding via Joint Speech-Text Models [18.193191170754744]
Recent work on speech representation models jointly pre-trained with text has demonstrated the potential of improving speech representations.
We leverage such shared representations to address the persistent challenge of limited data availability in spoken language understanding tasks.
By employing a pre-trained speech-text model, we find that models fine-tuned on text can be effectively transferred to speech testing data.
arXiv Detail & Related papers (2023-10-09T17:59:21Z)
- Simple and Effective Unsupervised Speech Translation [68.25022245914363]
We study a simple and effective approach to build speech translation systems without labeled data.
We present an unsupervised domain adaptation technique for pre-trained speech models.
Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art.
arXiv Detail & Related papers (2022-10-18T22:26:13Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Unified Speech-Text Pre-training for Speech Translation and Recognition [113.31415771943162]
We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition.
The proposed method incorporates four self-supervised and supervised subtasks for cross modality learning.
It achieves between 1.7 and 2.3 BLEU improvement above the state of the art on the MuST-C speech translation dataset.
arXiv Detail & Related papers (2022-04-11T20:59:51Z)
- mSLAM: Massively multilingual joint pre-training for speech and text [43.32334037420761]
mSLAM learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages.
We find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID.
arXiv Detail & Related papers (2022-02-03T02:26:40Z)
- A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.