SLM: Bridge the thin gap between speech and text foundation models
- URL: http://arxiv.org/abs/2310.00230v1
- Date: Sat, 30 Sep 2023 02:27:45 GMT
- Title: SLM: Bridge the thin gap between speech and text foundation models
- Authors: Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan
Cao, Yongqiang Wang, Nanxin Chen, Yu Zhang, Hagen Soltau, Paul Rubenstein,
Lukas Zilka, Dian Yu, Zhong Meng, Golan Pundak, Nikhil Siddhartha, Johan
Schalkwyk, Yonghui Wu
- Abstract summary: Speech and Language Model (SLM) is a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models.
We show that SLM is not only efficient to train, but also inherits strong capabilities already acquired in foundation models of different modalities.
- Score: 45.319071954143325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a joint Speech and Language Model (SLM), a multitask,
multilingual, and dual-modal model that takes advantage of pretrained
foundational speech and language models. SLM freezes the pretrained foundation
models to maximally preserve their capabilities, and trains only a simple
adapter with just 1% (156M) of the foundation models' parameters. This
adaptation not only leads SLM to achieve strong performance on conventional
tasks such as speech recognition (ASR) and speech translation (AST), but also
introduces the novel capability of zero-shot instruction-following for more
diverse tasks: given a speech input and a text instruction, SLM is able to
perform unseen generation tasks, including contextual biasing ASR using
real-time context, dialog generation, speech continuation, and question
answering. Our approach demonstrates that the representational gap between
pretrained speech and language models might be narrower than one would expect,
and can be bridged by a simple adaptation mechanism. As a result, SLM is not
only efficient to train, but also inherits strong capabilities already acquired
in foundation models of different modalities.
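The recipe in the abstract is essentially a frozen-backbone, trainable-adapter setup. Below is a minimal PyTorch sketch of that idea with toy stand-in modules and illustrative dimensions; it is not the paper's implementation, whose adapter alone is 156M parameters:

```python
import torch
import torch.nn as nn

class SLMAdapter(nn.Module):
    """Minimal sketch of the SLM recipe: both foundation models stay
    frozen and only a small adapter is trained to map speech-encoder
    outputs into the text LM's embedding space. Names are illustrative."""

    def __init__(self, speech_encoder: nn.Module, text_lm: nn.Module,
                 speech_dim: int, lm_dim: int):
        super().__init__()
        self.speech_encoder = speech_encoder
        self.text_lm = text_lm
        for p in self.speech_encoder.parameters():
            p.requires_grad = False        # freeze speech foundation model
        for p in self.text_lm.parameters():
            p.requires_grad = False        # freeze language foundation model
        # The only trainable component: a simple projection adapter.
        self.adapter = nn.Sequential(
            nn.Linear(speech_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim))

    def forward(self, speech: torch.Tensor, instruction_embeds: torch.Tensor):
        with torch.no_grad():
            feats = self.speech_encoder(speech)      # (B, T, speech_dim)
        prefix = self.adapter(feats)                 # (B, T, lm_dim)
        # Adapted speech features are prepended to the embedded text
        # instruction; the frozen LM then generates from the joint prefix.
        return self.text_lm(torch.cat([prefix, instruction_embeds], dim=1))

# Toy stand-ins just to show the wiring end to end.
enc = nn.Linear(80, 512)      # pretend speech encoder over 80-dim filterbanks
lm = nn.Linear(1024, 1024)    # pretend decoder-only LM over embeddings
model = SLMAdapter(enc, lm, speech_dim=512, lm_dim=1024)
out = model(torch.randn(2, 50, 80), torch.randn(2, 10, 1024))
print(out.shape)  # torch.Size([2, 60, 1024])
```

Because gradients flow only through the adapter, training touches a tiny fraction of the total parameters, which is what makes the approach cheap relative to full fine-tuning.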
Related papers
- Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing [17.92378239787507]
We present a decoder-only Discrete Multimodal Language Model (DMLM).
DMLM can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision).
Our results show that DMLM benefits significantly, across multiple tasks and datasets, from a combination of supervised and unsupervised training.
arXiv Detail & Related papers (2024-06-04T20:08:25Z)
- SpeechVerse: A Large-scale Generalizable Audio Language Model [38.67969337605572]
SpeechVerse is a robust multi-task training and curriculum learning framework.
It combines pre-trained speech and text foundation models via a small set of learnable parameters.
Our empirical experiments reveal that the multi-task SpeechVerse model outperforms conventional task-specific baselines on 9 of the 11 tasks.
arXiv Detail & Related papers (2024-05-14T03:33:31Z)
- SpiRit-LM: Interleaved Spoken and Written Language Model [45.44798658207754]
SPIRIT-LM is a foundation multimodal language model that freely mixes text and speech.
The model is based on a pretrained text language model that we extend to the speech modality by continuously training it on text and speech units.
arXiv Detail & Related papers (2024-02-08T15:39:32Z)
- VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation [91.39949385661379]
VioLA is a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text.
We first convert all the speech utterances to discrete tokens using an offline neural encoder.
We further integrate task IDs (TID) and language IDs (LID) into the proposed model to enhance its ability to handle different languages and tasks (see the token-layout sketch after this list).
arXiv Detail & Related papers (2023-05-25T14:39:47Z)
- The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation [13.352795145385645]
Speech translation (ST) is a good means of pretraining speech models for end-to-end spoken language understanding.
We show that our models outperform baselines on monolingual and multilingual intent classification.
We also create new benchmark datasets for speech summarization and low-resource/zero-shot transfer from English to French or Spanish.
arXiv Detail & Related papers (2023-05-16T17:53:03Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose VATLM (Visual-Audio-Text Language Model), a unified cross-modal representation learning framework.
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Bridging Speech and Textual Pre-trained Models with Unsupervised ASR [70.61449720963235]
This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models.
We show that unsupervised automatic speech recognition (ASR) can improve the representations from speech self-supervised models.
Notably, on spoken question answering, we reach state-of-the-art results on the challenging NMSQA benchmark.
arXiv Detail & Related papers (2022-11-06T04:50:37Z)
- An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on the Generative Spoken Language Model (GSLM).
Experiment results show that prompt tuning achieves competitive performance on speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models (see the soft-prompt sketch after this list).
arXiv Detail & Related papers (2022-03-31T03:26:55Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM built on linguistic units including syllables and phonemes (see the unit-LM sketch after this list).
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
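To make the VioLA entry above concrete, here is a hedged sketch of its token-layout idea: speech is converted offline to discrete codec tokens, and task/language ID tokens are prepended so a single decoder-only model can route among tasks. All ID values and names below are invented for illustration:

```python
from typing import List

# Hypothetical control-token vocabulary slots, not VioLA's actual IDs.
TASK_IDS = {"asr": 0, "tts": 1, "s2tt": 2}
LANG_IDS = {"en": 10, "zh": 11}

def build_input(task: str, lang: str, codec_tokens: List[int]) -> List[int]:
    """Prefix offline-extracted discrete speech tokens with <TID> and <LID>
    control tokens, so one autoregressive decoder handles every task."""
    return [TASK_IDS[task], LANG_IDS[lang]] + codec_tokens

print(build_input("asr", "en", [742, 105, 993]))  # [0, 10, 742, 105, 993]
```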
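The GSLM prompt-tuning entry describes a standard soft-prompt setup: freeze the backbone and train only a short sequence of prompt vectors. A minimal sketch under those assumptions, with an illustrative wrapper and dimensions rather than GSLM's actual code:

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Prompt tuning on a frozen generative spoken LM: the backbone is
    frozen and only learnable prompt vectors are trained."""

    def __init__(self, frozen_lm: nn.Module, embed_dim: int, n_prompts: int = 20):
        super().__init__()
        self.lm = frozen_lm
        for p in self.lm.parameters():
            p.requires_grad = False              # backbone stays frozen
        # Far fewer trainable parameters than fine-tuning the whole model.
        self.prompts = nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02)

    def forward(self, unit_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the learned soft prompts to the embedded unit sequence.
        prefix = self.prompts.unsqueeze(0).expand(unit_embeds.size(0), -1, -1)
        return self.lm(torch.cat([prefix, unit_embeds], dim=1))

wrapped = SoftPromptWrapper(nn.Linear(256, 256), embed_dim=256)
print(wrapped(torch.randn(4, 30, 256)).shape)  # torch.Size([4, 50, 256])
```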
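Finally, the sub-word linguistic-unit entry amounts to a small autoregressive LM over phoneme or syllable IDs. A toy sketch with an invented vocabulary size:

```python
import torch
import torch.nn as nn

class UnitLSTMLM(nn.Module):
    """LSTM generative LM over linguistic units (phonemes/syllables);
    vocabulary size and dimensions here are illustrative."""

    def __init__(self, n_units: int = 64, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(n_units, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, n_units)      # next-unit prediction

    def forward(self, units: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.embed(units))
        return self.head(h)                      # (B, T, n_units) logits

lm = UnitLSTMLM()
logits = lm(torch.randint(0, 64, (2, 15)))       # batch of phoneme ID sequences
print(logits.shape)  # torch.Size([2, 15, 64])
```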
This list is automatically generated from the titles and abstracts of the papers on this site.