SLM: Bridge the thin gap between speech and text foundation models
- URL: http://arxiv.org/abs/2310.00230v1
- Date: Sat, 30 Sep 2023 02:27:45 GMT
- Title: SLM: Bridge the thin gap between speech and text foundation models
- Authors: Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan
Cao, Yongqiang Wang, Nanxin Chen, Yu Zhang, Hagen Soltau, Paul Rubenstein,
Lukas Zilka, Dian Yu, Zhong Meng, Golan Pundak, Nikhil Siddhartha, Johan
Schalkwyk, Yonghui Wu
- Abstract summary: Speech and Language Model (SLM) is a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models.
We show that SLM is not only efficient to train, but also inherits strong capabilities already acquired in foundation models of different modalities.
- Score: 45.319071954143325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a joint Speech and Language Model (SLM), a multitask,
multilingual, and dual-modal model that takes advantage of pretrained
foundational speech and language models. SLM freezes the pretrained foundation
models to maximally preserve their capabilities, and only trains a simple
adapter with just 1% (156M) of the foundation models' parameters. This
adaptation not only leads SLM to achieve strong performance on conventional
tasks such as speech recognition (ASR) and speech translation (AST), but also
introduces the novel capability of zero-shot instruction-following for more
diverse tasks: given a speech input and a text instruction, SLM is able to
perform unseen generation tasks such as contextual biasing ASR using
real-time context, dialog generation, speech continuation, and question
answering. Our approach demonstrates that the representational gap between
pretrained speech and language models might be narrower than one would expect,
and can be bridged by a simple adaptation mechanism. As a result, SLM is not
only efficient to train, but also inherits strong capabilities already acquired
in foundation models of different modalities.
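To make the adapter-only recipe concrete, the following is a minimal PyTorch-style sketch, assuming a frozen speech encoder that emits frame-level features and a frozen decoder-only language model that can consume a sequence of embeddings; the module and parameter names are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class SpeechTextAdapter(nn.Module):
    """Illustrative sketch of the adapter-only recipe: freeze both
    pretrained backbones and train only a small projection between them
    (hypothetical names, not the SLM authors' implementation)."""

    def __init__(self, speech_encoder: nn.Module, text_lm: nn.Module,
                 speech_dim: int, lm_dim: int):
        super().__init__()
        self.speech_encoder = speech_encoder  # pretrained speech foundation model, frozen
        self.text_lm = text_lm                # pretrained language model, frozen
        for backbone in (self.speech_encoder, self.text_lm):
            for p in backbone.parameters():
                p.requires_grad = False
        # The only trainable component: a small adapter that projects
        # speech features into the language model's embedding space.
        self.adapter = nn.Sequential(
            nn.Linear(speech_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, speech_frames: torch.Tensor,
                instruction_embeddings: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.speech_encoder(speech_frames)    # (batch, frames, speech_dim)
        speech_embeds = self.adapter(feats)               # (batch, frames, lm_dim)
        # Prepend the embedded text instruction and let the frozen LM
        # generate; assumes the LM accepts embedding sequences directly.
        joint = torch.cat([instruction_embeddings, speech_embeds], dim=1)
        return self.text_lm(joint)
```

Only self.adapter carries trainable parameters, which is the sense in which the abstract reports updating roughly 1% (156M) of the combined parameter count.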
Related papers
- VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach using low-rank adaptation (LoRA) of the large language model (LLM) backbone; a generic LoRA sketch appears after this list.
Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z)
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- LAST: Language Model Aware Speech Tokenization [24.185165710384997]
We propose a novel approach to training a speech tokenizer by leveraging objectives from pre-trained textual LMs.
Our aim is to transform features from a pre-trained speech model into a new feature space that enables better clustering for speech LMs.
arXiv Detail & Related papers (2024-09-05T16:57:39Z)
- SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are the first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks; a sketch of unit extraction appears after this list.
We show that prompting can achieve performance competitive with strong fine-tuning methods.
arXiv Detail & Related papers (2024-08-23T13:00:10Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose VATLM (Visual-Audio-Text Language Model), a unified cross-modal representation learning framework.
The proposed VATLM employs a unified backbone network to model modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Bridging Speech and Textual Pre-trained Models with Unsupervised ASR [70.61449720963235]
This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models.
We show that unsupervised automatic speech recognition (ASR) can improve the representations from speech self-supervised models.
Notably, on spoken question answering, we reach state-of-the-art results on the challenging NMSQA benchmark.
arXiv Detail & Related papers (2022-11-06T04:50:37Z)
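The VoiceTextBlender entry above fine-tunes the LLM backbone with low-rank adaptation (LoRA). As a generic illustration rather than that paper's code, the sketch below shows the standard LoRA parameterisation of a single linear layer: the base weight stays frozen and only two low-rank factors are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA sketch (not from VoiceTextBlender):
    y = base(x) + (alpha / r) * x @ A^T @ B^T, with the base weight frozen."""

    def __init__(self, base_linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False                       # frozen backbone weight
        in_f, out_f = base_linear.in_features, base_linear.out_features
        self.lora_a = nn.Parameter(torch.randn(rank, in_f) * 0.01)   # trainable factor A
        self.lora_b = nn.Parameter(torch.zeros(out_f, rank))         # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        low_rank_update = (x @ self.lora_a.T) @ self.lora_b.T
        return self.base(x) + self.scale * low_rank_update
```

Because the update is rank-limited, the trainable parameter count per layer is rank * (in_features + out_features), a small fraction of the frozen weight matrix.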
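Several entries above (LAST, SpeechPrompt) operate on discrete speech units rather than continuous features. One common way to obtain such units, shown here purely as an assumed illustration and not as either paper's actual pipeline, is to cluster frame-level features from a pretrained speech encoder with k-means and collapse consecutive repeats.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_unit_vocabulary(pooled_features: np.ndarray, num_units: int = 100) -> KMeans:
    """Fit k-means over pooled frame-level features (frames x dims) to
    define a discrete unit vocabulary (illustrative, HuBERT-style units)."""
    return KMeans(n_clusters=num_units, n_init=10, random_state=0).fit(pooled_features)

def speech_to_units(utterance_features: np.ndarray, km: KMeans) -> list[int]:
    """Map one utterance's frame features to unit IDs, collapsing repeats
    so the result reads like a token sequence a speech LM can model."""
    ids = km.predict(utterance_features)
    return [int(u) for i, u in enumerate(ids) if i == 0 or u != ids[i - 1]]

# Hypothetical usage with random arrays standing in for encoder outputs.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pooled = rng.normal(size=(5000, 256)).astype(np.float32)
    km = train_unit_vocabulary(pooled, num_units=50)
    utterance = rng.normal(size=(120, 256)).astype(np.float32)
    print(speech_to_units(utterance, km)[:20])
```

A speech-to-unit task then amounts to predicting such unit sequences, which is what lets a text-style LM interface treat speech as just another token stream.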