BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge Distillation
- URL: http://arxiv.org/abs/2405.19041v1
- Date: Wed, 29 May 2024 12:32:08 GMT
- Title: BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge Distillation
- Authors: Chen Wang, Minpeng Liao, Zhongqiang Huang, Jiajun Zhang
- Abstract summary: We introduce BLSP-KD, a novel approach for Bootstrapping Language-Speech Pretraining via Knowledge Distillation.
It optimizes speech-text alignment by minimizing the divergence between the LLM's next-token prediction distributions for speech and text inputs.
It also employs a continuous integrate-and-fire strategy to segment speech into tokens that correspond one-to-one with text tokens, enabling fine-grained alignment.
- Score: 18.329192763760034
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent end-to-end approaches have shown promise in extending large language models (LLMs) to speech inputs, but face limitations in directly assessing and optimizing alignment quality and fail to achieve fine-grained alignment due to speech-text length mismatch. We introduce BLSP-KD, a novel approach for Bootstrapping Language-Speech Pretraining via Knowledge Distillation, which addresses these limitations through two key techniques. First, it optimizes speech-text alignment by minimizing the divergence between the LLM's next-token prediction distributions for speech and text inputs using knowledge distillation. Second, it employs a continuous integrate-and-fire strategy to segment speech into tokens that correspond one-to-one with text tokens, enabling fine-grained alignment. We also introduce Partial LoRA (PLoRA), a new adaptation method supporting LLM finetuning for speech inputs under knowledge distillation. Quantitative evaluation shows that BLSP-KD outperforms previous end-to-end baselines and cascaded systems of comparable parameter scale, facilitating general instruction-following capabilities for LLMs with speech inputs. This approach provides new possibilities for extending LLMs to spoken language interactions.
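The sketch below is a minimal, illustrative reading of the two techniques named in the abstract, not the authors' implementation: a continuous integrate-and-fire (CIF) style module that compresses frame-level speech features into one vector per text token, and a knowledge-distillation loss that pulls the LLM's next-token distributions for speech input toward its distributions for the paired text input. All names (`cif_segment`, `kd_alignment_loss`), shapes, the firing threshold of 1.0, and the weight-rescaling trick are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def cif_segment(frames: torch.Tensor, alphas: torch.Tensor, num_tokens: int) -> torch.Tensor:
    """Continuous integrate-and-fire: accumulate per-frame weights and emit one
    integrated vector each time the accumulator crosses 1.0.

    frames: (T, D) speech encoder outputs
    alphas: (T,)  non-negative firing weights
    returns: (num_tokens, D) token-level speech representations, one per text token
    """
    # Rescale the weights so that exactly `num_tokens` boundaries fire (a common
    # CIF training-time trick; assumed here, not taken from the paper).
    alphas = alphas * (num_tokens / alphas.sum().clamp(min=1e-6))
    tokens = []
    acc_weight = 0.0
    acc_vec = torch.zeros(frames.size(1))
    for t in range(frames.size(0)):
        a = float(alphas[t])
        while acc_weight + a >= 1.0 and len(tokens) < num_tokens:
            take = 1.0 - acc_weight            # portion of this frame that completes a token
            tokens.append(acc_vec + take * frames[t])
            a -= take
            acc_weight, acc_vec = 0.0, torch.zeros(frames.size(1))
        acc_weight += a
        acc_vec = acc_vec + a * frames[t]
    while len(tokens) < num_tokens:            # guard against floating-point leftovers
        tokens.append(acc_vec)
    return torch.stack(tokens)


def kd_alignment_loss(speech_logits: torch.Tensor, text_logits: torch.Tensor) -> torch.Tensor:
    """KL divergence between the LLM's next-token distributions for text input
    (teacher, detached) and for speech input (student), averaged over positions.

    Both tensors: (num_tokens, vocab_size), position-aligned thanks to the
    one-to-one CIF segmentation above.
    """
    teacher = F.softmax(text_logits.detach(), dim=-1)
    student_log = F.log_softmax(speech_logits, dim=-1)
    return F.kl_div(student_log, teacher, reduction="batchmean")


if __name__ == "__main__":
    T, D, V, N = 50, 16, 100, 7                # frames, feature dim, vocab size, text tokens
    frames = torch.randn(T, D)
    alphas = torch.rand(T)                      # would come from a small weight-predictor net
    speech_tokens = cif_segment(frames, alphas, N)        # (7, 16)

    proj = torch.nn.Linear(D, V)                # stand-in for the adapter + LLM head
    loss = kd_alignment_loss(proj(speech_tokens), torch.randn(N, V))
    print(speech_tokens.shape, float(loss))
```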
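A second sketch covers Partial LoRA (PLoRA), under the assumption, based only on the abstract's wording, that the low-rank update is applied selectively at speech positions so the text path still runs through the frozen base weights used as the distillation teacher. Class and parameter names are illustrative, not the authors' API.

```python
import torch
import torch.nn as nn


class PartialLoRALinear(nn.Module):
    """Frozen base linear layer plus a low-rank update applied only where
    `speech_mask` is True (a hypothetical reading of "Partial LoRA")."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.weight.requires_grad_(False)   # keep the pretrained weights frozen
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_dim, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor, speech_mask: torch.Tensor) -> torch.Tensor:
        # x: (seq, in_dim); speech_mask: (seq,) bool, True at speech positions.
        out = self.base(x)
        delta = (x @ self.lora_a.T) @ self.lora_b.T * self.scale
        return out + speech_mask.unsqueeze(-1).to(delta.dtype) * delta


if __name__ == "__main__":
    layer = PartialLoRALinear(32, 32)
    x = torch.randn(10, 32)
    mask = torch.tensor([True] * 4 + [False] * 6)   # first 4 positions are speech
    print(layer(x, mask).shape)                      # torch.Size([10, 32])
```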
Related papers
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z) - LAST: Language Model Aware Speech Tokenization [24.185165710384997]
We propose a novel approach to training a speech tokenizer by leveraging objectives from pre-trained textual LMs.
Our aim is to transform features from a pre-trained speech model into a new feature space that enables better clustering for speech LMs.
arXiv Detail & Related papers (2024-09-05T16:57:39Z) - SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z) - Prompting Large Language Models with Audio for General-Purpose Speech Summarization [13.415189715216354]
We introduce a framework for speech summarization that leverages the processing and reasoning capabilities of large language models (LLMs).
We propose an end-to-end system that combines an instruction-tuned LLM with an audio encoder that converts speech into token representations that the LLM can interpret.
arXiv Detail & Related papers (2024-06-10T02:04:28Z) - Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of generative large language models (LLM) generated context information.
We propose an approach to distill the generated information during fine-tuning of self-supervised speech models.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z) - BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing [35.31866559807704]
Modality alignment between speech and text remains an open problem.
We propose the BLSP approach that bootstraps Language-Speech Pre-training via behavior alignment of continuation writing.
We demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
arXiv Detail & Related papers (2023-09-02T11:46:05Z) - Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z) - Assessing Phrase Break of ESL Speech with Pre-trained Language Models and Large Language Models [7.782346535009883]
This work introduces approaches to assessing phrase breaks in ESL learners' speech using pre-trained language models (PLMs) and large language models (LLMs).
arXiv Detail & Related papers (2023-06-08T07:10:39Z) - Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z) - Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z) - Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems [31.18865184576272]
This work is a step towards doing the same in a much more efficient and fine-grained manner where we align speech embeddings and BERT embeddings on a token-by-token basis.
We introduce a simple yet novel technique that uses a cross-modal attention mechanism to extract token-level contextual embeddings from a speech encoder.
Fine-tuning such a pretrained model to perform intent recognition using speech directly yields state-of-the-art performance on two widely used SLU datasets.
arXiv Detail & Related papers (2022-04-11T15:24:25Z)