Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for
Speech Understanding
- URL: http://arxiv.org/abs/2306.07944v1
- Date: Thu, 8 Jun 2023 22:33:22 GMT
- Title: Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for
Speech Understanding
- Authors: Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu,
Laurent El Shafey
- Abstract summary: We propose a joint speech and language model (SLM) using a Speech2Text adapter.
SLM maps speech into text token embedding space without speech information loss.
In speech MultiWoz dataset (DSTC11 challenge), SLM largely improves the dialog state tracking (DST) performance.
- Score: 13.527613396601268
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) have been applied in the speech domain, often
incurring a performance drop due to misaligned between speech and language
representations. To bridge this gap, we propose a joint speech and language
model (SLM) using a Speech2Text adapter, which maps speech into text token
embedding space without speech information loss. Additionally, using a
CTC-based blank-filtering, we can reduce the speech sequence length to that of
text. In speech MultiWoz dataset (DSTC11 challenge), SLM largely improves the
dialog state tracking (DST) performance (24.7% to 28.4% accuracy). Further to
address errors on rare entities, we augment SLM with a Speech2Entity retriever,
which uses speech to retrieve relevant entities, and then adds them to the
original SLM input as a prefix. With this retrieval-augmented SLM (ReSLM), the
DST performance jumps to 34.6% accuracy. Moreover, augmenting the ASR task with
the dialog understanding task improves the ASR performance from 9.4% to 8.5%
WER.
Related papers
- VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach on the low-rank adaptation (LoRA) of the large language models (LLMs) backbone.
Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z) - Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs)
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z) - Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM [3.6950912517562435]
We propose a method that implicitly internalizes ASR chain of thought into a speech LLM, enhancing its native speech understanding capabilities.
Our approach reduces latency and improves the model's native understanding of speech, paving the way for more efficient and natural real-time audio interactions.
arXiv Detail & Related papers (2024-09-25T20:59:12Z) - DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z) - Retrieval Augmented End-to-End Spoken Dialog Models [20.896330994089283]
We apply SLM to speech dialog applications where the dialog states are inferred directly from the audio signal.
Inspired by the RAG (retrieval-augmented generation) paradigm, we propose a retrieval augmented SLM (ReSLM) that overcomes this weakness.
We evaluated ReSLM on speech MultiWoz task (DSTC-11 challenge), and found that this retrieval augmentation boosts model performance.
arXiv Detail & Related papers (2024-02-02T18:23:09Z) - SALM: Speech-augmented Language Model with In-context Learning for
Speech Recognition and Translation [26.778332992311043]
We present a novel Speech Augmented Language Model (SALM) with em multitask and em in-context learning capabilities.
The unified SALM achieves performance on par with task-specific Conformer baselines for Automatic Speech Recognition (ASR) and Speech Translation (AST)
arXiv Detail & Related papers (2023-10-13T22:07:33Z) - Voxtlm: unified decoder-only models for consolidating speech
recognition/synthesis and speech/text continuation tasks [61.3055230762097]
We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation.
VoxtLM integrates text vocabulary with discrete speech tokens from self-supervised speech features and uses special tokens to enable multitask learning.
arXiv Detail & Related papers (2023-09-14T03:13:18Z) - SpeechGen: Unlocking the Generative Power of Speech Language Models with
Prompts [108.04306136086807]
We present research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen.
The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs.
arXiv Detail & Related papers (2023-06-03T22:35:27Z) - SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.