Retrieval Augmented End-to-End Spoken Dialog Models
- URL: http://arxiv.org/abs/2402.01828v1
- Date: Fri, 2 Feb 2024 18:23:09 GMT
- Title: Retrieval Augmented End-to-End Spoken Dialog Models
- Authors: Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu,
Laurent El Shafey
- Abstract summary: We apply SLM to speech dialog applications where the dialog states are inferred directly from the audio signal.
Inspired by the RAG (retrieval-augmented generation) paradigm, we propose a retrieval augmented SLM (ReSLM) that overcomes the difficulty of recognizing domain-specific entities.
We evaluated ReSLM on the speech MultiWoz task (DSTC-11 challenge) and found that this retrieval augmentation boosts model performance.
- Score: 20.896330994089283
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We recently developed SLM, a joint speech and language model, which fuses a
pretrained foundational speech model and a large language model (LLM), while
preserving the in-context learning capability intrinsic to the pretrained LLM.
In this paper, we apply SLM to speech dialog applications where the dialog
states are inferred directly from the audio signal.
Task-oriented dialogs often contain domain-specific entities, e.g.,
restaurants, hotels, train stations, and city names, which are difficult to
recognize yet critical for downstream applications. Inspired by the
RAG (retrieval-augmented generation) paradigm, we propose a retrieval augmented
SLM (ReSLM) that overcomes this weakness. We first train a speech retriever to
retrieve text entities mentioned in the audio. The retrieved entities are then
added as text inputs to the underlying SLM to bias model predictions. We
evaluated ReSLM on the speech MultiWoz task (DSTC-11 challenge) and found that
this retrieval augmentation boosts model performance, improving joint goal
accuracy (38.6% vs. 32.7%), slot error rate (20.6% vs. 24.8%), and ASR word
error rate (5.5% vs. 6.7%). While demonstrated on dialog state tracking, our approach
is broadly applicable to other speech tasks requiring contextual information or
domain-specific entities, such as contextual ASR with biasing capability.
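For readers who want the shape of the method in code, here is a minimal, self-contained sketch of the retrieve-then-bias loop the abstract describes. The embedding functions and the slm_generate stub are toy stand-ins, not the authors' trained speech retriever or SLM; only the overall data flow is intended to match.
```python
# Minimal sketch of ReSLM's retrieve-then-bias flow. embed_audio,
# embed_text, and slm_generate are placeholders; the paper trains a real
# speech retriever and feeds retrieved entities to SLM as text input.
import numpy as np

DIM = 64
_rng = np.random.default_rng(0)
_AUDIO_PROJ = _rng.standard_normal((16_000, DIM))  # toy 1 s @ 16 kHz clips


def embed_audio(audio: np.ndarray) -> np.ndarray:
    """Stand-in for the trained speech-side retriever encoder."""
    v = audio @ _AUDIO_PROJ
    return v / np.linalg.norm(v)


def embed_text(entity: str) -> np.ndarray:
    """Stand-in for the text-side retriever encoder (hash-seeded so it is
    deterministic per entity string within a run)."""
    seed = abs(hash(entity)) % (2**32)
    v = np.random.default_rng(seed).standard_normal(DIM)
    return v / np.linalg.norm(v)


def retrieve_entities(audio: np.ndarray, catalog: list[str], k: int = 3) -> list[str]:
    """Score every catalog entity against the utterance; keep the top k."""
    q = embed_audio(audio)
    return sorted(catalog, key=lambda e: float(q @ embed_text(e)), reverse=True)[:k]


def slm_generate(prompt: str, audio: np.ndarray) -> str:
    """Stand-in for the underlying SLM, which conditions on audio + text."""
    return f"<dialog state decoded with bias: {prompt}>"


def reslm_predict(audio: np.ndarray, catalog: list[str]) -> str:
    # Retrieved entities are injected as extra *text* input so the model
    # can copy hard-to-recognize names instead of misrecognizing them.
    entities = retrieve_entities(audio, catalog)
    prompt = "Possible entities: " + ", ".join(entities)
    return slm_generate(prompt, audio)


if __name__ == "__main__":
    catalog = ["curry garden", "gonville hotel", "cambridge", "ely"]
    audio = _rng.standard_normal(16_000)  # placeholder utterance
    print(reslm_predict(audio, catalog))
```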
Related papers
- Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models [16.920823078873095]
Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword.
On a real-world dataset of follow-up conversations, we show that jointly modeling the previous speech context and ASR uncertainty yields large gains.
arXiv Detail & Related papers (2024-10-28T19:43:43Z)
- VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach that applies low-rank adaptation (LoRA) to the large language model (LLM) backbone (see the sketch after this entry).
Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
arXiv Detail & Related papers (2024-10-23T00:36:06Z)
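As a generic sketch of the LoRA setup such a recipe builds on, the snippet below attaches adapters to an LLM backbone with the Hugging Face peft library. The backbone checkpoint and hyperparameters are illustrative assumptions, and the speech-encoder inputs and joint SFT data pipeline are omitted.
```python
# Sketch: attach LoRA adapters to an LLM backbone, as single-stage joint
# speech-text SFT recipes do. The checkpoint and hyperparameters are
# illustrative choices, not VoiceTextBlender's exact configuration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

backbone = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
lora = LoraConfig(
    r=8,                                  # low-rank bottleneck width
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(backbone, lora)
model.print_trainable_parameters()        # only adapter weights train

# Joint SFT would then interleave (speech-embedding, text) and text-only
# batches through this mostly-frozen backbone; that pipeline is omitted.
```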
- IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [55.11130688075417]
We introduce IntrinsicVoice, an LLM designed with intrinsic real-time voice interaction capabilities.
Our novel architecture, GroupFormer, reduces speech sequences to lengths comparable to text sequences.
We construct a multi-turn speech-to-speech dialogue dataset named method-500k, which includes nearly 500k turns of speech-to-speech dialogues.
arXiv Detail & Related papers (2024-10-09T05:04:31Z)
- Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments.
Our approach utilizes WavLM and the Whisper encoder to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context (a fusion sketch follows this entry).
Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z)
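The following is a hedged sketch of dual-encoder feature extraction in this spirit, using public checkpoints via the transformers library. The specific checkpoints and fusion by simple concatenation are my assumptions, not necessarily MT-LLM's exact design.
```python
# Sketch: concatenate WavLM (speaker-sensitive) and Whisper-encoder
# (semantic) frame features. Checkpoints and fusion-by-concatenation are
# assumptions, not necessarily MT-LLM's exact setup.
import torch
from transformers import WavLMModel, WhisperFeatureExtractor, WhisperModel

SR = 16_000
audio = torch.randn(SR * 5)  # stand-in for a 5 s, 16 kHz waveform

wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base")
whisper = WhisperModel.from_pretrained("openai/whisper-base")
fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")

with torch.no_grad():
    spk = wavlm(audio.unsqueeze(0)).last_hidden_state             # (1, T1, 768)
    mel = fe(audio.numpy(), sampling_rate=SR, return_tensors="pt").input_features
    sem = whisper.encoder(input_features=mel).last_hidden_state   # (1, 1500, 512)

# Both encoders emit roughly 50 Hz frames, but Whisper pads inputs to
# 30 s, so trim to the shorter sequence before fusing.
t = min(spk.shape[1], sem.shape[1])
fused = torch.cat([spk[:, :t], sem[:, :t]], dim=-1)               # (1, t, 1280)
print(fused.shape)  # these frames would then be adapted into the LLM
```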
- Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation [46.93969003104427]
This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM).
USDM is designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech.
Our approach effectively generates natural-sounding spoken responses, surpassing previous and cascaded baselines.
arXiv Detail & Related papers (2024-02-08T14:35:09Z)
- Are LLMs Robust for Spoken Dialogues? [10.855403629160921]
Large Pre-Trained Language Models have demonstrated state-of-the-art performance in different downstream tasks.
Most of the publicly available datasets and benchmarks on task-oriented dialogues focus on written conversations.
We have evaluated the performance of LLMs for spoken task-oriented dialogues on the DSTC11 test sets.
arXiv Detail & Related papers (2024-01-04T14:36:38Z)
- Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech Understanding [13.527613396601268]
We propose a joint speech and language model (SLM) using a Speech2Text adapter.
SLM maps speech into the text token embedding space without loss of speech information (a minimal adapter sketch follows this entry).
On the speech MultiWoz dataset (DSTC-11 challenge), SLM largely improves dialog state tracking (DST) performance.
arXiv Detail & Related papers (2023-06-08T22:33:22Z)
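As a rough illustration of what such an adapter can look like, here is a minimal PyTorch sketch that stacks speech-encoder frames and projects them into a text-embedding space, ready to prepend to text token embeddings. The dimensions and frame-stacking scheme are assumptions, not the paper's exact design.
```python
# Minimal sketch of a Speech2Text-style adapter: shorten the speech frame
# sequence and project it into the LLM's text-embedding space. Dimensions
# and the frame-stacking scheme are assumptions, not the paper's design.
import torch
import torch.nn as nn


class Speech2TextAdapter(nn.Module):
    def __init__(self, speech_dim: int = 1024, text_dim: int = 4096, stride: int = 4):
        super().__init__()
        self.stride = stride  # how many speech frames merge into one "token"
        self.proj = nn.Sequential(
            nn.Linear(speech_dim * stride, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, T, speech_dim) from a (typically frozen) speech encoder
        b, t, d = frames.shape
        t -= t % self.stride  # drop ragged tail frames
        stacked = frames[:, :t].reshape(b, t // self.stride, d * self.stride)
        return self.proj(stacked)  # (batch, T // stride, text_dim)


speech = torch.randn(2, 80, 1024)  # toy speech-encoder output
prefix = Speech2TextAdapter()(speech)
print(prefix.shape)  # torch.Size([2, 20, 4096]); concat with text embeddings
```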
- SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents [72.42049370297849]
SpokenWOZ is a large-scale speech-text dataset for spoken TOD.
Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z)
- SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.