Retrieval Augmented End-to-End Spoken Dialog Models
- URL: http://arxiv.org/abs/2402.01828v1
- Date: Fri, 2 Feb 2024 18:23:09 GMT
- Title: Retrieval Augmented End-to-End Spoken Dialog Models
- Authors: Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu,
Laurent El Shafey
- Abstract summary: We apply SLM to speech dialog applications where the dialog states are inferred directly from the audio signal.
Inspired by the RAG (retrieval-augmented generation) paradigm, we propose a retrieval-augmented SLM (ReSLM) that overcomes the difficulty of recognizing domain-specific entities.
We evaluated ReSLM on the speech MultiWOZ task (DSTC11 challenge) and found that this retrieval augmentation boosts model performance.
- Score: 20.896330994089283
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We recently developed SLM, a joint speech and language model, which fuses a
pretrained foundational speech model and a large language model (LLM), while
preserving the in-context learning capability intrinsic to the pretrained LLM.
In this paper, we apply SLM to speech dialog applications where the dialog
states are inferred directly from the audio signal.
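As a rough illustration of the fusion described above, the sketch below projects a pretrained speech encoder's frame outputs into an LLM's token-embedding space so the LLM can consume speech alongside a text prompt. The module names, dimensions, the frozen-LLM choice, and the HuggingFace-style `inputs_embeds` interface are all assumptions for illustration, not the authors' released code.

```python
# Minimal sketch of an SLM-style speech/LLM fusion (illustrative only).
# Assumptions: the speech encoder returns (batch, frames, speech_dim) features,
# and the LLM follows a HuggingFace-style forward accepting `inputs_embeds`.
import torch
import torch.nn as nn

class SpeechLLMFusion(nn.Module):
    def __init__(self, speech_encoder: nn.Module, llm: nn.Module,
                 speech_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        self.speech_encoder = speech_encoder           # pretrained speech model
        self.adapter = nn.Linear(speech_dim, llm_dim)  # maps speech frames into the
                                                       # LLM token-embedding space
        self.llm = llm                                 # pretrained LLM; frozen here as
        for p in self.llm.parameters():                # one way to preserve its
            p.requires_grad = False                    # in-context learning ability

    def forward(self, audio: torch.Tensor, prompt_embeds: torch.Tensor):
        speech_frames = self.speech_encoder(audio)     # (batch, frames, speech_dim)
        speech_embeds = self.adapter(speech_frames)    # (batch, frames, llm_dim)
        # The LLM attends over the text prompt followed by the speech embeddings.
        inputs = torch.cat([prompt_embeds, speech_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```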
Task-oriented dialogs often contain domain-specific entities, e.g., restaurant,
hotel, train station, and city names, which are difficult to recognize yet
critical for downstream applications. Inspired by the RAG (retrieval-augmented
generation) paradigm, we propose a retrieval-augmented SLM (ReSLM) that
overcomes this weakness. We first train a speech retriever to
retrieve text entities mentioned in the audio. The retrieved entities are then
added as text inputs to the underlying SLM to bias model predictions. We
evaluated ReSLM on the speech MultiWOZ task (DSTC11 challenge) and found that
this retrieval augmentation boosts model performance, improving joint goal
accuracy (38.6% vs. 32.7%), slot error rate (20.6% vs. 24.8%), and ASR word
error rate (5.5% vs. 6.7%). While demonstrated on dialog state tracking, our
approach
is broadly applicable to other speech tasks requiring contextual information or
domain-specific entities, such as contextual ASR with biasing capability.
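To make the retrieval step concrete, here is a minimal sketch of how a dual-encoder speech retriever and prompt biasing could fit together: candidate entity strings are scored against an utterance embedding, and the top-k hits are inserted into the SLM's text prompt. The cosine-similarity scoring, top-k value, and prompt wording are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of a ReSLM-style retrieval step (illustrative assumptions).
import torch
import torch.nn.functional as F

def retrieve_entities(audio_embed: torch.Tensor,
                      entity_embeds: torch.Tensor,
                      entities: list[str],
                      k: int = 5) -> list[str]:
    # audio_embed: (dim,) utterance embedding from a speech encoder.
    # entity_embeds: (num_entities, dim) precomputed text-encoder embeddings
    # of the domain entity list (restaurant, hotel, station, city names, ...).
    scores = F.cosine_similarity(audio_embed.unsqueeze(0), entity_embeds, dim=-1)
    topk = torch.topk(scores, k=min(k, len(entities))).indices
    return [entities[i] for i in topk]

def build_biased_prompt(retrieved: list[str]) -> str:
    # The retrieved entities enter the SLM as plain text, biasing its predictions.
    return "Possible entities: " + ", ".join(retrieved) + "\nDialog state:"
```

Note that the entity embeddings can be precomputed once per domain, so retrieval at inference time reduces to a single similarity computation over the candidate list.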
Related papers
- Unified Speech-Text Pretraining for Spoken Dialog Modeling [42.59768604228263]
This work proposes an extensive speech-text LLM framework to generate coherent spoken responses with organic prosodic features relevant to the given input speech.
Our approach employs a multi-step speech-text inference scheme that leverages chain-of-reasoning capabilities exhibited by the underlying LLM.
We show that the proposed approach is effective in generating natural-sounding spoken responses, outperforming both prior and cascaded baselines.
arXiv Detail & Related papers (2024-02-08T14:35:09Z)
- Are LLMs Robust for Spoken Dialogues? [10.855403629160921]
Large Pre-Trained Language Models have demonstrated state-of-the-art performance in different downstream tasks.
Most of the publicly available datasets and benchmarks on task-oriented dialogues focus on written conversations.
We have evaluated the performance of LLMs for spoken task-oriented dialogues on the DSTC11 test sets.
arXiv Detail & Related papers (2024-01-04T14:36:38Z)
- Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
The Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT) model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z)
- Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech Understanding [13.527613396601268]
We propose a joint speech and language model (SLM) using a Speech2Text adapter.
SLM maps speech into the text token embedding space without speech information loss.
On the speech MultiWOZ dataset (DSTC11 challenge), SLM largely improves dialog state tracking (DST) performance.
arXiv Detail & Related papers (2023-06-08T22:33:22Z)
- SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents [72.42049370297849]
SpokenWOZ is a large-scale speech-text dataset for spoken TOD.
Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z)
- Joint Modelling of Spoken Language Understanding Tasks with Integrated Dialog History [30.20353302347147]
We propose a novel model architecture that learns dialog context to jointly predict the intent, dialog act, speaker role, and emotion for the spoken utterance.
Our experiments show that our joint model achieves similar results to task-specific classifiers.
arXiv Detail & Related papers (2023-05-01T16:26:18Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)
- Bridging Speech and Textual Pre-trained Models with Unsupervised ASR [70.61449720963235]
This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models.
We show that unsupervised automatic speech recognition (ASR) can improve the representations from speech self-supervised models.
Notably, on spoken question answering, we reach state-of-the-art results on the challenging NMSQA benchmark.
arXiv Detail & Related papers (2022-11-06T04:50:37Z)
- Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native automatic speech scoring (ASS), called speaker-conditioned hierarchical modeling.
Our technique takes advantage of the fact that oral proficiency tests rate multiple responses for a candidate. We extract context from these responses and feed it as additional speaker-specific context to our network to score a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z)