Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2505.00028v1
- Date: Sun, 27 Apr 2025 14:35:24 GMT
- Title: Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation
- Authors: Pengchao Feng, Ziyang Ma, Wenxi Chen, Yao Li, Sheng Wang, Kai Yu, Xie Chen,
- Abstract summary: We propose a novel end-to-end RAG framework that directly retrieves relevant textual knowledge from speech queries.<n> Experimental results demonstrate that our method significantly improves the performance of end-to-end S2S dialogue systems.<n>Although the overall performance still lags behind cascaded models, our framework offers a promising direction for enhancing knowledge integration in end-to-end S2S systems.
- Score: 13.559210762117061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, end-to-end speech-to-speech (S2S) dialogue systems have garnered increasing research attention due to their advantages over traditional cascaded systems, including achieving lower latency and more natural integration of nonverbal cues such as emotion and speaker identity. However, these end-to-end systems face key challenges, particularly in incorporating external knowledge, a capability commonly addressed by Retrieval-Augmented Generation (RAG) in text-based large language models (LLMs). The core difficulty lies in the modality gap between input speech and retrieved textual knowledge, which hinders effective integration. To address this issue, we propose a novel end-to-end RAG framework that directly retrieves relevant textual knowledge from speech queries, eliminating the need for intermediate speech-to-text conversion via techniques like ASR. Experimental results demonstrate that our method significantly improves the performance of end-to-end S2S dialogue systems while achieving higher retrieval efficiency. Although the overall performance still lags behind cascaded models, our framework offers a promising direction for enhancing knowledge integration in end-to-end S2S systems. We will release the code and dataset to support reproducibility and promote further research in this area.
Related papers
- WavChat: A Survey of Spoken Dialogue Models [66.82775211793547]
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain.
These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech.
Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems.
arXiv Detail & Related papers (2024-11-15T04:16:45Z) - OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
End-to-end GPT-based model OmniFlatten capable of effectively modeling complex behaviors inherent natural conversations with low latency.<n>Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full- spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z) - FCC: Fusing Conversation History and Candidate Provenance for Contextual
Response Ranking in Dialogue Systems [53.89014188309486]
We present a flexible neural framework that can integrate contextual information from multiple channels.
We evaluate our model on the MSDialog dataset widely used for evaluating conversational response ranking tasks.
arXiv Detail & Related papers (2023-03-31T23:58:28Z) - Topic-Aware Response Generation in Task-Oriented Dialogue with
Unstructured Knowledge Access [20.881612071473118]
We propose Topic-Aware Response Generation'' (TARG) to better integrate topical information in task-oriented dialogue.
TARG incorporates multiple topic-aware attention mechanisms to derive the importance weighting scheme over dialogue utterances and external knowledge sources.
arXiv Detail & Related papers (2022-12-10T22:32:28Z) - End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z) - Adapting Document-Grounded Dialog Systems to Spoken Conversations using
Data Augmentation and a Noisy Channel Model [46.93744191416991]
This paper summarizes our submission to Task 2 of the 10th Dialog System Technology Challenge (DSTC10) "Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations"
Similar to the previous year's iteration, the task consists of three subtasks: detecting whether a turn is knowledge seeking, selecting the relevant knowledge document and finally generating a grounded response.
Our best system achieved the 1st rank in the automatic and the 3rd rank in the human evaluation of the challenge.
arXiv Detail & Related papers (2021-12-16T12:51:52Z) - Attentive Contextual Carryover for Multi-Turn End-to-End Spoken Language
Understanding [14.157311972146692]
We propose a contextual E2E SLU model architecture that uses a multi-head attention mechanism over encoded previous utterances and dialogue acts.
Our method reduces average word and semantic error rates by 10.8% and 12.6%, respectively.
arXiv Detail & Related papers (2021-12-13T15:49:36Z) - Retrieval-Free Knowledge-Grounded Dialogue Response Generation with
Adapters [52.725200145600624]
We propose KnowExpert to bypass the retrieval process by injecting prior knowledge into the pre-trained language models with lightweight adapters.
Experimental results show that KnowExpert performs comparably with the retrieval-based baselines.
arXiv Detail & Related papers (2021-05-13T12:33:23Z) - Modelling Hierarchical Structure between Dialogue Policy and Natural
Language Generator with Option Framework for Task-oriented Dialogue System [49.39150449455407]
HDNO is an option framework for designing latent dialogue acts to avoid designing specific dialogue act representations.
We test HDNO on MultiWoz 2.0 and MultiWoz 2.1, the datasets on multi-domain dialogues, in comparison with word-level E2E model trained with RL, LaRL and HDSA.
arXiv Detail & Related papers (2020-06-11T20:55:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.