LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems
- URL: http://arxiv.org/abs/2502.14145v2
- Date: Mon, 24 Feb 2025 19:08:11 GMT
- Title: LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems
- Authors: Hao Zhang, Weiwei Li, Rilin Chen, Vinay Kothapally, Meng Yu, Dong Yu
- Abstract summary: A semantic voice activity detection (VAD) module serves as a dialogue manager (DM) to efficiently manage turn-taking in full-duplex SDS. By processing speech in short intervals, the VAD enables real-time decision-making, while the core dialogue engine (CDE) is activated only for response generation.
- Score: 39.144526590642265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving full-duplex communication in spoken dialogue systems (SDS) requires real-time coordination between listening, speaking, and thinking. This paper proposes a semantic voice activity detection (VAD) module as a dialogue manager (DM) to efficiently manage turn-taking in full-duplex SDS. Implemented as a lightweight (0.5B) LLM fine-tuned on full-duplex conversation data, the semantic VAD predicts four control tokens to regulate turn-switching and turn-keeping, distinguishing between intentional and unintentional barge-ins while detecting query completion for handling user pauses and hesitations. By processing input speech in short intervals, the semantic VAD enables real-time decision-making, while the core dialogue engine (CDE) is only activated for response generation, reducing computational overhead. This design allows independent DM optimization without retraining the CDE, balancing interaction accuracy and inference efficiency for scalable, next-generation full-duplex SDS.
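The control flow the abstract describes — classify each short speech interval into one of four control tokens, and wake the expensive core dialogue engine only on a turn switch — can be sketched as below. This is a minimal illustration, not the paper's implementation: the abstract does not name the four tokens, so the token labels and the toy heuristic classifier here are hypothetical stand-ins for the fine-tuned 0.5B LLM.

```python
from enum import Enum

class ControlToken(Enum):
    # Hypothetical names for the four control tokens described in the abstract.
    TURN_KEEP = "turn-keep"                # user still speaking; system waits
    TURN_SWITCH = "turn-switch"            # query complete; system takes the turn
    INTENTIONAL_BARGE_IN = "barge-in"      # user interrupts system speech on purpose
    UNINTENTIONAL_BARGE_IN = "backchannel" # e.g. "uh-huh"; system keeps speaking

def semantic_vad(chunk: str) -> ControlToken:
    """Toy stand-in for the fine-tuned 0.5B LLM that classifies each
    short speech interval into one of the four control tokens."""
    if chunk.rstrip().endswith(("?", ".")):
        return ControlToken.TURN_SWITCH
    if chunk.strip().lower() in {"uh-huh", "mm-hmm"}:
        return ControlToken.UNINTENTIONAL_BARGE_IN
    return ControlToken.TURN_KEEP

def dialogue_manager(chunks: list[str]) -> int:
    """Run the VAD on every interval; activate the (expensive) core
    dialogue engine (CDE) only when a turn switch is detected.
    Returns the number of CDE activations."""
    cde_calls = 0
    for chunk in chunks:
        if semantic_vad(chunk) is ControlToken.TURN_SWITCH:
            cde_calls += 1  # CDE generates a response here
    return cde_calls
```

For example, `dialogue_manager(["what is", "uh-huh", "the weather today?"])` invokes the CDE once, at the completed query; the pause after "what is" and the backchannel are handled by the lightweight VAD alone, which is the source of the claimed computational savings.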
Related papers
- Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems [31.911085541071028]
We propose a low-latency architecture that enables listen-while-thinking and speak-while-thinking. Experiments on two spoken dialogue benchmarks demonstrate that DDTSR reduces response latency by 19%-51%.
arXiv Detail & Related papers (2026-02-26T17:39:56Z) - Covo-Audio Technical Report [61.09708870154148]
Covo-Audio, a 7B end-to-end LALM, directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Covo-Audio-Chat, a dialogue-oriented variant, demonstrates strong semantic spoken conversational abilities.
arXiv Detail & Related papers (2026-02-10T14:31:11Z) - Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback [82.70507055599093]
We present the first systematic study of preference learning for improving SDS quality in both multi-turn Chain-of-Thought and blockwise duplex models. Experiments show that single-reward RLAIF selectively improves its targeted metric, while joint multi-reward training yields consistent gains across semantic quality and audio naturalness.
arXiv Detail & Related papers (2026-01-27T00:55:14Z) - Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems [82.70507055599093]
We propose a Streaming Chain-of-Thought (CoT) framework for duplex SDS. We create intermediate targets: aligned user transcripts and system responses for each block. Experiments show that our approach produces more coherent and interpretable responses than existing duplex methods.
arXiv Detail & Related papers (2025-10-02T14:33:05Z) - Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities [93.09944267871163]
Full-Duplex-Bench is a benchmark that systematically evaluates key conversational behaviors of full-duplex spoken dialogue models.
We aim to advance spoken dialogue modeling and encourage the development of more interactive and natural dialogue systems.
arXiv Detail & Related papers (2025-03-06T18:59:16Z) - FlexDuo: A Pluggable System for Enabling Full-Duplex Capabilities in Speech Dialogue Systems [8.43958948419218]
We develop a flexible full-duplex control module that decouples duplex control from spoken dialogue systems. Inspired by human information-filtering mechanisms in conversations, we introduce an explicit Idle state. It reduces the false interruption rate by 24.9% and improves response accuracy by 7.6% compared to integrated full-duplex dialogue system baselines.
arXiv Detail & Related papers (2025-02-19T06:51:34Z) - SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions [48.02083833667388]
We present and evaluate SELMA, a Speech-Enabled Language Model for virtual Assistant interactions. We employ low-rank adaptation modules for parameter-efficient training of both the audio encoder and the Large Language Model.
arXiv Detail & Related papers (2025-01-31T18:30:36Z) - OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
OmniFlatten is an end-to-end GPT-based model capable of effectively modeling the complex behaviors inherent in natural conversations with low latency. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z) - Conversational Rubert for Detecting Competitive Interruptions in ASR-Transcribed Dialogues [0.6138671548064356]
A system that automatically classifies interruptions can be used in call centers, specifically in the tasks of customer satisfaction monitoring and agent monitoring.
We developed a text-based interruption classification model by preparing an in-house dataset consisting of ASR-transcribed customer support telephone dialogues in Russian.
arXiv Detail & Related papers (2024-07-20T17:25:53Z) - Hello Again! LLM-powered Personalized Agent for Long-term Dialogue [63.65128176360345]
We introduce a model-agnostic framework, the Long-term Dialogue Agent (LD-Agent). It incorporates three independently tunable modules dedicated to event perception, persona extraction, and response generation. The effectiveness, generality, and cross-domain capabilities of LD-Agent are empirically demonstrated.
arXiv Detail & Related papers (2024-06-09T21:58:32Z) - A Full-duplex Speech Dialogue Scheme Based On Large Language Models [23.994130020644842]
We present a generative dialogue system capable of operating in a full-duplex manner, allowing for seamless interaction.
The system generates tokens for inquiry responses and makes autonomous decisions to respond to, wait for, or interrupt the user.
arXiv Detail & Related papers (2024-05-29T20:05:46Z) - One model to rule them all? Towards End-to-End Joint Speaker Diarization and Speech Recognition [50.055765860343286]
This paper presents a novel framework for joint speaker diarization and automatic speech recognition.
The framework, named SLIDAR, can process arbitrary length inputs and can handle any number of speakers.
Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
arXiv Detail & Related papers (2023-10-02T23:03:30Z) - DiactTOD: Learning Generalizable Latent Dialogue Acts for Controllable Task-Oriented Dialogue Systems [15.087619144902776]
We present a novel end-to-end latent dialogue act model (DiactTOD) that represents dialogue acts in a latent space.
When pre-trained on a large corpus, DiactTOD is able to predict and control dialogue acts to generate controllable responses.
arXiv Detail & Related papers (2023-08-01T23:29:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.