FlexDuo: A Pluggable System for Enabling Full-Duplex Capabilities in Speech Dialogue Systems
- URL: http://arxiv.org/abs/2502.13472v2
- Date: Thu, 29 May 2025 03:32:21 GMT
- Title: FlexDuo: A Pluggable System for Enabling Full-Duplex Capabilities in Speech Dialogue Systems
- Authors: Borui Liao, Yulong Xu, Jiao Ou, Kaiyuan Yang, Weihua Jian, Pengfei Wan, Di Zhang,
- Abstract summary: We develop a flexible full-duplex control module that decouples duplex control from spoken dialogue systems. Inspired by human information-filtering mechanisms in conversations, we introduce an explicit Idle state. It reduces the false interruption rate by 24.9% and improves response accuracy by 7.6% compared to integrated full-duplex dialogue system baselines.
- Score: 8.43958948419218
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Full-Duplex Speech Dialogue Systems (Full-Duplex SDS) have significantly enhanced the naturalness of human-machine interaction by enabling real-time bidirectional communication. However, existing approaches face challenges such as difficulties in independent module optimization and contextual noise interference due to highly coupled architectural designs and oversimplified binary state modeling. This paper proposes FlexDuo, a flexible full-duplex control module that decouples duplex control from spoken dialogue systems through a plug-and-play architectural design. Furthermore, inspired by human information-filtering mechanisms in conversations, we introduce an explicit Idle state. On one hand, the Idle state filters redundant noise and irrelevant audio to enhance dialogue quality. On the other hand, it establishes a semantic integrity-based buffering mechanism, reducing the risk of mutual interruptions while ensuring accurate response transitions. Experimental results on the Fisher corpus demonstrate that FlexDuo reduces the false interruption rate by 24.9% and improves response accuracy by 7.6% compared to integrated full-duplex dialogue system baselines. It also outperforms voice activity detection (VAD) controlled baseline systems in both Chinese and English dialogue quality. The proposed modular architecture and state-based dialogue model provide a novel technical pathway for building flexible and efficient duplex dialogue systems.
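The abstract's core mechanism, an explicit Idle state plus a semantic-integrity buffer gating turn transitions, can be sketched as a small state machine. This is an illustrative reconstruction, not the paper's actual implementation; the `DuplexController` class, its state names, and the per-chunk `relevant`/`complete` flags are all hypothetical stand-ins for the module's learned decisions.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()    # filtering noise and dialogue-irrelevant audio
    LISTEN = auto()  # user holds the floor; chunks are buffered
    SPEAK = auto()   # turn handed to the dialogue system

class DuplexController:
    """Hypothetical three-state duplex control loop (sketch only)."""

    def __init__(self):
        self.state = State.IDLE
        self.buffer = []  # chunks held until the utterance is semantically complete

    def step(self, chunk, relevant, complete):
        """Process one audio chunk and return an action string.

        relevant: does this chunk belong to the dialogue (vs. noise)?
        complete: is the buffered utterance semantically complete?
        """
        if not relevant:
            if self.state is State.LISTEN and self.buffer:
                # Mid-utterance noise: hold the buffer rather than interrupt
                return "hold"
            self.state = State.IDLE
            return "discard"  # Idle state filters irrelevant audio outright
        # Relevant speech: buffer it and wait for semantic completion
        self.state = State.LISTEN
        self.buffer.append(chunk)
        if complete:
            utterance = " ".join(self.buffer)
            self.buffer = []
            self.state = State.SPEAK  # accurate response transition
            return f"respond:{utterance}"
        return "buffer"
```

The point of the buffer is the claimed trade-off: incomplete utterances are never forwarded, which should reduce false interruptions, while completion-triggered handoff keeps response transitions timely.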
Related papers
- Covo-Audio Technical Report [61.09708870154148]
Covo-Audio, a 7B end-to-end LALM, directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Covo-Audio-Chat, a dialogue-oriented variant, demonstrates strong semantic spoken conversational abilities.
arXiv Detail & Related papers (2026-02-10T14:31:11Z) - Unit-Based Agent for Semi-Cascaded Full-Duplex Dialogue Systems [17.54500572999039]
Full-duplex voice interaction is a process of natural human-computer interaction. This framework synthesises complex dialogue into minimal conversational units. The system operates in a training-free, plug-and-play manner.
arXiv Detail & Related papers (2026-01-28T04:00:37Z) - Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems [82.70507055599093]
We propose a Streaming Chain-of-Thought (CoT) framework for full-duplex SDS. We create intermediate targets: aligned user transcripts and system responses for each block. Experiments show that our approach produces more coherent and interpretable responses than existing duplex methods.
arXiv Detail & Related papers (2025-10-02T14:33:05Z) - FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction [49.83226596963294]
Full-duplex human-computer speech interaction enables real-time spoken dialogue systems. Modeling and benchmarking these systems remains a fundamental challenge. We introduce FLEXI, the first benchmark for full-duplex human-LLM spoken interaction.
arXiv Detail & Related papers (2025-09-26T11:57:42Z) - CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching [79.0241611035794]
CoVoMix2 is a framework for zero-shot multi-talker dialogue generation. It predicts mel-spectrograms from multi-stream transcriptions using a flow-matching-based generative model. Our approach achieves state-of-the-art performance, outperforming strong baselines like MoonCast and Sesame in speech quality, speaker consistency, and inference speed.
arXiv Detail & Related papers (2025-06-01T07:51:45Z) - Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities [93.09944267871163]
Full-Duplex-Bench is a benchmark that systematically evaluates key conversational behaviors.
We aim to advance spoken dialogue modeling and encourage the development of more interactive and natural dialogue systems.
arXiv Detail & Related papers (2025-03-06T18:59:16Z) - LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems [39.144526590642265]
A voice activity detection (VAD) module efficiently manages turn-taking for the dialogue manager (DM) in full-duplex SDS.
By processing speech in short intervals, the VAD enables real-time decision-making, while the core dialogue engine (CDE) is only activated for response generation.
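The division of labor described above, a lightweight VAD running on short frames with the heavyweight core dialogue engine (CDE) invoked only at end-of-turn, can be sketched as a simple loop. The function, its parameters, and the silence-counting heuristic are illustrative assumptions, not the cited paper's method.

```python
# Hypothetical VAD-gated dialogue loop: a cheap VAD decision runs on every
# short frame; the expensive CDE is called only when enough trailing silence
# signals that the user's turn is over.

def run_dialogue(frames, vad, cde, end_of_turn_frames=3):
    """frames: iterable of audio frames.
    vad(frame) -> bool, True if the frame contains speech.
    cde(utterance_frames) -> response text.
    Returns the list of generated responses."""
    responses, utterance, silence = [], [], 0
    for frame in frames:
        if vad(frame):
            utterance.append(frame)
            silence = 0
        elif utterance:
            silence += 1
            if silence >= end_of_turn_frames:  # enough silence: turn is over
                responses.append(cde(utterance))  # activate the CDE once
                utterance, silence = [], 0
    if utterance:  # flush an utterance that ends the stream without silence
        responses.append(cde(utterance))
    return responses
```

Because the CDE fires once per turn rather than once per frame, per-frame latency is bounded by the VAD alone, which is what makes real-time decision-making feasible.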
arXiv Detail & Related papers (2025-02-19T23:15:13Z) - OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
OmniFlatten is an end-to-end GPT-based model capable of effectively modeling the complex behaviors inherent in natural conversations with low latency. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z) - Duplex Conversation: Towards Human-like Interaction in Spoken Dialogue
System [120.70726465994781]
A multimodal spoken dialogue system enables telephone-based agents to interact with customers like a human.
We deploy Duplex Conversation in Alibaba intelligent customer service and share lessons learned in production.
Online A/B experiments show that the proposed system can significantly reduce response latency by 50%.
arXiv Detail & Related papers (2022-05-30T12:41:23Z) - Audio-visual multi-channel speech separation, dereverberation and
recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z) - Smoothing Dialogue States for Open Conversational Machine Reading [70.83783364292438]
We propose an effective gating strategy by smoothing the two dialogue states in only one decoder and bridge decision making and question generation.
Experiments on the OR-ShARC dataset show the effectiveness of our method, which achieves new state-of-the-art results.
arXiv Detail & Related papers (2021-08-28T08:04:28Z) - Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework by formulating video-grounded dialogue tasks as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z) - Multi-Domain Dialogue Acts and Response Co-Generation [34.27525685962274]
We propose a neural co-generation model that generates dialogue acts and responses concurrently.
Our model achieves favorable improvements over several state-of-the-art models in both automatic and human evaluations.
arXiv Detail & Related papers (2020-04-26T12:21:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.