From Turn-Taking to Synchronous Dialogue: A Survey of Full-Duplex Spoken Language Models
- URL: http://arxiv.org/abs/2509.14515v1
- Date: Thu, 18 Sep 2025 01:00:58 GMT
- Title: From Turn-Taking to Synchronous Dialogue: A Survey of Full-Duplex Spoken Language Models
- Authors: Yuxuan Chen, Haoyuan Yu
- Abstract summary: Full-Duplex voice communication enables simultaneous listening and speaking with natural turn-taking, overlapping speech, and interruptions. This survey comprehensively reviews Full-Duplex Spoken Language Models (FD-SLMs). We identify fundamental challenges: synchronous data scarcity, architectural divergence, and evaluation gaps.
- Score: 12.741006204459637
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: True Full-Duplex (TFD) voice communication--enabling simultaneous listening and speaking with natural turn-taking, overlapping speech, and interruptions--represents a critical milestone toward human-like AI interaction. This survey comprehensively reviews Full-Duplex Spoken Language Models (FD-SLMs) in the LLM era. We establish a taxonomy distinguishing Engineered Synchronization (modular architectures) from Learned Synchronization (end-to-end architectures), and unify fragmented evaluation approaches into a framework encompassing Temporal Dynamics, Behavioral Arbitration, Semantic Coherence, and Acoustic Performance. Through comparative analysis of mainstream FD-SLMs, we identify fundamental challenges: synchronous data scarcity, architectural divergence, and evaluation gaps, providing a roadmap for advancing human-AI communication.
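To make the Engineered Synchronization side of the taxonomy concrete: a modular FD-SLM typically couples a voice-activity signal with an explicit arbitration policy that decides, frame by frame, who holds the floor. Below is a minimal Python sketch of such a frame-level arbiter; the `DuplexArbiter` class, its `patience` threshold, and the action labels are illustrative assumptions, not an implementation from the survey or any surveyed system.

```python
class DuplexArbiter:
    """Toy frame-level floor arbiter for a modular full-duplex pipeline.

    Yields the floor only after `patience` consecutive overlapping
    user-speech frames (a sustained barge-in), so brief backchannels
    do not interrupt the agent. All thresholds are illustrative.
    """

    def __init__(self, patience: int = 2):
        self.patience = patience   # frames of overlap tolerated before yielding
        self.overlap = 0           # consecutive overlapping frames seen so far
        self.speaking = False      # whether the agent currently holds the floor

    def step(self, user_active: bool, agent_wants_floor: bool) -> str:
        """Return the action for this frame: speak, yield, stop, or listen."""
        if self.speaking and user_active:
            self.overlap += 1
            if self.overlap >= self.patience:  # sustained barge-in: hand over
                self.speaking = False
                self.overlap = 0
                return "yield"
            return "speak"                     # brief overlap: keep the floor
        self.overlap = 0
        if self.speaking:
            if agent_wants_floor:
                return "speak"
            self.speaking = False              # agent finished its utterance
            return "stop"
        if agent_wants_floor and not user_active:
            self.speaking = True               # take the floor in a gap
            return "speak"
        return "listen"
```

In this sketch the arbitration logic is hand-engineered and separate from generation, which is exactly what distinguishes it from the Learned Synchronization family, where end-to-end models absorb these decisions into the generation process itself.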
Related papers
- Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback [82.70507055599093]
We present the first systematic study of preference learning for improving SDS quality in both multi-turn Chain-of-Thought and blockwise duplex models. Experiments show that single-reward RLAIF selectively improves its targeted metric, while joint multi-reward training yields consistent gains across semantic quality and audio naturalness.
arXiv Detail & Related papers (2026-01-27T00:55:14Z) - Multi-granularity Interactive Attention Framework for Residual Hierarchical Pronunciation Assessment [18.97451964522765]
We propose a novel residual hierarchical interactive method, HIA, that enables bidirectional modeling across granularities. We also propose a residual hierarchical structure to alleviate the feature-forgetting problem when modeling acoustic hierarchies. Our model comprehensively outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2026-01-05T02:43:04Z) - MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models [48.34642579013783]
MTR-DuplexBench is a novel benchmark for evaluating FDSLMs in multi-round settings. We show that MTR-DuplexBench provides comprehensive, turn-by-turn evaluation of FDSLMs across dialogue quality, conversational dynamics, instruction following, and safety.
arXiv Detail & Related papers (2025-11-13T12:50:04Z) - FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction [49.83226596963294]
Human-computer speech interaction enables real-time spoken dialogue systems. Modelling and benchmarking these models remains a fundamental challenge. We introduce FLEXI, the first benchmark for full-duplex human-LLM speech interaction.
arXiv Detail & Related papers (2025-09-26T11:57:42Z) - WaveMind: Towards a Conversational EEG Foundation Model Aligned to Textual and Visual Modalities [55.00677513249723]
EEG signals simultaneously encode both cognitive processes and intrinsic neural states. We map EEG signals and their corresponding modalities into a unified semantic space to achieve generalized interpretation. The resulting model demonstrates robust classification accuracy while supporting flexible, open-ended conversations.
arXiv Detail & Related papers (2025-09-26T06:21:51Z) - Aligning Spoken Dialogue Models from User Interactions [55.192134724622235]
We propose a novel preference alignment framework to improve spoken dialogue models on real-time conversations from user interactions. We create a dataset of more than 150,000 preference pairs from raw multi-turn speech conversations annotated with AI feedback. Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.
arXiv Detail & Related papers (2025-06-26T16:45:20Z) - DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations [48.17593420058064]
This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling. The proposed dual-resolution mechanism reduces the input frequency for the LLM to 5 Hz. Experimental results on Spoken Question Answering benchmarks demonstrate that DrVoice establishes new state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2025-06-11T02:57:22Z) - A Multi-view Discourse Framework for Integrating Semantic and Syntactic Features in Dialog Agents [0.0]
Multi-turn dialogue models aim to generate human-like responses by leveraging conversational context. Existing methods often neglect the interactions between these utterances or treat all of them as equally significant. This paper introduces a discourse-aware framework for response selection in retrieval-based dialogue systems.
arXiv Detail & Related papers (2025-04-12T04:22:18Z) - Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities [93.09944267871163]
Full-Duplex-Bench is a benchmark that systematically evaluates key interactive behaviors. By releasing our benchmark code, we aim to advance spoken dialogue modeling and the development of more natural and engaging SDMs.
arXiv Detail & Related papers (2025-03-06T18:59:16Z) - OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
OmniFlatten is an end-to-end GPT-based model capable of effectively modeling the complex behaviors inherent in natural conversations with low latency. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z)
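The survey's unified evaluation framework groups metrics into Temporal Dynamics, Behavioral Arbitration, Semantic Coherence, and Acoustic Performance. As a hedged illustration of the Temporal Dynamics axis only, the sketch below computes two simple statistics from per-channel speech segments given as `(start, end)` times in seconds; the function names and metric definitions are assumptions for illustration, not the survey's or any benchmark's official metrics.

```python
def overlap_ratio(user, agent):
    """Fraction of total agent speech time that overlaps user speech.

    `user` and `agent` are lists of (start, end) intervals in seconds.
    """
    total = sum(e - s for s, e in agent)
    if total == 0:
        return 0.0
    overlap = 0.0
    for s1, e1 in agent:
        for s2, e2 in user:
            # Length of the intersection of the two intervals, if any.
            overlap += max(0.0, min(e1, e2) - max(s1, s2))
    return overlap / total


def mean_response_latency(user, agent):
    """Mean gap between each user segment's end and the next agent onset.

    User turns with no later agent onset are skipped; barge-ins (agent
    starting during user speech) are ignored here for simplicity.
    """
    gaps = []
    for _, user_end in user:
        later_starts = [s for s, _ in agent if s >= user_end]
        if later_starts:
            gaps.append(min(later_starts) - user_end)
    return sum(gaps) / len(gaps) if gaps else float("nan")
```

For example, with `user = [(0.0, 2.0), (5.0, 6.0)]` and `agent = [(2.5, 4.0), (5.5, 7.0)]`, the agent overlaps the user for 0.5 s out of 3.0 s of agent speech, and responds to the first user turn after 0.5 s. Real benchmarks in this space pair such timing statistics with the behavioral, semantic, and acoustic axes rather than using timing alone.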
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.