From Turn-Taking to Synchronous Dialogue: A Survey of Full-Duplex Spoken Language Models
- URL: http://arxiv.org/abs/2509.14515v1
- Date: Thu, 18 Sep 2025 01:00:58 GMT
- Title: From Turn-Taking to Synchronous Dialogue: A Survey of Full-Duplex Spoken Language Models
- Authors: Yuxuan Chen, Haoyuan Yu
- Abstract summary: Full-Duplex voice communication enables simultaneous listening and speaking with natural turn-taking, overlapping speech, and interruptions. This survey comprehensively reviews Full-Duplex Spoken Language Models (FD-SLMs). We identify fundamental challenges: synchronous data scarcity, architectural divergence, and evaluation gaps.
- Score: 12.741006204459637
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: True Full-Duplex (TFD) voice communication--enabling simultaneous listening and speaking with natural turn-taking, overlapping speech, and interruptions--represents a critical milestone toward human-like AI interaction. This survey comprehensively reviews Full-Duplex Spoken Language Models (FD-SLMs) in the LLM era. We establish a taxonomy distinguishing Engineered Synchronization (modular architectures) from Learned Synchronization (end-to-end architectures), and unify fragmented evaluation approaches into a framework encompassing Temporal Dynamics, Behavioral Arbitration, Semantic Coherence, and Acoustic Performance. Through comparative analysis of mainstream FD-SLMs, we identify fundamental challenges: synchronous data scarcity, architectural divergence, and evaluation gaps, providing a roadmap for advancing human-AI communication.
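To make the Engineered Synchronization side of the taxonomy concrete: a modular FD-SLM typically couples a voice-activity signal with an explicit arbitration policy that decides, frame by frame, who holds the floor. Below is a minimal Python sketch of such a frame-level arbiter; the `DuplexArbiter` class, its `patience` threshold, and the action labels are illustrative assumptions, not an implementation from the survey or any surveyed system.

```python
class DuplexArbiter:
    """Toy frame-level floor arbiter for a modular full-duplex pipeline.

    Yields the floor only after `patience` consecutive overlapping
    user-speech frames (a sustained barge-in), so brief backchannels
    do not interrupt the agent. All thresholds are illustrative.
    """

    def __init__(self, patience: int = 2):
        self.patience = patience   # frames of overlap tolerated before yielding
        self.overlap = 0           # consecutive overlapping frames seen so far
        self.speaking = False      # whether the agent currently holds the floor

    def step(self, user_active: bool, agent_wants_floor: bool) -> str:
        """Return the action for this frame: speak, yield, stop, or listen."""
        if self.speaking and user_active:
            self.overlap += 1
            if self.overlap >= self.patience:  # sustained barge-in: hand over
                self.speaking = False
                self.overlap = 0
                return "yield"
            return "speak"                     # brief overlap: keep the floor
        self.overlap = 0
        if self.speaking:
            if agent_wants_floor:
                return "speak"
            self.speaking = False              # agent finished its utterance
            return "stop"
        if agent_wants_floor and not user_active:
            self.speaking = True               # take the floor in a gap
            return "speak"
        return "listen"
```

In this sketch the arbitration logic is hand-engineered and separate from generation, which is exactly what distinguishes it from the Learned Synchronization family, where end-to-end models absorb these decisions into the generation process itself.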
Related papers
- Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback [82.70507055599093]
We present the first systematic study of preference learning for improving SDS quality in both multi-turn Chain-of-Thought and blockwise duplex models. Experiments show that single-reward RLAIF selectively improves its targeted metric, while joint multi-reward training yields consistent gains across semantic quality and audio naturalness.
arXiv Detail & Related papers (2026-01-27T00:55:14Z) - Multi-granularity Interactive Attention Framework for Residual Hierarchical Pronunciation Assessment [18.97451964522765]
We propose a novel residual hierarchical interactive method, HIA, that enables bidirectional modeling across granularities. We also propose a residual hierarchical structure to alleviate the feature-forgetting problem when modeling acoustic hierarchies. Our model comprehensively outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2026-01-05T02:43:04Z) - MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models [48.34642579013783]
MTR-DuplexBench is a novel benchmark for evaluating FDSLMs in multi-round settings. We show that MTR-DuplexBench provides comprehensive, turn-by-turn evaluation of FDSLMs across dialogue quality, conversational dynamics, instruction following, and safety.
arXiv Detail & Related papers (2025-11-13T12:50:04Z) - FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction [49.83226596963294]
Human-computer speech interaction enables real-time spoken dialogue systems. Modelling and benchmarking these models remains a fundamental challenge. We introduce FLEXI, the first benchmark for full-duplex human-LLM speech interaction.
arXiv Detail & Related papers (2025-09-26T11:57:42Z) - WaveMind: Towards a Conversational EEG Foundation Model Aligned to Textual and Visual Modalities [55.00677513249723]
EEG signals simultaneously encode both cognitive processes and intrinsic neural states. We map EEG signals and their corresponding modalities into a unified semantic space to achieve generalized interpretation. The resulting model demonstrates robust classification accuracy while supporting flexible, open-ended conversations.
arXiv Detail & Related papers (2025-09-26T06:21:51Z) - Aligning Spoken Dialogue Models from User Interactions [55.192134724622235]
We propose a novel preference alignment framework to improve spoken dialogue models on real-time conversations from user interactions. We create a dataset of more than 150,000 preference pairs from raw multi-turn speech conversations annotated with AI feedback. Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.
arXiv Detail & Related papers (2025-06-26T16:45:20Z) - DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations [48.17593420058064]
This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling. The proposed dual-resolution mechanism reduces the input frequency for the LLM to 5 Hz. Experimental results on Spoken Question Answering benchmarks demonstrate that DrVoice establishes new state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2025-06-11T02:57:22Z) - A Multi-view Discourse Framework for Integrating Semantic and Syntactic Features in Dialog Agents [0.0]
Multi-turn dialogue models aim to generate human-like responses by leveraging conversational context. Existing methods often neglect the interactions between these utterances or treat all of them as equally significant. This paper introduces a discourse-aware framework for response selection in retrieval-based dialogue systems.
arXiv Detail & Related papers (2025-04-12T04:22:18Z) - Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities [93.09944267871163]
Full-Duplex-Bench is a benchmark that systematically evaluates key interactive behaviors. By releasing our benchmark code, we aim to advance spoken dialogue modeling and the development of more natural and engaging SDMs.
arXiv Detail & Related papers (2025-03-06T18:59:16Z) - OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
OmniFlatten is an end-to-end GPT-based model capable of effectively modeling the complex behaviors inherent in natural conversations with low latency. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z)
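The survey's unified evaluation framework groups metrics into Temporal Dynamics, Behavioral Arbitration, Semantic Coherence, and Acoustic Performance. As a hedged illustration of the Temporal Dynamics axis only, the sketch below computes two simple statistics from per-channel speech segments given as `(start, end)` times in seconds; the function names and metric definitions are assumptions for illustration, not the survey's or any benchmark's official metrics.

```python
def overlap_ratio(user, agent):
    """Fraction of total agent speech time that overlaps user speech.

    `user` and `agent` are lists of (start, end) intervals in seconds.
    """
    total = sum(e - s for s, e in agent)
    if total == 0:
        return 0.0
    overlap = 0.0
    for s1, e1 in agent:
        for s2, e2 in user:
            # Length of the intersection of the two intervals, if any.
            overlap += max(0.0, min(e1, e2) - max(s1, s2))
    return overlap / total


def mean_response_latency(user, agent):
    """Mean gap between each user segment's end and the next agent onset.

    User turns with no later agent onset are skipped; barge-ins (agent
    starting during user speech) are ignored here for simplicity.
    """
    gaps = []
    for _, user_end in user:
        later_starts = [s for s, _ in agent if s >= user_end]
        if later_starts:
            gaps.append(min(later_starts) - user_end)
    return sum(gaps) / len(gaps) if gaps else float("nan")
```

For example, with `user = [(0.0, 2.0), (5.0, 6.0)]` and `agent = [(2.5, 4.0), (5.5, 7.0)]`, the agent overlaps the user for 0.5 s out of 3.0 s of agent speech, and responds to the first user turn after 0.5 s. Real benchmarks in this space pair such timing statistics with the behavioral, semantic, and acoustic axes rather than using timing alone.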
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.