Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems
- URL: http://arxiv.org/abs/2510.02066v1
- Date: Thu, 02 Oct 2025 14:33:05 GMT
- Title: Chain-of-Thought Reasoning in Streaming Full-Duplex End-to-End Spoken Dialogue Systems
- Authors: Siddhant Arora, Jinchuan Tian, Hayato Futami, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe
- Abstract summary: We propose a Streaming Chain-of-Thought (CoT) framework for Duplex SDS. We create intermediate targets for each block: aligned user transcripts and system responses. Experiments show that our approach produces more coherent and interpretable responses than existing duplex methods.
- Score: 82.70507055599093
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most end-to-end (E2E) spoken dialogue systems (SDS) rely on voice activity detection (VAD) for turn-taking, but VAD fails to distinguish between pauses and turn completions. Duplex SDS models address this by predicting output continuously, including silence tokens, thus removing the need for explicit VAD. However, they often have complex dual-channel architectures and lag behind cascaded models in semantic reasoning. To overcome these challenges, we propose SCoT: a Streaming Chain-of-Thought (CoT) framework for Duplex SDS, alternating between processing fixed-duration user input and generating responses in a blockwise manner. Using frame-level alignments, we create intermediate targets for each block: aligned user transcripts and system responses. Experiments show that our approach produces more coherent and interpretable responses than existing duplex methods while supporting lower-latency and overlapping interactions compared to turn-by-turn systems.
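The blockwise alternation described in the abstract can be pictured as a simple control loop. The sketch below is a minimal illustration under stated assumptions: `decode_block`, `BlockOutput`, and the silence-token handling are hypothetical stand-ins, since the actual SCoT model is a single E2E network that decodes the aligned transcript, response text, and speech tokens jointly.

```python
from dataclasses import dataclass

@dataclass
class BlockOutput:
    user_transcript: str   # intermediate CoT target: user speech in this block, transcribed
    response_text: str     # intermediate CoT target: response text aligned to this block
    speech_tokens: list    # response audio tokens, or silence tokens while listening

def decode_block(audio_block: bytes, history: list) -> BlockOutput:
    """Stand-in for the model's blockwise CoT decoding step.

    A real SCoT-style model would decode, in order, the aligned user
    transcript, the aligned response text, and the response speech tokens,
    all conditioned on the dialogue history, emitting silence tokens when
    it should keep listening. Here we emit silence so the loop is runnable.
    """
    return BlockOutput(user_transcript="", response_text="", speech_tokens=["<silence>"] * 8)

def stream_dialogue(audio_stream):
    """Alternate between ingesting one fixed-duration block and speaking."""
    history = []
    for block in audio_stream:      # fixed-duration chunks of user audio
        out = decode_block(block, history)
        history.append(out)         # intermediate CoT targets stay in context
        yield out.speech_tokens     # emitted immediately (full duplex)

# Demo with three dummy audio blocks:
for tokens in stream_dialogue([b"", b"", b""]):
    print(tokens)
```

Keeping the intermediate CoT targets in the running context is what lets such a model reason in text while still streaming audio block by block.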
Related papers
- Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback [82.70507055599093]
We present the first systematic study of preference learning for improving SDS quality in both multi-turn Chain-of-Thought and blockwise duplex models. Experiments show that single-reward RLAIF selectively improves its targeted metric, while joint multi-reward training yields consistent gains across semantic quality and audio naturalness.
arXiv Detail & Related papers (2026-01-27T00:55:14Z)
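As a rough illustration of the joint multi-reward idea in the entry above, the sketch below combines per-axis AI-feedback scores into one training reward. The judge functions and weights are hypothetical stand-ins, not the paper's actual reward models.

```python
from typing import Callable, Sequence

def joint_reward(response: str,
                 judges: Sequence[Callable[[str], float]],
                 weights: Sequence[float]) -> float:
    """Weighted sum of per-axis AI-feedback scores."""
    return sum(w * judge(response) for judge, w in zip(judges, weights))

# Stand-in judges: semantic quality and audio naturalness would normally
# be scored by learned or LLM-based evaluators, not toy heuristics.
semantic_judge = lambda r: min(len(r.split()) / 10.0, 1.0)
naturalness_judge = lambda r: 1.0

print(joint_reward("sounds good, see you at noon",
                   [semantic_judge, naturalness_judge], [0.5, 0.5]))
```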
- FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems [25.6510200528785]
Existing benchmarks lack metrics for evaluating model performance in full-duplex (FD) scenes. This paper assesses the ability of full-duplex spoken dialogue systems (FDSDS) to handle user interruptions, manage delays, and maintain robustness in challenging scenarios, using novel metrics. We applied our benchmark to three open-source FDSDS (Moshi, Freeze-omni, and VITA-1.5), using over 40 hours of generated speech and 1,200 simulated conversations with interruptions.
arXiv Detail & Related papers (2025-07-25T07:51:22Z)
- Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios [33.271537268488316]
Serialized Output Training (SOT) addresses the practical needs of both streaming and offline automatic speech recognition (ASR) applications. Our approach focuses on balancing latency and accuracy, catering to real-time captioning and summarization requirements.
arXiv Detail & Related papers (2025-06-17T05:46:38Z)
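For readers unfamiliar with SOT, the sketch below shows the standard serialization idea: overlapping talker transcripts are flattened into one target sequence separated by a speaker-change token. The `<sc>` token and first-in-first-out ordering follow the common SOT recipe; this paper's exact token set and streaming variant may differ.

```python
def serialize_sot(segments):
    """segments: (start_time_sec, transcript) pairs, one per talker turn."""
    ordered = sorted(segments, key=lambda seg: seg[0])  # FIFO by start time
    return " <sc> ".join(text for _, text in ordered)

# Two overlapping talkers become a single serialized ASR target:
print(serialize_sot([(0.0, "how are you doing"), (1.2, "pretty well thanks")]))
# -> "how are you doing <sc> pretty well thanks"
```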
- Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities [93.09944267871163]
Full-Duplex-Bench is a benchmark that systematically evaluates key interactive behaviors. By releasing our benchmark code, we aim to advance spoken dialogue modeling and the development of more natural and engaging SDMs.
arXiv Detail & Related papers (2025-03-06T18:59:16Z)
- LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems [39.144526590642265]
A voice activity detection (VAD) module efficiently manages dialogue manager (DM) turn-taking in full-duplex SDS. By processing speech in short intervals, the VAD enables real-time decision-making, while the core dialogue engine (CDE) is activated only for response generation.
arXiv Detail & Related papers (2025-02-19T23:15:13Z)
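The interval-based control flow in the entry above can be sketched as a small loop in which a lightweight VAD screens short chunks and the heavier CDE fires only after enough trailing silence. The VAD stub, chunk granularity, and silence threshold below are hypothetical placeholders.

```python
def vad_is_speech(chunk: bytes) -> bool:
    """Stand-in for a real VAD; treats non-empty chunks as speech."""
    return len(chunk) > 0

def run_dialogue(chunks, silence_limit: int = 3):
    buffered, trailing_silence = [], 0
    for chunk in chunks:                    # e.g. 20 ms frames
        if vad_is_speech(chunk):
            buffered.append(chunk)
            trailing_silence = 0
        else:
            trailing_silence += 1
        if buffered and trailing_silence >= silence_limit:
            yield cde_respond(buffered)     # CDE activated only here
            buffered, trailing_silence = [], 0

def cde_respond(speech_frames):
    return f"response to {len(speech_frames)} speech frames"

# Speech followed by a pause triggers exactly one CDE call:
print(list(run_dialogue([b"x", b"x", b"", b"", b""])))
```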
- FlexDuo: A Pluggable System for Enabling Full-Duplex Capabilities in Speech Dialogue Systems [8.43958948419218]
We develop a flexible full-duplex control module that decouples duplex control from the spoken dialogue system. Inspired by human information-filtering mechanisms in conversations, we introduce an explicit Idle state. It reduces the false interruption rate by 24.9% and improves response accuracy by 7.6% compared to integrated full-duplex dialogue system baselines.
arXiv Detail & Related papers (2025-02-19T06:51:34Z)
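One way to picture an explicit Idle state is as a three-state controller, shown below. The state names and transition rules are illustrative guesses at the idea (Idle filters audio that warrants neither listening nor speaking), not FlexDuo's actual design.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()      # ignore background audio; hold no floor
    LISTEN = auto()    # user holds the floor; buffer their speech
    SPEAK = auto()     # system holds the floor; emit a response

def step(state: State, user_speaking: bool, response_ready: bool) -> State:
    if state is State.IDLE:
        return State.LISTEN if user_speaking else State.IDLE
    if state is State.LISTEN:
        if not user_speaking and response_ready:
            return State.SPEAK
        return State.LISTEN
    # State.SPEAK: yield the floor if the user barges in, else keep talking
    return State.LISTEN if user_speaking else State.SPEAK

state = State.IDLE
for speaking, ready in [(True, False), (False, True), (True, False)]:
    state = step(state, speaking, ready)
    print(state)   # LISTEN -> SPEAK -> LISTEN (barge-in handled)
```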
- One model to rule them all? Towards End-to-End Joint Speaker Diarization and Speech Recognition [50.055765860343286]
This paper presents a novel framework for joint speaker diarization and automatic speech recognition.
The framework, named SLIDAR, can process arbitrary-length inputs and handle any number of speakers.
Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
arXiv Detail & Related papers (2023-10-02T23:03:30Z)
- A Sequence-to-Sequence Approach to Dialogue State Tracking [17.81139775400199]
Seq2Seq-DU formalizes dialogue state tracking as a sequence-to-sequence problem.
It can jointly model intents, slots, and slot values.
It can effectively deal with categorical and non-categorical slots, and unseen schemas.
arXiv Detail & Related papers (2020-11-18T21:42:44Z)
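The sequence-to-sequence formulation in the entry above amounts to linearizing the dialogue state into a target string that an encoder-decoder learns to generate from the dialogue history and schema text. The bracketed format below is illustrative, not Seq2Seq-DU's exact scheme.

```python
def linearize_state(intent: str, slots: dict) -> str:
    """Flatten intent, slots, and values into one decodable target string."""
    parts = [f"[intent] {intent}"]
    parts += [f"[slot] {name} = {value}" for name, value in slots.items()]
    return " ".join(parts)

src = "user: book a table for two in rome tonight"
tgt = linearize_state("book_restaurant", {"people": "two", "city": "rome"})
print(tgt)  # "[intent] book_restaurant [slot] people = two [slot] city = rome"
# An encoder-decoder trained to map src -> tgt can copy values from the
# input, which is one way non-categorical slots and unseen schemas can
# be handled.
```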
- FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
arXiv Detail & Related papers (2020-10-27T09:21:03Z)
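The fragment-fusion idea in the entry above can be sketched as cross-attention: source-content features (queries) attend over target-speaker frames (keys/values), so the output keeps the source's phonetic timing while being stitched from the target speaker's voice. The single-head NumPy version below is a simplification of the paper's architecture.

```python
import numpy as np

def fuse_fragments(src_content: np.ndarray, tgt_frames: np.ndarray) -> np.ndarray:
    """src_content: (T_src, d) source features; tgt_frames: (T_tgt, d)."""
    d = src_content.shape[1]
    scores = src_content @ tgt_frames.T / np.sqrt(d)     # (T_src, T_tgt)
    scores -= scores.max(axis=1, keepdims=True)          # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)              # rows sum to 1
    return attn @ tgt_frames                             # (T_src, d) fused fragments

rng = np.random.default_rng(0)
out = fuse_fragments(rng.normal(size=(50, 64)), rng.normal(size=(80, 64)))
print(out.shape)  # (50, 64): source timing, target-speaker content
```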
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring a tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reductions on overlapped speech constructed using either simulation or replaying of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.