Chronological Thinking in Full-Duplex Spoken Dialogue Language Models
- URL: http://arxiv.org/abs/2510.05150v2
- Date: Wed, 08 Oct 2025 21:35:40 GMT
- Title: Chronological Thinking in Full-Duplex Spoken Dialogue Language Models
- Authors: Donghang Wu, Haoyang Zhang, Chen Chen, Tianyu Zhang, Fei Tian, Xuerui Yang, Gang Yu, Hexin Liu, Nana Hou, Yuchen Hu, Eng Siong Chng,
- Abstract summary: Chronological Thinking aims to improve response quality in full SDLMs.<n>No additional latency: once the user stops speaking, the agent halts thinking and begins speaking without further delay.<n>Results: Experiments demonstrate the effectiveness of chronological thinking through both objective metrics and human evaluations.
- Score: 66.84843878538207
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in spoken dialogue language models (SDLMs) reflect growing interest in shifting from turn-based to full-duplex systems, where the models continuously perceive user speech streams while generating responses. This simultaneous listening and speaking design enables real-time interaction and the agent can handle dynamic conversational behaviors like user barge-in. However, during the listening phase, existing systems keep the agent idle by repeatedly predicting the silence token, which departs from human behavior: we usually engage in lightweight thinking during conversation rather than remaining absent-minded. Inspired by this, we propose Chronological Thinking, a on-the-fly conversational thinking mechanism that aims to improve response quality in full-duplex SDLMs. Specifically, chronological thinking presents a paradigm shift from conventional LLM thinking approaches, such as Chain-of-Thought, purpose-built for streaming acoustic input. (1) Strictly causal: the agent reasons incrementally while listening, updating internal hypotheses only from past audio with no lookahead. (2) No additional latency: reasoning is amortized during the listening window; once the user stops speaking, the agent halts thinking and begins speaking without further delay. Experiments demonstrate the effectiveness of chronological thinking through both objective metrics and human evaluations show consistent improvements in response quality. Furthermore, chronological thinking robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.
Related papers
- Conversational Behavior Modeling Foundation Model With Multi-Level Perception [13.659870465634228]
We introduce framework models that predict intent and then reasons over conversational behaviors via a Graph-of-Thoughts (GoT)<n>GoT structures streaming predictions as an evolving graph, enabling a transformer to forecast the next speech act generate concise justifications for its decisions.<n>Experiments show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems.
arXiv Detail & Related papers (2026-02-11T17:32:52Z) - F-Actor: Controllable Conversational Behaviour in Full-Duplex Models [70.48189107402145]
We present first open, instruction-following full-stage conversational speech model that can be trained efficiently under typical academic resource constraints.<n>Our model requires just 2,000 hours of data, without relying on large-scale pretraining or multi-stage pretraining.<n>Both the model and training code will be released to enable reproducible research on controllable full-like controllable full-stage speech systems.
arXiv Detail & Related papers (2026-01-16T14:25:57Z) - The ICASSP 2026 HumDial Challenge: Benchmarking Human-like Spoken Dialogue Systems in the LLM Era [95.35748535806744]
We launch the first Human-like Spoken Dialogue Systems Challenge (HumDial) at ICASSP 2026.<n>This paper summarizes the dataset, track configurations, and the final results.
arXiv Detail & Related papers (2026-01-09T06:32:30Z) - Enabling Conversational Behavior Reasoning Capabilities in Full-Duplex Speech [15.41279444168073]
We introduce framework enabling reasoning over conversational behaviors by modeling this process as causal inference within a Graph ofThoughts (GoT)<n>We develop a hybrid corpus that pairs controllable, event-rich simulations with humanannotated rationales and real conversational speech.<n>GoT framework structures streaming predictions as an evolving graph, enabling a multimodal transformer to forecast the next speech act.
arXiv Detail & Related papers (2025-12-25T15:00:50Z) - FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction [49.83226596963294]
Speech-computer human interaction enables real-time spoken dialogue systems.<n>Modelling and benchmarking these models remains a fundamental challenge.<n>We introduce FLEXI, the first benchmark for full-human spoken interaction.
arXiv Detail & Related papers (2025-09-26T11:57:42Z) - Aligning Spoken Dialogue Models from User Interactions [55.192134724622235]
We propose a novel preference alignment framework to improve spoken dialogue models on realtime conversations from user interactions.<n>We create a dataset of more than 150,000 preference pairs from raw multi-turn speech conversations annotated with AI feedback.<n>Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.
arXiv Detail & Related papers (2025-06-26T16:45:20Z) - Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities [93.09944267871163]
FullDuplexBench is a benchmark that systematically evaluates key interactive behaviors.<n>By releasing our benchmark code we aim to advance spoken dialogue modeling and the development of more natural and engaging SDMs.
arXiv Detail & Related papers (2025-03-06T18:59:16Z) - Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection [24.71649541757314]
Short backchannel utterances such as "yeah" and "oh" play a crucial role in facilitating smooth and engaging dialogue.<n>This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection model.
arXiv Detail & Related papers (2024-10-21T11:57:56Z) - DUMA: a Dual-Mind Conversational Agent with Fast and Slow Thinking [12.71072798544731]
DUMA embodies a dual-mind mechanism through the utilization of two generative Large Language Models (LLMs) dedicated to fast and slow thinking respectively.
We have constructed a conversational agent to handle online inquiries in the real estate industry.
arXiv Detail & Related papers (2023-10-27T11:43:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.