Related papers: Enabling Conversational Behavior Reasoning Capabilities in Full-Duplex Speech

Enabling Conversational Behavior Reasoning Capabilities in Full-Duplex Speech

URL: http://arxiv.org/abs/2512.21706v1
Date: Thu, 25 Dec 2025 15:00:50 GMT
Title: Enabling Conversational Behavior Reasoning Capabilities in Full-Duplex Speech
Authors: Shuchang Pan, Siddharth Banerjee, Dhruv Hebbar, Siddhant Patel, Akshaj Gupta, Kan Jen Cheng, Hanjo Kim, Zeyi Austin Li, Martin Q. Ma, Tingle Li, Gopala Anumanchipalli, Jiachen Lian,
Abstract summary: We introduce framework enabling reasoning over conversational behaviors by modeling this process as causal inference within a Graph ofThoughts (GoT)<n>We develop a hybrid corpus that pairs controllable, event-rich simulations with humanannotated rationales and real conversational speech.<n>GoT framework structures streaming predictions as an evolving graph, enabling a multimodal transformer to forecast the next speech act.
Score: 15.41279444168073
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this causal pathway is key to building natural full-duplex interactive systems. We introduce a framework that enables reasoning over conversational behaviors by modeling this process as causal inference within a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a hybrid corpus that pairs controllable, event-rich simulations with human-annotated rationales and real conversational speech. The GoT framework structures streaming predictions as an evolving graph, enabling a multimodal transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems.

Related papers

Conversational Behavior Modeling Foundation Model With Multi-Level Perception [13.659870465634228]
We introduce framework models that predict intent and then reasons over conversational behaviors via a Graph-of-Thoughts (GoT)<n>GoT structures streaming predictions as an evolving graph, enabling a transformer to forecast the next speech act generate concise justifications for its decisions.<n>Experiments show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems.
arXiv Detail & Related papers (2026-02-11T17:32:52Z)
Covo-Audio Technical Report [61.09708870154148]
Covo-Audio, a 7B-end LALM, directly processes continuous audio inputs and generates audio outputs within a single unified architecture.<n>Covo-Audio-Chat, a dialogue-oriented variant, demonstrates semantic strong spoken conversational abilities.
arXiv Detail & Related papers (2026-02-10T14:31:11Z)
F-Actor: Controllable Conversational Behaviour in Full-Duplex Models [70.48189107402145]
We present first open, instruction-following full-stage conversational speech model that can be trained efficiently under typical academic resource constraints.<n>Our model requires just 2,000 hours of data, without relying on large-scale pretraining or multi-stage pretraining.<n>Both the model and training code will be released to enable reproducible research on controllable full-like controllable full-stage speech systems.
arXiv Detail & Related papers (2026-01-16T14:25:57Z)
Chronological Thinking in Full-Duplex Spoken Dialogue Language Models [66.84843878538207]
Chronological Thinking aims to improve response quality in full SDLMs.<n>No additional latency: once the user stops speaking, the agent halts thinking and begins speaking without further delay.<n>Results: Experiments demonstrate the effectiveness of chronological thinking through both objective metrics and human evaluations.
arXiv Detail & Related papers (2025-10-02T10:28:11Z)
MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance [66.74042564585942]
MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance.<n>Our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
arXiv Detail & Related papers (2025-10-01T04:32:37Z)
Unsupervised Mutual Learning of Discourse Parsing and Topic Segmentation in Dialogue [37.618612723025784]
In dialogue systems, discourse plays a crucial role in managing conversational focus and coordinating interactions.<n>It consists of two key structures: rhetorical structure and topic structure.<n>We introduce a unified representation that integrates rhetorical and topic structures, ensuring semantic consistency between them.<n>We propose an unsupervised mutual learning framework (UMLF) that jointly models rhetorical and topic structures, allowing them to mutually reinforce each other without requiring additional annotations.
arXiv Detail & Related papers (2024-05-30T08:10:50Z)
Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model [8.180382743037082]
This paper explores the potential of constructing an AI spoken dialogue system that "thinks how to respond" and "thinks how to speak" simultaneously.
arXiv Detail & Related papers (2023-09-20T01:48:27Z)
MindDial: Belief Dynamics Tracking with Theory-of-Mind Modeling for Situated Neural Dialogue Generation [62.44907105496227]
MindDial is a novel conversational framework that can generate situated free-form responses with theory-of-mind modeling. We introduce an explicit mind module that can track the speaker's belief and the speaker's prediction of the listener's belief. Our framework is applied to both prompting and fine-tuning-based models, and is evaluated across scenarios involving both common ground alignment and negotiation.
arXiv Detail & Related papers (2023-06-27T07:24:32Z)
Revisiting Conversation Discourse for Dialogue Disentanglement [88.3386821205896]
We propose enhancing dialogue disentanglement by taking full advantage of the dialogue discourse characteristics. We develop a structure-aware framework to integrate the rich structural features for better modeling the conversational semantic context. Our work has great potential to facilitate broader multi-party multi-thread dialogue applications.
arXiv Detail & Related papers (2023-06-06T19:17:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.