Unit-Based Agent for Semi-Cascaded Full-Duplex Dialogue Systems
- URL: http://arxiv.org/abs/2601.20230v2
- Date: Thu, 29 Jan 2026 01:37:26 GMT
- Title: Unit-Based Agent for Semi-Cascaded Full-Duplex Dialogue Systems
- Authors: Haoyuan Yu, Yuxuan Chen, Minjie Cai
- Abstract summary: Full-duplex voice interaction is crucial for natural human-computer interaction. This framework decomposes complex dialogue into minimal conversational units. The system operates in a train-free, plug-and-play manner.
- Score: 17.54500572999039
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Full-duplex voice interaction is crucial for natural human-computer interaction. We present a framework that decomposes complex dialogue into minimal conversational units, enabling the system to process each unit independently and predict when to transition to the next. This framework is instantiated as a semi-cascaded full-duplex dialogue system built around a multimodal large language model, supported by auxiliary modules such as voice activity detection (VAD) and text-to-speech (TTS) synthesis. The resulting system operates in a train-free, plug-and-play manner. Experiments on the HumDial dataset demonstrate the effectiveness of our framework, which ranks second among all teams on the test set of the Human-like Spoken Dialogue Systems Challenge (Track 2: Full-Duplex Interaction). Code is available at the GitHub repository https://github.com/yu-haoyuan/fd-badcat.
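The semi-cascaded loop described in the abstract (VAD segments the user's speech into units, the LLM decides per unit whether to respond or keep listening, TTS speaks the reply) can be sketched as a toy state machine. This is a minimal illustration, not the authors' implementation: the class and method names (`SemiCascadedAgent`, `vad_segment`, `policy`) and the pause/question heuristics are invented stand-ins for the paper's VAD, multimodal LLM, and TTS modules.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ConversationalUnit:
    """A minimal conversational unit: one user speech segment plus the
    system's per-unit decision (names are illustrative)."""
    text: str
    action: str  # "respond", "wait", or "backchannel"

class SemiCascadedAgent:
    """Toy semi-cascaded loop: a VAD stand-in segments the transcript
    into units, a policy stand-in (for the multimodal LLM) decides per
    unit whether to speak, and a TTS stub renders the reply as text."""

    def __init__(self) -> None:
        self.history: List[ConversationalUnit] = []

    def vad_segment(self, transcript: str) -> List[str]:
        # Stand-in for VAD: treat '|' as a detected pause boundary.
        return [s.strip() for s in transcript.split("|") if s.strip()]

    def policy(self, unit_text: str) -> str:
        # Stand-in for the LLM's unit-level turn-taking decision.
        if unit_text.endswith("?"):
            return "respond"       # user yielded the turn with a question
        if unit_text.endswith("..."):
            return "wait"          # user still holds the floor
        return "backchannel"       # brief acknowledgement, keep listening

    def tts(self, text: str) -> str:
        # Stand-in for TTS synthesis: just tag the text as spoken output.
        return f"[speaking] {text}"

    def step(self, transcript: str) -> List[str]:
        """Process one chunk of user speech unit by unit and return
        whatever the agent chooses to say."""
        outputs: List[str] = []
        for seg in self.vad_segment(transcript):
            action = self.policy(seg)
            self.history.append(ConversationalUnit(seg, action))
            if action == "respond":
                outputs.append(self.tts(f"Answering: {seg}"))
            elif action == "backchannel":
                outputs.append(self.tts("mm-hmm"))
            # "wait": stay silent, let the user continue
        return outputs

agent = SemiCascadedAgent()
replies = agent.step("I was wondering... | can you help me?")
print(replies)
```

Because each unit is handled independently, the loop stays "plug-and-play": the VAD, policy, and TTS stubs could each be swapped for a real module without retraining the others, which is the property the abstract emphasizes.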
Related papers
- Covo-Audio Technical Report [61.09708870154148]
Covo-Audio, a 7B end-to-end LALM, directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Covo-Audio-Chat, a dialogue-oriented variant, demonstrates strong spoken conversational abilities.
arXiv Detail & Related papers (2026-02-10T14:31:11Z) - F-Actor: Controllable Conversational Behaviour in Full-Duplex Models [70.48189107402145]
We present the first open, instruction-following full-duplex conversational speech model that can be trained efficiently under typical academic resource constraints. Our model requires just 2,000 hours of data, without relying on large-scale or multi-stage pretraining. Both the model and training code will be released to enable reproducible research on controllable full-duplex speech systems.
arXiv Detail & Related papers (2026-01-16T14:25:57Z) - SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation [19.007557608856565]
SDialog is an MIT-licensed open-source Python toolkit for building and analyzing conversational agents. It unifies dialog generation, evaluation, and mechanistic interpretability into a single end-to-end framework. By coupling generation, evaluation, and interpretability in a dialog-centric architecture, SDialog enables researchers to build, benchmark, and understand conversational systems more systematically.
arXiv Detail & Related papers (2025-12-09T21:42:41Z) - Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities [93.09944267871163]
Full-Duplex-Bench is a benchmark that systematically evaluates key interactive behaviors. By releasing our benchmark code, we aim to advance spoken dialogue modeling and the development of more natural and engaging SDMs.
arXiv Detail & Related papers (2025-03-06T18:59:16Z) - OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios [45.78414948567598]
We propose leveraging synthetic data to enhance dialogue models across diverse scenarios. We introduce ShareChatX, the first comprehensive, large-scale dataset for spoken dialogue that spans diverse scenarios. We also explore critical aspects of training dialogue systems using synthetic data.
arXiv Detail & Related papers (2025-01-02T17:58:23Z) - OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
OmniFlatten is an end-to-end GPT-based model capable of effectively modeling the complex behaviors inherent in natural conversations with low latency. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z) - ChatPLUG: Open-Domain Generative Dialogue System with Internet-Augmented Instruction Tuning for Digital Human [76.62897301298699]
ChatPLUG is a Chinese open-domain dialogue system for digital human applications, instruction fine-tuned on a wide range of dialogue tasks in a unified internet-augmented format.
We show that ChatPLUG outperforms state-of-the-art Chinese dialogue systems in both automatic and human evaluation.
We deploy ChatPLUG in real-world applications such as smart speakers and instant messaging, with fast inference.
arXiv Detail & Related papers (2023-04-16T18:16:35Z) - Fusing task-oriented and open-domain dialogues in conversational agents [12.338220374261343]
The two dialogue modes can potentially be intertwined seamlessly in the same conversation, as is easily done by a friendly human assistant.
Our paper addresses this problem of fusing TODs and ODDs in multi-turn dialogues.
It features inter-mode contextual dependency, i.e., the dialogue turns from the two modes depend on each other.
arXiv Detail & Related papers (2021-09-09T09:48:26Z) - Transferable Dialogue Systems and User Simulators [17.106518400787156]
One of the difficulties in training dialogue systems is the lack of training data.
We explore the possibility of creating dialogue data through the interaction between a dialogue system and a user simulator.
We develop a modelling framework that can incorporate new dialogue scenarios through self-play between the two agents.
arXiv Detail & Related papers (2021-07-25T22:59:09Z) - Unsupervised Abstractive Dialogue Summarization for Tete-a-Tetes [49.901984490961624]
We propose SuTaT, the first unsupervised abstractive dialogue summarization model for tete-a-tetes.
SuTaT consists of a conditional generative module and two unsupervised summarization modules.
Experimental results show that SuTaT is superior on unsupervised dialogue summarization under both automatic and human evaluations.
arXiv Detail & Related papers (2020-09-15T03:27:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.