Unit-Based Agent for Semi-Cascaded Full-Duplex Dialogue Systems
- URL: http://arxiv.org/abs/2601.20230v2
- Date: Thu, 29 Jan 2026 01:37:26 GMT
- Title: Unit-Based Agent for Semi-Cascaded Full-Duplex Dialogue Systems
- Authors: Haoyuan Yu, Yuxuan Chen, Minjie Cai
- Abstract summary: Full-duplex voice interaction is crucial for natural human-computer interaction. This framework decomposes complex dialogue into minimal conversational units. The system operates in a train-free, plug-and-play manner.
- Score: 17.54500572999039
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Full-duplex voice interaction is crucial for natural human-computer interaction. We present a framework that decomposes complex dialogue into minimal conversational units, enabling the system to process each unit independently and predict when to transition to the next. This framework is instantiated as a semi-cascaded full-duplex dialogue system built around a multimodal large language model, supported by auxiliary modules such as voice activity detection (VAD) and text-to-speech (TTS) synthesis. The resulting system operates in a train-free, plug-and-play manner. Experiments on the HumDial dataset demonstrate the effectiveness of our framework, which ranks second among all teams on the test set of the Human-like Spoken Dialogue Systems Challenge (Track 2: Full-Duplex Interaction). Code is available at the GitHub repository https://github.com/yu-haoyuan/fd-badcat.
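The semi-cascaded loop described in the abstract (VAD segments the user's speech into units, the LLM decides per unit whether to respond or keep listening, TTS speaks the reply) can be sketched as a toy state machine. This is a minimal illustration, not the authors' implementation: the class and method names (`SemiCascadedAgent`, `vad_segment`, `policy`) and the pause/question heuristics are invented stand-ins for the paper's VAD, multimodal LLM, and TTS modules.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ConversationalUnit:
    """A minimal conversational unit: one user speech segment plus the
    system's per-unit decision (names are illustrative)."""
    text: str
    action: str  # "respond", "wait", or "backchannel"

class SemiCascadedAgent:
    """Toy semi-cascaded loop: a VAD stand-in segments the transcript
    into units, a policy stand-in (for the multimodal LLM) decides per
    unit whether to speak, and a TTS stub renders the reply as text."""

    def __init__(self) -> None:
        self.history: List[ConversationalUnit] = []

    def vad_segment(self, transcript: str) -> List[str]:
        # Stand-in for VAD: treat '|' as a detected pause boundary.
        return [s.strip() for s in transcript.split("|") if s.strip()]

    def policy(self, unit_text: str) -> str:
        # Stand-in for the LLM's unit-level turn-taking decision.
        if unit_text.endswith("?"):
            return "respond"       # user yielded the turn with a question
        if unit_text.endswith("..."):
            return "wait"          # user still holds the floor
        return "backchannel"       # brief acknowledgement, keep listening

    def tts(self, text: str) -> str:
        # Stand-in for TTS synthesis: just tag the text as spoken output.
        return f"[speaking] {text}"

    def step(self, transcript: str) -> List[str]:
        """Process one chunk of user speech unit by unit and return
        whatever the agent chooses to say."""
        outputs: List[str] = []
        for seg in self.vad_segment(transcript):
            action = self.policy(seg)
            self.history.append(ConversationalUnit(seg, action))
            if action == "respond":
                outputs.append(self.tts(f"Answering: {seg}"))
            elif action == "backchannel":
                outputs.append(self.tts("mm-hmm"))
            # "wait": stay silent, let the user continue
        return outputs

agent = SemiCascadedAgent()
replies = agent.step("I was wondering... | can you help me?")
print(replies)
```

Because each unit is handled independently, the loop stays "plug-and-play": the VAD, policy, and TTS stubs could each be swapped for a real module without retraining the others, which is the property the abstract emphasizes.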
Related papers
- Covo-Audio Technical Report [61.09708870154148]
Covo-Audio, a 7B end-to-end LALM, directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Covo-Audio-Chat, a dialogue-oriented variant, demonstrates strong spoken conversational abilities.
arXiv Detail & Related papers (2026-02-10T14:31:11Z) - F-Actor: Controllable Conversational Behaviour in Full-Duplex Models [70.48189107402145]
We present the first open, instruction-following full-duplex conversational speech model that can be trained efficiently under typical academic resource constraints. Our model requires just 2,000 hours of data, without relying on large-scale or multi-stage pretraining. Both the model and training code will be released to enable reproducible research on controllable full-duplex speech systems.
arXiv Detail & Related papers (2026-01-16T14:25:57Z) - SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation [19.007557608856565]
SDialog is an MIT-licensed open-source Python toolkit for building and analyzing conversational agents. It unifies dialog generation, evaluation, and mechanistic interpretability into a single end-to-end framework. By coupling generation, evaluation, and interpretability in a dialog-centric architecture, SDialog enables researchers to build, benchmark, and understand conversational systems more systematically.
arXiv Detail & Related papers (2025-12-09T21:42:41Z) - Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities [93.09944267871163]
Full-Duplex-Bench is a benchmark that systematically evaluates key interactive behaviors. By releasing our benchmark code, we aim to advance spoken dialogue modeling and the development of more natural and engaging SDMs.
arXiv Detail & Related papers (2025-03-06T18:59:16Z) - OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios [45.78414948567598]
We propose leveraging synthetic data to enhance dialogue models across diverse scenarios. We introduce ShareChatX, the first comprehensive, large-scale dataset for spoken dialogue that spans diverse scenarios. We also explore critical aspects of training dialogue systems using synthetic data.
arXiv Detail & Related papers (2025-01-02T17:58:23Z) - OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
OmniFlatten is an end-to-end GPT-based model capable of effectively modeling the complex behaviors inherent in natural conversations with low latency. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z) - ChatPLUG: Open-Domain Generative Dialogue System with Internet-Augmented Instruction Tuning for Digital Human [76.62897301298699]
ChatPLUG is a Chinese open-domain dialogue system for digital human applications, instruction fine-tuned on a wide range of dialogue tasks in a unified internet-augmented format.
We show that ChatPLUG outperforms state-of-the-art Chinese dialogue systems in both automatic and human evaluation.
We deploy ChatPLUG in real-world applications such as smart speakers and instant messaging, with fast inference.
arXiv Detail & Related papers (2023-04-16T18:16:35Z) - Fusing task-oriented and open-domain dialogues in conversational agents [12.338220374261343]
The two dialogue modes can potentially be intertwined seamlessly in the same conversation, as is easily done by a friendly human assistant.
Our paper addresses this problem of fusing TODs and ODDs in multi-turn dialogues.
It features inter-mode contextual dependency, i.e., the dialogue turns from the two modes depend on each other.
arXiv Detail & Related papers (2021-09-09T09:48:26Z) - Transferable Dialogue Systems and User Simulators [17.106518400787156]
One of the difficulties in training dialogue systems is the lack of training data.
We explore the possibility of creating dialogue data through the interaction between a dialogue system and a user simulator.
We develop a modelling framework that can incorporate new dialogue scenarios through self-play between the two agents.
arXiv Detail & Related papers (2021-07-25T22:59:09Z) - Unsupervised Abstractive Dialogue Summarization for Tete-a-Tetes [49.901984490961624]
We propose SuTaT, the first unsupervised abstractive dialogue summarization model for tete-a-tetes.
SuTaT consists of a conditional generative module and two unsupervised summarization modules.
Experimental results show that SuTaT is superior on unsupervised dialogue summarization under both automatic and human evaluations.
arXiv Detail & Related papers (2020-09-15T03:27:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.