F-Actor: Controllable Conversational Behaviour in Full-Duplex Models
- URL: http://arxiv.org/abs/2601.11329v1
- Date: Fri, 16 Jan 2026 14:25:57 GMT
- Title: F-Actor: Controllable Conversational Behaviour in Full-Duplex Models
- Authors: Maike Züfle, Ondrej Klejch, Nicholas Sanders, Jan Niehues, Alexandra Birch, Tsz Kin Lam
- Abstract summary: We present the first open, instruction-following full-duplex conversational speech model that can be trained efficiently under typical academic resource constraints. Our model requires just 2,000 hours of data, without relying on large-scale pretraining or multi-stage optimization. Both the model and training code will be released to enable reproducible research on controllable full-duplex speech systems.
- Score: 70.48189107402145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spoken conversational systems require more than accurate speech generation to have human-like conversations: to feel natural and engaging, they must produce conversational behaviour that adapts dynamically to the context. Current spoken conversational systems, however, rarely allow such customization, limiting their naturalness and usability. In this work, we present the first open, instruction-following full-duplex conversational speech model that can be trained efficiently under typical academic resource constraints. By keeping the audio encoder frozen and finetuning only the language model, our model requires just 2,000 hours of data, without relying on large-scale pretraining or multi-stage optimization. The model can follow explicit instructions to control speaker voice, conversation topic, conversational behaviour (e.g., backchanneling and interruptions), and dialogue initiation. We propose a single-stage training protocol and systematically analyze design choices. Both the model and training code will be released to enable reproducible research on controllable full-duplex speech systems.
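The training recipe described in the abstract (freeze the audio encoder, finetune only the language model) can be illustrated with a minimal PyTorch sketch. All module names (`SpeechEncoder`, `DuplexLM`), layer sizes, and data shapes below are illustrative stand-ins, not the paper's actual components:

```python
# Minimal sketch, assuming a generic encoder + LM setup: SpeechEncoder and
# DuplexLM are illustrative stand-ins, not the components used in the paper.
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Stand-in for the pretrained audio encoder (kept frozen)."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(80, dim)  # e.g. log-mel frames -> hidden states

    def forward(self, feats):
        return self.proj(feats)

class DuplexLM(nn.Module):
    """Stand-in for the language model (the only finetuned part)."""
    def __init__(self, dim=512, vocab=1024):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, h):
        return self.head(self.backbone(h))

encoder, lm = SpeechEncoder(), DuplexLM()

# Freeze the encoder: no gradients, and eval mode so dropout/normalisation
# behave deterministically during finetuning.
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

# Only the LM's parameters are handed to the optimiser.
optimizer = torch.optim.AdamW(lm.parameters(), lr=1e-4)

feats = torch.randn(2, 100, 80)             # (batch, frames, mel bins), toy data
targets = torch.randint(0, 1024, (2, 100))  # toy next-token targets

with torch.no_grad():                       # encoder runs outside autograd
    h = encoder(feats)
logits = lm(h)
loss = nn.functional.cross_entropy(logits.reshape(-1, 1024), targets.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because only `lm.parameters()` carry gradients and optimizer state, memory and compute stay within academic budgets, which is plausibly what makes training on just 2,000 hours of data feasible.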
Related papers
- Covo-Audio Technical Report [61.09708870154148]
Covo-Audio, a 7B end-to-end large audio language model (LALM), directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Covo-Audio-Chat, a dialogue-oriented variant, demonstrates strong spoken conversational abilities.
arXiv Detail & Related papers (2026-02-10T14:31:11Z)
- PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models [33.33273575953341]
We introduce PersonaPlex, a duplex conversational speech model that incorporates hybrid system prompts. PersonaPlex is trained on a large-scale synthetic dataset of paired prompts and user-agent conversations. Experiments show that PersonaPlex achieves strong role-conditioned behavior, voice-conditioned speech, and natural conversational responsiveness.
arXiv Detail & Related papers (2026-01-14T07:47:46Z)
- DiscussLLM: Teaching Large Language Models When to Speak [9.441455921296301]
Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like text. We introduce *DiscussLLM*, a framework designed to bridge this gap by training models to proactively decide not just *what* to say but, critically, *when* to speak (a toy sketch of such a speak/wait decision appears after this list).
arXiv Detail & Related papers (2025-08-25T16:16:42Z)
- Think Before You Talk: Enhancing Meaningful Dialogue Generation in Full-Duplex Speech Language Models with Planning-Inspired Text Guidance [47.2016265294791]
Full-Duplex Speech Language Models (FD-SLMs) capture nuanced two-speaker dialogue patterns for human-like interactions. They face a critical challenge: their conversational abilities often degrade compared to pure-text conversation. We propose TurnGuide, a novel planning-inspired approach that mimics human conversational planning.
arXiv Detail & Related papers (2025-08-10T14:49:43Z)
- Chain-of-Thought Training for Open E2E Spoken Dialogue Systems [57.77235760292348]
End-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information. We propose a chain-of-thought (CoT) formulation to ensure that training on conversational data remains closely aligned with the underlying multimodal language model. Our method achieves an improvement of over 1.5 ROUGE-1 points over the baseline, successfully training spoken dialogue systems on publicly available human-human conversation datasets.
arXiv Detail & Related papers (2025-05-31T21:43:37Z)
- Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction [9.101978573666546]
Baichuan-Audio is an end-to-end audio large language model that seamlessly integrates audio understanding and generation. It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities.
arXiv Detail & Related papers (2025-02-24T15:16:34Z)
- WavChat: A Survey of Spoken Dialogue Models [66.82775211793547]
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain.
These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech.
Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems.
arXiv Detail & Related papers (2024-11-15T04:16:45Z)
- OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
We introduce OmniFlatten, an end-to-end GPT-based model capable of effectively modeling the complex behaviors inherent in natural conversations with low latency. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z)
- Pheme: Efficient and Conversational Speech Generation [52.34331755341856]
We introduce the Pheme model series that offers compact yet high-performing conversational TTS models.
It can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x while still matching the quality of autoregressive TTS models.
arXiv Detail & Related papers (2024-01-05T14:47:20Z)
- The Adapter-Bot: All-In-One Controllable Conversational Model [66.48164003532484]
We propose a dialogue model that uses a fixed backbone model such as DialoGPT and triggers on-demand dialogue skills via different adapters.
Depending on the skills, the model is able to process multiple knowledge types, such as text, tables, and empathetic responses.
We evaluate our model with automatic metrics, comparing it against existing state-of-the-art conversational models.
arXiv Detail & Related papers (2020-08-28T10:59:31Z)
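Several of the entries above (DiscussLLM in particular, but also the full-duplex models) hinge on a model deciding *when* to speak rather than only *what* to say. As a purely illustrative assumption, not the implementation of any listed paper, a minimal version of such a decision is a small classifier over a dialogue-state embedding:

```python
# Toy speak/wait decision sketch; an illustrative assumption only, not the
# method of DiscussLLM or any other paper listed above.
import torch
import torch.nn as nn

class TurnTakingHead(nn.Module):
    """Binary classifier over a dialogue-state embedding: 0 = WAIT, 1 = SPEAK."""
    def __init__(self, dim=256):
        super().__init__()
        self.cls = nn.Linear(dim, 2)

    def forward(self, context):  # context: (batch, dim)
        return self.cls(context)

head = TurnTakingHead()
context = torch.randn(1, 256)      # placeholder dialogue-state embedding
action = head(context).argmax(-1)  # decide before generating any tokens
print("SPEAK" if action.item() == 1 else "WAIT")
```

In a full-duplex system such as the one in the main paper, this kind of decision would run continuously over streaming audio, gating backchannels and interruptions as well as full turns.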