Related papers: MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement

URL: http://arxiv.org/abs/2511.12074v2
Date: Wed, 19 Nov 2025 14:50:05 GMT
Title: MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement
Authors: Xinyue Yu, Youqing Fang, Pingyu Wu, Guoyang Ye, Wenbo Zhou, Weiming Zhang, Song Xiao
Abstract summary: We propose a novel framework called MF-Speech, which consists of two core components: MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder acts as a factor purifier, adopting a multi-objective optimization strategy to decompose the original speech signal into highly pure representations of content, timbre, and emotion. MF-SpeechGenerator functions as a conductor, achieving precise, composable and fine-grained control over these factors through dynamic fusion and Hierarchical Style Adaptive Normalization.
Score: 31.756885606945847
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generating expressive and controllable human speech is one of the core goals of generative artificial intelligence, but its progress has long been constrained by two fundamental challenges: the deep entanglement of speech factors and the coarse granularity of existing control mechanisms. To overcome these challenges, we have proposed a novel framework called MF-Speech, which consists of two core components: MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder acts as a factor purifier, adopting a multi-objective optimization strategy to decompose the original speech signal into highly pure and independent representations of content, timbre, and emotion. Subsequently, MF-SpeechGenerator functions as a conductor, achieving precise, composable and fine-grained control over these factors through dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN). Experiments demonstrate that in the highly challenging multi-factor compositional speech generation task, MF-Speech significantly outperforms current state-of-the-art methods, achieving a lower word error rate (WER=4.67%), superior style control (SECS=0.5685, Corr=0.68), and the highest subjective evaluation scores (nMOS=3.96, sMOS_emotion=3.86, sMOS_style=3.78). Furthermore, the learned discrete factors exhibit strong transferability, demonstrating their significant potential as a general-purpose speech representation.

Related papers

MIBURI: Towards Expressive Interactive Gesture Synthesis [62.45332399212876]
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Existing solutions for ECAs produce rigid, low-diversity motions that are unsuitable for human-like interaction. We present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue.
arXiv Detail & Related papers (2026-03-03T18:59:51Z)
AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation [60.02195766025208]
We present AR-Omni, a unified any-to-any model in the autoregressive paradigm without any expert decoders. AR-Omni supports autoregressive text and image generation, as well as streaming speech generation, all under a single Transformer decoder. We address three practical issues in unified AR modeling: modality imbalance via task-aware loss reweighting, visual fidelity via a lightweight token-level perceptual alignment loss for image tokens, and stability-creativity trade-offs via a finite-state decoding mechanism.
arXiv Detail & Related papers (2026-01-25T09:17:36Z)
VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency [28.98083807303608]
Speech-LLMs show strong performance in many applications, but their robustness is critically under-tested, especially to speech disfluency. This work investigates whether current Speech-LLMs can maintain performance when interacting with users who have speech impairments.
arXiv Detail & Related papers (2025-10-17T08:01:41Z)
What Makes a Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study [58.55905182336196]
Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. We investigate the role of speech tokenizer designs in LLM-centric SLMs, augmented by speech heads and speaker modeling. We introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens.
arXiv Detail & Related papers (2025-06-14T15:26:31Z)
PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control [20.873353104077857]
We introduce an approach centered on prompt-based emotion control. The proposed architecture incorporates emotion and intensity control across multi-speakers. We leverage large language models (LLMs) to manipulate speech prosody while preserving linguistic content.
arXiv Detail & Related papers (2025-01-10T12:10:30Z)
DDTSE: Discriminative Diffusion Model for Target Speech Extraction [62.422291953387955]
We introduce the Discriminative Diffusion model for Target Speech Extraction (DDTSE). We apply the same forward process as diffusion models and utilize a reconstruction loss similar to that of discriminative methods. We devise a two-stage training strategy to emulate the inference process during model training.
arXiv Detail & Related papers (2023-09-25T04:58:38Z)
Echotune: A Modular Extractor Leveraging the Variable-Length Nature of Speech in ASR Tasks [4.132793413136553]
We introduce Echo-MSA, a nimble module equipped with a variable-length attention mechanism. The proposed design captures the variable-length nature of speech and addresses the limitations of fixed-length attention.
arXiv Detail & Related papers (2023-09-14T14:51:51Z)
Let There Be Sound: Reconstructing High Quality Speech from Silent Videos [34.306490673301184]
The goal of this work is to reconstruct high quality speech from lip motions alone. A key challenge of lip-to-speech systems is the one-to-many mapping. We propose a novel lip-to-speech system that significantly improves the generation quality.
arXiv Detail & Related papers (2023-08-29T12:30:53Z)
SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts [108.04306136086807]
We present research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen. The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs.
arXiv Detail & Related papers (2023-06-03T22:35:27Z)
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation. We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices. TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
Adversarially learning disentangled speech representations for robust multi-factor voice conversion [39.91395314356084]
We propose a disentangled speech representation learning framework based on adversarial learning. Four speech representations characterizing content, timbre, rhythm and pitch are extracted, and further disentangled. Experimental results show that the proposed framework significantly improves the robustness of VC on multiple factors.
arXiv Detail & Related papers (2021-01-30T08:29:55Z)
Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with a joint training framework for robust end-to-end ASR. The GRF algorithm is used to dynamically combine the noisy and enhanced features. The proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
arXiv Detail & Related papers (2020-11-09T08:52:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.