MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning
- URL: http://arxiv.org/abs/2601.01568v2
- Date: Thu, 08 Jan 2026 08:43:41 GMT
- Title: MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning
- Authors: Chunyu Qiang, Jun Wang, Xiaopeng Wang, Kang Yin, Yuxin Guo,
- Abstract summary: MM-Sonate is a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities. To enable zero-shot voice cloning, we introduce a timbre injection mechanism that effectively decouples speaker identity from linguistic content. Empirical evaluations demonstrate that MM-Sonate establishes new state-of-the-art performance in joint generation benchmarks.
- Score: 18.636738208526676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Joint audio-video generation aims to synthesize synchronized multisensory content, yet current unified models struggle with fine-grained acoustic control, particularly for identity-preserving speech. Existing approaches either suffer from temporal misalignment due to cascaded generation or lack the capability to perform zero-shot voice cloning within a joint synthesis framework. In this work, we present MM-Sonate, a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities. Unlike prior works that rely on coarse semantic descriptions, MM-Sonate utilizes a unified instruction-phoneme input to enforce strict linguistic and temporal alignment. To enable zero-shot voice cloning, we introduce a timbre injection mechanism that effectively decouples speaker identity from linguistic content. Furthermore, addressing the limitations of standard classifier-free guidance in multimodal settings, we propose a noise-based negative conditioning strategy that utilizes natural noise priors to significantly enhance acoustic fidelity. Empirical evaluations demonstrate that MM-Sonate establishes new state-of-the-art performance in joint generation benchmarks, significantly outperforming baselines in lip synchronization and speech intelligibility, while achieving voice cloning fidelity comparable to specialized Text-to-Speech systems.
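To make the abstract's key mechanisms concrete, the sketch below illustrates (i) a conditional flow-matching training step in which a timbre embedding is injected alongside the instruction-phoneme condition, and (ii) guidance whose negative branch is conditioned on a natural-noise prior rather than the null condition of standard classifier-free guidance. The abstract does not specify MM-Sonate's interfaces, so the `model` signature, argument names, and guidance weight `w` are illustrative assumptions, not the authors' implementation.

```python
import torch


def flow_matching_loss(model, x1, phonemes, timbre_emb):
    """One conditional flow-matching training step (rectified-flow style).

    `x1` is a batch of clean audio/video latents; the condition is a phoneme
    sequence plus an injected timbre embedding. All interfaces are assumed.
    """
    x0 = torch.randn_like(x1)                        # Gaussian noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)    # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))         # broadcast over feature dims
    x_t = (1.0 - t_) * x0 + t_ * x1                  # point on the linear path
    v_target = x1 - x0                               # constant target velocity
    v_pred = model(x_t, t, cond=phonemes, timbre=timbre_emb)
    return torch.mean((v_pred - v_target) ** 2)


def guided_velocity(model, x_t, t, phonemes, timbre_emb, noise_prior, w=2.0):
    """Guidance whose negative branch uses a natural-noise prior, not a null token."""
    v_pos = model(x_t, t, cond=phonemes, timbre=timbre_emb)  # full condition
    v_neg = model(x_t, t, cond=noise_prior, timbre=None)     # noise-prior branch
    return v_neg + w * (v_pos - v_neg)                       # extrapolate toward positive
```

At inference, integrating `guided_velocity` from t = 0 to t = 1 (for example with a simple Euler loop) would replace the plain learned velocity field, with the noise-prior branch standing in for the unconditional branch of classifier-free guidance.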
Related papers
- DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech [8.537791317883576]
Zero-shot Text-to-Speech (TTS) aims to synthesize high-quality speech that mimics the voice of an unseen speaker using only a short reference sample. Recent approaches based on language models, diffusion, and flow matching have shown promising results in zero-shot TTS, but still suffer from slow inference and repetition artifacts. We introduce DiFlow-TTS, which is the first model to explore purely Discrete Flow Matching for speech synthesis.
arXiv Detail & Related papers (2025-09-11T17:16:52Z)
- Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion [16.19865417052239]
Discl-VC is a novel zero-shot voice conversion framework. It disentangles content and prosody information from self-supervised speech representations, and synthesizes the target speaker's voice through in-context learning.
arXiv Detail & Related papers (2025-05-30T07:04:23Z)
- UniSpeaker: A Unified Approach for Multimodality-driven Speaker Generation [66.49076386263509]
This paper introduces UniSpeaker, a unified approach for multimodality-driven speaker generation. We propose a unified voice aggregator based on KV-Former, applying soft contrastive loss to map diverse voice description modalities into a shared voice space. UniSpeaker is evaluated across five tasks using the MVC benchmark, and the experimental results demonstrate that UniSpeaker outperforms previous modality-specific models.
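As a rough illustration of mapping modality embeddings into a shared voice space with a contrastive objective, the snippet below shows a standard InfoNCE-style alignment loss; UniSpeaker's KV-Former aggregator and the "soft" weighting of its contrastive loss are not reproduced here, so this is only the hard-target variant for intuition, with all names assumed.

```python
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(desc_emb, voice_emb, temperature=0.07):
    """Align voice-description embeddings with voice embeddings (InfoNCE style)."""
    desc = F.normalize(desc_emb, dim=-1)              # unit-normalize both sides
    voice = F.normalize(voice_emb, dim=-1)
    logits = desc @ voice.t() / temperature           # pairwise cosine similarities
    targets = torch.arange(desc.size(0), device=desc.device)
    return F.cross_entropy(logits, targets)           # matched pairs lie on the diagonal
```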
arXiv Detail & Related papers (2025-01-11T00:47:29Z)
- High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units [69.06657692891447]
We propose a novel AVO method leveraging the learning objective of self-supervised discrete speech unit prediction.
Experimental results show that our proposed method achieves remarkable lip-speech synchronization and high speech quality.
arXiv Detail & Related papers (2023-06-29T15:02:22Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
A key novelty of the proposed method is the direct use of a self-supervised learning (SSL) model to obtain embedding vectors from speech representations trained on a large amount of data.
The disentangled embeddings enable better reproduction performance for unseen speakers and rhythm transfer conditioned on different speech samples.
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
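A minimal sketch of what such phoneme-level prosodic discretization could look like, assuming k-means over per-phoneme F0 and duration statistics; scikit-learn's KMeans and the toy feature layout are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row is one phoneme instance: [mean log-F0, duration in frames],
# which in practice would be extracted per phoneme via forced alignment.
phoneme_prosody = np.array([
    [5.1, 12.0],
    [5.4, 18.0],
    [4.9, 7.0],
    [5.6, 25.0],
], dtype=np.float32)

# Cluster the prosodic features into a small set of discrete labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(phoneme_prosody)

# The cluster index serves as an intuitive discrete prosody label per phoneme,
# which the synthesis model can be conditioned on at generation time.
prosody_labels = kmeans.predict(phoneme_prosody)
print(prosody_labels)
```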
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention [0.0]
We propose a variant of an attention-based text-to-speech system that can reproduce a target voice from a few seconds of reference speech.
Generalization to long utterances is realized using an energy-based attention mechanism known as Dynamic Convolution Attention.
We compare several implementations of voice cloning systems in terms of speech naturalness, speaker similarity, alignment consistency and ability to synthesize long utterances.
arXiv Detail & Related papers (2022-01-25T15:06:07Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After predicting the discrete symbol sequence, each target speech can be re-synthesized by feeding the predicted symbols to the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Advances in Speech Vocoding for Text-to-Speech with Continuous Parameters [2.6572330982240935]
This paper presents new techniques for a continuous vocoder, in which all features are continuous, yielding a flexible speech synthesis system.
A new continuous noise masking scheme based on phase distortion is proposed to eliminate the perceptual impact of residual noise.
Bidirectional long short-term memory (LSTM) and gated recurrent unit (GRU) networks are studied and applied to model the continuous parameters for more natural, human-like speech.
arXiv Detail & Related papers (2021-06-19T12:05:01Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)