ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control
- URL: http://arxiv.org/abs/2406.01205v3
- Date: Wed, 04 Jun 2025 06:07:17 GMT
- Title: ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control
- Authors: Shengpeng Ji, Qian Chen, Wen Wang, Jialong Zuo, Minghui Fang, Ziyue Jiang, Hai Huang, Zehan Wang, Xize Cheng, Siqi Zheng, Zhou Zhao
- Abstract summary: ControlSpeech is a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style. We show that ControlSpeech exhibits comparable or state-of-the-art (SOTA) performance in terms of controllability, timbre similarity, audio quality, robustness, and generalizability.
- Score: 50.27383290553548
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style. Prior zero-shot TTS models only mimic the speaker's voice without further control and adjustment capabilities while prior controllable TTS models cannot perform speaker-specific voice generation. Therefore, ControlSpeech focuses on a more challenging task: a TTS system with controllable timbre, content, and style at the same time. ControlSpeech takes speech prompts, content prompts, and style prompts as inputs and utilizes bidirectional attention and mask-based parallel decoding to capture codec representations corresponding to timbre, content, and style in a discrete decoupling codec space. Moreover, we analyze the many-to-many issue in textual style control and propose the Style Mixture Semantic Density (SMSD) module, which is based on Gaussian mixture density networks, to resolve this problem. To facilitate empirical validations, we make available a new style controllable dataset called VccmDataset. Our experimental results demonstrate that ControlSpeech exhibits comparable or state-of-the-art (SOTA) performance in terms of controllability, timbre similarity, audio quality, robustness, and generalizability. The relevant code and demo are available at https://github.com/jishengpeng/ControlSpeech .
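The SMSD module is described only at a high level; the sketch below illustrates the underlying mixture-density-network idea, in which a style-prompt embedding parameterizes a Gaussian mixture from which diverse style vectors can be sampled. This is a minimal PyTorch illustration of the general technique, not the authors' implementation; all module names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class StyleMDN(nn.Module):
    """Minimal mixture-density head: maps a style-prompt embedding to a
    K-component diagonal Gaussian mixture over style vectors, so one text
    description can map to many plausible styles (the many-to-many issue)."""

    def __init__(self, in_dim=256, style_dim=128, n_components=4):
        super().__init__()
        self.n_components = n_components
        self.style_dim = style_dim
        self.proj = nn.Linear(in_dim, n_components * (1 + 2 * style_dim))

    def forward(self, h):
        # h: (batch, in_dim) pooled embedding of the style prompt
        p = self.proj(h).view(-1, self.n_components, 1 + 2 * self.style_dim)
        logit_pi = p[..., 0]                      # mixture logits
        mu = p[..., 1:1 + self.style_dim]         # component means
        log_sigma = p[..., 1 + self.style_dim:]   # component log-stddevs
        return logit_pi, mu, log_sigma

    def sample(self, h):
        logit_pi, mu, log_sigma = self.forward(h)
        k = torch.distributions.Categorical(logits=logit_pi).sample()  # pick a component
        idx = k.view(-1, 1, 1).expand(-1, 1, self.style_dim)
        mu_k = mu.gather(1, idx).squeeze(1)
        sigma_k = log_sigma.gather(1, idx).squeeze(1).exp()
        return mu_k + sigma_k * torch.randn_like(mu_k)  # reparameterized draw

mdn = StyleMDN()
style = mdn.sample(torch.randn(2, 256))  # two sampled style vectors, shape (2, 128)
```

Sampling from a mixture rather than regressing a single style vector is what allows one textual description to map to several plausible speaking styles.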
Related papers
- Voice Impression Control in Zero-Shot TTS [15.46515385197271]
We develop a voice impression control method in zero-shot text-to-speech. We use a low-dimensional vector to represent the intensities of various voice impression pairs. Both objective and subjective evaluations demonstrate our method's effectiveness in impression control.
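The abstract gives no implementation detail beyond the low-dimensional intensity vector; as a rough sketch of that interface (the impression pairs and dimensions below are invented), such a vector can be projected into a conditioning embedding for a zero-shot TTS decoder:

```python
import torch
import torch.nn as nn

# Hypothetical impression axes; the paper's actual pairs may differ.
IMPRESSION_PAIRS = ["dark-bright", "tense-relaxed", "weak-powerful"]

class ImpressionConditioner(nn.Module):
    """Projects a per-axis intensity vector in [-1, 1] into a conditioning
    embedding that a zero-shot TTS decoder could consume alongside the
    speaker embedding (this interface is illustrative only)."""

    def __init__(self, n_axes=len(IMPRESSION_PAIRS), cond_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_axes, cond_dim), nn.Tanh(), nn.Linear(cond_dim, cond_dim)
        )

    def forward(self, intensities):
        return self.mlp(intensities)

# Bright, slightly tense, fairly powerful:
cond = ImpressionConditioner()(torch.tensor([[0.8, -0.2, 0.5]]))
```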
arXiv Detail & Related papers (2025-06-06T02:34:29Z)
- Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens [31.575335190916995]
We introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types.
To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations.
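As a toy illustration of single-stream decoupling (the actual BiCodec design is not specified here; the quantizers, pooling, and dimensions below are placeholders), a codec of this kind exposes time-varying content tokens alongside a fixed-length set of global speaker tokens:

```python
import torch
import torch.nn as nn

class TwoStreamCodec(nn.Module):
    """Toy interface for a codec that decouples speech into two token types:
    time-varying semantic tokens (content) and a fixed-length set of global
    tokens (speaker/attribute information)."""

    def __init__(self, n_semantic=8192, n_global=4096, dim=256, n_global_tokens=32):
        super().__init__()
        self.encoder = nn.Conv1d(80, dim, kernel_size=3, padding=1)  # mel -> latent
        self.semantic_codebook = nn.Embedding(n_semantic, dim)
        self.global_codebook = nn.Embedding(n_global, dim)
        self.global_queries = nn.Parameter(torch.randn(n_global_tokens, dim))

    def encode(self, mel):                     # mel: (batch, 80, frames)
        z = self.encoder(mel).transpose(1, 2)  # (batch, frames, dim)
        # Nearest-codeword assignment stands in for a learned quantizer.
        sem_ids = torch.cdist(z, self.semantic_codebook.weight[None]).argmin(-1)
        pooled = z.mean(dim=1, keepdim=True) + self.global_queries  # crude pooling
        glob_ids = torch.cdist(pooled, self.global_codebook.weight[None]).argmin(-1)
        return sem_ids, glob_ids  # per-frame content ids, fixed-length global ids

codec = TwoStreamCodec()
sem, glob = codec.encode(torch.randn(1, 80, 200))  # sem: (1, 200), glob: (1, 32)
```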
arXiv Detail & Related papers (2025-03-03T16:23:10Z)
- IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System [6.686126079510178]
We introduce the IndexTTS system, which is mainly based on the XTTS and Tortoise models.
Specifically, in Chinese scenarios, we adopt a hybrid modeling method that combines characters and pinyin.
Compared with XTTS, it has achieved significant improvements in naturalness, content consistency, and zero-shot voice cloning.
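A hybrid character/pinyin front end can be sketched with the pypinyin library: characters prone to mispronunciation are spelled out as toned pinyin so pronunciation stays controllable at inference time. The set of "hard" characters below is a made-up placeholder, and this is an illustration of the idea rather than IndexTTS's actual front end:

```python
from pypinyin import lazy_pinyin, Style

HARD_CHARS = {"重", "行", "乐"}  # hypothetical polyphonic characters to spell out

def hybrid_tokenize(text: str) -> list[str]:
    # One toned pinyin per character, e.g. "重" -> "zhong4".
    pinyins = lazy_pinyin(text, style=Style.TONE3)
    # Keep ordinary characters as-is; replace hard ones with their pinyin.
    return [py if ch in HARD_CHARS else ch for ch, py in zip(text, pinyins)]

print(hybrid_tokenize("我喜欢音乐"))  # ['我', '喜', '欢', '音', 'yue4']
```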
arXiv Detail & Related papers (2025-02-08T10:23:20Z)
- Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control [14.145510487599932]
We propose a novel controllable text-to-speech (TTS) method with cross-lingual control capability.
We combine a TTS model trained on the target language with a description control model trained on another language, which maps input text descriptions to the conditional features of the TTS model.
Experiments on English and Japanese TTS demonstrate that our method achieves high naturalness and controllability for both languages.
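The key design point is that the description model and the TTS model communicate only through a conditional feature vector, so they need not share a training language. A minimal sketch of that interface follows; the particular control dimensions are assumptions:

```python
import torch
import torch.nn as nn

class DescriptionControl(nn.Module):
    """Illustrative only: maps a description embedding (from any language's
    text encoder) to the conditional features a separately trained TTS model
    consumes. Because the interface is just a feature vector, the description
    model and the TTS model can be trained on different languages."""

    def __init__(self, desc_dim=512, n_controls=3):  # e.g. pitch, speed, energy (assumed)
        super().__init__()
        self.head = nn.Linear(desc_dim, n_controls)

    def forward(self, desc_emb):
        return self.head(desc_emb)

controls = DescriptionControl()(torch.randn(1, 512))
# `controls` would then condition a TTS model trained on another language.
```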
arXiv Detail & Related papers (2024-09-26T01:08:09Z)
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables control over singer gender, vocal range, and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
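A range-melody decoupled pitch representation can be illustrated as follows: the F0 contour is split into a scalar register ("range") and a mean-normalized contour ("melody"), so a natural-language vocal-range request only needs to edit the scalar. This is a toy numpy sketch of the idea, not Prompt-Singer's exact representation:

```python
import numpy as np

def decouple_pitch(f0_hz: np.ndarray):
    """Split an F0 contour (Hz, 0 = unvoiced) into a scalar 'range'
    (mean log-F0, tied to the singer's register) and a 'melody' contour
    (log-F0 with the mean removed)."""
    voiced = f0_hz > 0
    log_f0 = np.zeros_like(f0_hz)
    log_f0[voiced] = np.log(f0_hz[voiced])
    range_ = log_f0[voiced].mean()                    # singer-dependent register
    melody = log_f0 - np.where(voiced, range_, 0.0)   # register-free contour
    return range_, melody

def recombine(range_, melody):
    voiced = melody != 0.0  # crude voicing mask, adequate for this sketch
    return np.where(voiced, np.exp(melody + range_), 0.0)

f0 = np.array([0.0, 220.0, 246.9, 261.6, 0.0])
rng, mel = decouple_pitch(f0)
f0_up = recombine(rng + np.log(2 ** (3 / 12)), mel)  # same melody, 3 semitones higher
```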
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models [51.529485094900934]
We propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
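Multi-stage prompt programming can be sketched as chained LLM calls: a first stage converts raw attribute labels into a factual summary, and a second stage rewrites it as a varied natural-language style description. The stage wording below is invented, and `ask_llm` stands in for any chat-model client:

```python
from typing import Callable

def generate_style_description(attrs: dict, ask_llm: Callable[[str], str]) -> str:
    # Stage 1: turn raw attribute labels into a factual one-line summary.
    summary = ask_llm(
        "Summarize these speech attributes in one sentence: "
        + ", ".join(f"{k}={v}" for k, v in attrs.items())
    )
    # Stage 2: rewrite the summary as a natural, varied style description.
    return ask_llm(
        "Rewrite as a natural description of a speaking style, "
        "varying the phrasing: " + summary
    )

# Usage with any LLM client wrapped as a str -> str function:
# desc = generate_style_description(
#     {"gender": "female", "pitch": "high", "emotion": "cheerful", "speed": "fast"},
#     ask_llm=my_llm_call,
# )
```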
arXiv Detail & Related papers (2023-08-28T09:06:32Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Controllable Speaking Styles Using a Large Language Model [13.642358232817342]
Text-to-Speech (TTS) models can generate multiple, prosodically-different renditions of the same target text.
Currently, controlling these models during inference typically requires finding an appropriate reference utterance.
Here, we give two demonstrations: control of speaking style, and prosody appropriate for a given dialogue context.
arXiv Detail & Related papers (2023-05-17T16:01:50Z)
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
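The clustering step is straightforward to sketch: per-phoneme F0 and duration features are normalized, and k-means assigns each phoneme a discrete prosody label that can later serve as a control token. The synthetic features, normalization, and cluster count below are illustrative choices, not the paper's exact recipe:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
f0 = rng.normal(180, 40, size=(5000, 1))    # per-phoneme mean F0 (Hz)
dur = rng.gamma(2.0, 40.0, size=(5000, 1))  # per-phoneme duration (ms)

feats = np.hstack([f0, dur])
feats = (feats - feats.mean(0)) / feats.std(0)  # z-score each dimension

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(feats)
labels = kmeans.labels_  # discrete prosody label per phoneme, usable as a control token

# Sorting clusters by mean F0 yields intuitively ordered labels (lowest..highest).
order = np.argsort(kmeans.cluster_centers_[:, 0])
```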
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
- VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion [77.50171525265056]
This paper proposes a novel multi-speaker Video-to-Speech (VTS) system based on cross-modal knowledge transfer from voice conversion (VC).
The Lip2Ind network can substitute the content encoder of VC to form a multi-speaker VTS system to convert silent video to acoustic units for reconstructing accurate spoken content.
arXiv Detail & Related papers (2022-02-18T08:58:45Z)
- Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram [21.652906261475533]
Cross-lingual voice conversion is a challenging problem due to significant mismatches of the phonetic set and the speech prosody of different languages.
We build upon the neural text-to-speech (TTS) model to design a new cross-lingual VC framework named FastSpeech-VC.
arXiv Detail & Related papers (2021-02-03T10:28:07Z)
- NAUTILUS: a Versatile Voice Cloning System [44.700803634034486]
NAUTILUS can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker.
It can clone unseen voices using untranscribed speech of target speakers on the basis of the backpropagation algorithm.
It achieves comparable quality with state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech.
arXiv Detail & Related papers (2020-05-22T05:00:20Z)