Make-A-Voice: Unified Voice Synthesis With Discrete Representation
- URL: http://arxiv.org/abs/2305.19269v1
- Date: Tue, 30 May 2023 17:59:26 GMT
- Title: Make-A-Voice: Unified Voice Synthesis With Discrete Representation
- Authors: Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Luping Liu,
Zhenhui Ye, Ziyue Jiang, Chao Weng, Zhou Zhao, Dong Yu
- Abstract summary: Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
- Score: 77.3998611565557
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Various applications of voice synthesis have been developed independently
despite the fact that they generate "voice" as output in common. In addition,
the majority of voice synthesis models currently rely on annotated audio data,
but it is crucial to scale them to self-supervised datasets in order to
effectively capture the wide range of acoustic variations present in human
voice, including speaker identity, emotion, and prosody. In this work, we
propose Make-A-Voice, a unified framework for synthesizing and manipulating
voice signals from discrete representations. Make-A-Voice leverages a
"coarse-to-fine" approach to model the human voice, which involves three
stages: 1) semantic stage: model high-level transformation between linguistic
content and self-supervised semantic tokens, 2) acoustic stage: introduce
varying control signals as acoustic conditions for semantic-to-acoustic
modeling, and 3) generation stage: synthesize high-fidelity waveforms from
acoustic tokens. Make-A-Voice offers notable benefits as a unified voice
synthesis framework: 1) Data scalability: the major backbone (i.e., acoustic
and generation stage) does not require any annotations, and thus the training
data could be scaled up. 2) Controllability and conditioning flexibility: we
investigate different conditioning mechanisms and effectively handle three
voice synthesis applications, including text-to-speech (TTS), voice conversion
(VC), and singing voice synthesis (SVS) by re-synthesizing the discrete voice
representations with prompt guidance. Experimental results demonstrate that
Make-A-Voice exhibits superior audio quality and style similarity compared with
competitive baseline models. Audio samples are available at
https://Make-A-Voice.github.io
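
To make the abstract's coarse-to-fine pipeline concrete, below is a minimal Python sketch of how the semantic, acoustic, and generation stages could compose for TTS and voice conversion under prompt guidance. All class, method, and parameter names here are illustrative assumptions for this summary; the paper does not expose this API.

```python
# Minimal sketch of the three-stage coarse-to-fine pipeline described above.
# All names (SemanticStage, AcousticStage, GenerationStage, VoicePrompt, ...)
# are illustrative assumptions, not the authors' actual implementation.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VoicePrompt:
    """Reference utterance supplying speaker/style as acoustic-token context."""
    acoustic_tokens: List[int]


class SemanticStage:
    """Stage 1 (semantic): map linguistic content to self-supervised
    semantic tokens (e.g., units from a speech SSL model)."""
    def text_to_semantic(self, text: str) -> List[int]:
        raise NotImplementedError("sequence-to-sequence model in practice")


class AcousticStage:
    """Stage 2 (acoustic): translate semantic tokens into acoustic codec
    tokens, conditioned on control signals such as a speaker prompt or F0."""
    def semantic_to_acoustic(
        self,
        semantic_tokens: List[int],
        prompt: Optional[VoicePrompt] = None,
        f0: Optional[List[float]] = None,
    ) -> List[int]:
        raise NotImplementedError("token language model in practice")


class GenerationStage:
    """Stage 3 (generation): decode acoustic tokens into a waveform."""
    def tokens_to_waveform(self, acoustic_tokens: List[int]) -> List[float]:
        raise NotImplementedError("neural codec decoder / unit vocoder in practice")


def synthesize_tts(text: str, semantic: SemanticStage, acoustic: AcousticStage,
                   vocoder: GenerationStage, prompt: Optional[VoicePrompt] = None):
    """TTS: text -> semantic tokens -> acoustic tokens -> waveform."""
    sem = semantic.text_to_semantic(text)
    ac = acoustic.semantic_to_acoustic(sem, prompt=prompt)
    return vocoder.tokens_to_waveform(ac)


def convert_voice(source_semantic_tokens: List[int], acoustic: AcousticStage,
                  vocoder: GenerationStage, target_prompt: VoicePrompt):
    """VC: keep the source utterance's semantic tokens and re-synthesize them
    with the target speaker's prompt, using only stages 2-3."""
    ac = acoustic.semantic_to_acoustic(source_semantic_tokens, prompt=target_prompt)
    return vocoder.tokens_to_waveform(ac)
```

In this reading, voice conversion bypasses the semantic stage entirely, which is consistent with the abstract's data-scalability claim: the acoustic and generation stages form the annotation-free backbone, and SVS would add an explicit melody/F0 condition in stage 2.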
Related papers
- Articulatory Phonetics Informed Controllable Expressive Speech Synthesis [14.157690391680745]
We explore expressive speech synthesis through the lens of articulatory phonetics.
We record a high-quality speech dataset named GTR-Voice, featuring 20 Chinese sentences articulated by a professional voice actor.
We verify the framework and GTR annotations through automatic classification and listening tests, and demonstrate precise controllability on two fine-tuned expressive TTS models.
arXiv Detail & Related papers (2024-06-15T05:37:04Z)
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis [63.18764165357298]
Style transfer for out-of-domain singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles.
StyleSinger is the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples.
Evaluations in zero-shot style transfer show that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples.
arXiv Detail & Related papers (2023-12-17T15:26:16Z)
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of single-speaker singing voice synthesis.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
- Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information [51.02264447897833]
This paper presents an end-to-end high-quality singing voice synthesis (SVS) system that uses bidirectional encoder representation from Transformers (BERT) derived semantic embeddings.
The proposed SVS system produces higher-quality singing voice, outperforming VISinger.
arXiv Detail & Related papers (2023-08-31T16:12:01Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- Enhancing audio quality for expressive Neural Text-to-Speech [8.199224915764672]
We present a set of techniques that can be leveraged to enhance the signal quality of a highly-expressive voice without the use of additional data.
We show that, when combined, these techniques closed the gap in perceived naturalness between the baseline system and recordings by 39% in terms of MUSHRA scores for an expressive celebrity voice.
arXiv Detail & Related papers (2021-08-13T14:32:39Z)
- Audiovisual Speech Synthesis using Tacotron2 [14.206988023567828]
We propose and compare two audiovisual speech synthesis systems for 3D face models.
AVTacotron2 is an end-to-end text-to-audiovisual speech synthesizer based on the Tacotron2 architecture.
The second audiovisual speech synthesis system is modular, where acoustic speech is synthesized from text using the traditional Tacotron2.
arXiv Detail & Related papers (2020-08-03T02:45:06Z)