High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models
- URL: http://arxiv.org/abs/2309.15512v2
- Date: Mon, 18 Dec 2023 12:52:08 GMT
- Title: High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models
- Authors: Chunyu Qiang, Hao Li, Yixin Tian, Yi Zhao, Ying Zhang, Longbiao Wang,
Jianwu Dang
- Abstract summary: Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
- Score: 56.00939852727501
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-speech (TTS) methods have shown promising results in voice cloning,
but they require a large number of labeled text-speech pairs.
Minimally-supervised speech synthesis decouples TTS by combining two types of
discrete speech representations(semantic \& acoustic) and using two
sequence-to-sequence tasks to enable training with minimal supervision.
However, existing methods suffer from information redundancy and dimension
explosion in semantic representation, and high-frequency waveform distortion in
discrete acoustic representation. Autoregressive frameworks exhibit typical
instability and uncontrollability issues. And non-autoregressive frameworks
suffer from prosodic averaging caused by duration prediction models. To address
these issues, we propose a minimally-supervised high-fidelity speech synthesis
method, where all modules are constructed based on the diffusion models. The
non-autoregressive framework enhances controllability, and the duration
diffusion model enables diversified prosodic expression. Contrastive
Token-Acoustic Pretraining (CTAP) is used as an intermediate semantic
representation to solve the problems of information redundancy and dimension
explosion in existing semantic coding methods. Mel-spectrogram is used as the
acoustic representation. Both semantic and acoustic representations are
predicted by continuous variable regression tasks to solve the problem of
high-frequency fine-grained waveform distortion. Experimental results show that
our proposed method outperforms the baseline method. We provide audio samples
on our website.
Related papers
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued tokens based language modeling approach for text to speech synthesis (TTS)
MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition.
arXiv Detail & Related papers (2024-07-11T14:36:53Z) - Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic
Token Prediction [15.72317249204736]
We propose a novel text-to-speech (TTS) framework centered around a neural transducer.
Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages.
Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-01-03T02:03:36Z) - Adversarial Training of Denoising Diffusion Model Using Dual
Discriminators for High-Fidelity Multi-Speaker TTS [0.0]
The diffusion model is capable of generating high-quality data through a probabilistic approach.
It suffers from the drawback of slow generation speed due to the requirement of a large number of time steps.
We propose a speech synthesis model with two discriminators: a diffusion discriminator for learning the distribution of the reverse process and a spectrogram discriminator for learning the distribution of the generated data.
arXiv Detail & Related papers (2023-08-03T07:22:04Z) - Minimally-Supervised Speech Synthesis with Conditional Diffusion Model
and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z) - Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling speedup up to 21.4x than autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - Discretization and Re-synthesis: an alternative method to solve the
Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
By utilizing the synthesis model with the input of discrete symbols, after the prediction of discrete symbol sequence, each target speech could be re-synthesized.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.