DiffVoice: Text-to-Speech with Latent Diffusion
- URL: http://arxiv.org/abs/2304.11750v1
- Date: Sun, 23 Apr 2023 21:05:33 GMT
- Title: DiffVoice: Text-to-Speech with Latent Diffusion
- Authors: Zhijun Liu, Yiwei Guo, Kai Yu
- Abstract summary: We present DiffVoice, a novel text-to-speech model based on latent diffusion.
Subjective evaluations on LJSpeech and LibriTTS datasets demonstrate that our method beats the best publicly available systems in naturalness.
- Score: 18.150627638754923
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we present DiffVoice, a novel text-to-speech model based on
latent diffusion. We propose to first encode speech signals into a phoneme-rate
latent representation with a variational autoencoder enhanced by adversarial
training, and then jointly model the duration and the latent representation
with a diffusion model. Subjective evaluations on LJSpeech and LibriTTS
datasets demonstrate that our method beats the best publicly available systems
in naturalness. By adopting recent generative inverse problem solving
algorithms for diffusion models, DiffVoice achieves state-of-the-art
performance in text-based speech editing and zero-shot adaptation.
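As a rough illustration of what a "phoneme-rate" representation means in the abstract above, the sketch below collapses frame-level features to one vector per phoneme and expands them back using per-phoneme durations. It is not the DiffVoice encoder: the abstract describes a learned variational autoencoder, whereas this sketch assumes simple average pooling, and the function names `pool_to_phoneme_rate` and `expand_to_frame_rate` are hypothetical.

```python
# Hypothetical helpers, not taken from the DiffVoice paper or code.
# Assumption: average pooling stands in for the learned VAE encoder.

def pool_to_phoneme_rate(frames, durations):
    """Average frame-level feature vectors over each phoneme's span.

    frames: list of T feature vectors (lists of floats), one per frame.
    durations: list of N positive ints, frames per phoneme, summing to T.
    Returns a list of N vectors, one per phoneme.
    """
    if any(d <= 0 for d in durations):
        raise ValueError("every duration must be a positive integer")
    if sum(durations) != len(frames):
        raise ValueError("durations must sum to the number of frames")
    pooled = []
    start = 0
    for d in durations:
        segment = frames[start:start + d]
        # Column-wise mean over the frames that belong to this phoneme.
        pooled.append([sum(column) / d for column in zip(*segment)])
        start += d
    return pooled


def expand_to_frame_rate(latents, durations):
    """Repeat each phoneme-level vector for its duration in frames."""
    return [list(vec) for vec, d in zip(latents, durations) for _ in range(d)]


if __name__ == "__main__":
    frames = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]
    durations = [1, 2, 3]  # three phonemes covering six frames
    latents = pool_to_phoneme_rate(frames, durations)
    print(latents)
    print(len(expand_to_frame_rate(latents, durations)))
```

The sequence length at phoneme rate depends on the durations, which is one way to see why a model working at this rate has to account for duration together with the latent vectors.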
Related papers
- Diffusion-based Unsupervised Audio-visual Speech Enhancement [26.937216751657697]
This paper proposes a new unsupervised audio-visual speech enhancement (AVSE) approach.
It combines a diffusion-based audio-visual speech generative model with a non-negative matrix factorization (NMF) noise model.
Experimental results confirm that the proposed AVSE approach not only outperforms its audio-only counterpart but also generalizes better than a recent supervised generative AVSE method.
arXiv Detail & Related papers (2024-10-04T12:22:54Z)
- Sample-Efficient Diffusion for Text-To-Speech Synthesis [31.372486998377966]
It is based on a novel diffusion architecture that we call U-Audio Transformer (U-AT).
SESD achieves impressive results despite training on less than 1k hours of speech.
It synthesizes more intelligible speech than the state-of-the-art auto-regressive model, VALL-E, while using less than 2% of the training data.
arXiv Detail & Related papers (2024-09-01T20:34:36Z)
- DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation [72.85685916829321]
DiffSHEG is a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation with arbitrary length.
By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
arXiv Detail & Related papers (2024-01-09T11:38:18Z)
- uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models [57.71199494492223]
We propose a Unified Speech Enhancement and Editing (uSee) model with conditional diffusion models to handle various tasks at the same time in a generative manner.
Our experiments show that our proposed uSee model can achieve superior performance in both speech denoising and dereverberation compared to other related generative speech enhancement models.
arXiv Detail & Related papers (2023-10-02T04:36:39Z)
- Unsupervised speech enhancement with diffusion-based generative models [0.0]
We introduce an alternative approach that operates in an unsupervised manner, leveraging the generative power of diffusion models.
We develop a posterior sampling methodology for speech enhancement by combining the learnt clean speech prior with a noise model for speech signal inference.
We show promising results compared to a recent variational auto-encoder (VAE)-based unsupervised approach and a state-of-the-art diffusion-based supervised method.
arXiv Detail & Related papers (2023-09-19T09:11:31Z)
- Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- UnDiff: Unsupervised Voice Restoration with Unconditional Diffusion Model [1.0874597293913013]
UnDiff is a diffusion probabilistic model capable of solving various speech inverse tasks.
It can be adapted to different tasks including degradation inversion, neural vocoding, and source separation.
arXiv Detail & Related papers (2023-06-01T14:22:55Z)
- A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI [64.71397830291838]
Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction.
With the diffusion model as the most popular generative model, numerous works have attempted two active tasks: text to speech and speech enhancement.
This work conducts a survey on audio diffusion models, which is complementary to existing surveys.
arXiv Detail & Related papers (2023-03-23T15:17:15Z)
- SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers [50.90457644954857]
In this work, we apply diffusion models to approach sequence-to-sequence text generation.
We propose SeqDiffuSeq, a text diffusion model for sequence-to-sequence generation.
Experimental results show good performance on sequence-to-sequence generation in terms of text quality and inference time.
arXiv Detail & Related papers (2022-12-20T15:16:24Z)
- TransFusion: Transcribing Speech with Multinomial Diffusion [20.165433724198937]
We propose a new way to perform speech recognition using a diffusion model conditioned on pretrained speech features.
We demonstrate comparable performance to existing high-performing contrastive models on the LibriSpeech speech recognition benchmark.
We also propose new techniques for effectively sampling and decoding multinomial diffusion models.
arXiv Detail & Related papers (2022-10-14T10:01:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.