WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
- URL: http://arxiv.org/abs/2106.09660v2
- Date: Sat, 19 Jun 2021 01:47:11 GMT
- Title: WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
- Authors: Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi,
Najim Dehak, William Chan
- Abstract summary: WaveGrad 2 is a non-autoregressive generative model for text-to-speech synthesis.
It can generate high fidelity audio, approaching the performance of a state-of-the-art neural TTS system.
- Score: 80.60577805727624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces WaveGrad 2, a non-autoregressive generative model for
text-to-speech synthesis. WaveGrad 2 is trained to estimate the gradient of the
log conditional density of the waveform given a phoneme sequence. The model
takes an input phoneme sequence, and through an iterative refinement process,
generates an audio waveform. This contrasts with the original WaveGrad vocoder,
which conditions on mel-spectrogram features generated by a separate model.
The iterative refinement process starts from Gaussian noise, and through a
series of refinement steps (e.g., 50 steps), progressively recovers the audio
sequence. WaveGrad 2 offers a natural way to trade off inference speed against
sample quality by adjusting the number of refinement steps.
Experiments show that the model can generate high fidelity audio, approaching
the performance of a state-of-the-art neural TTS system. We also report various
ablation studies over different model configurations. Audio samples are
available at https://wavegrad.github.io/v2.
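The iterative refinement described above follows the standard denoising-diffusion recipe: a network trained to estimate the (scaled) score of the noised waveform distribution is applied repeatedly, walking back from Gaussian noise toward audio. Below is a minimal sketch of such a sampler, assuming a hypothetical `model(y, phonemes, alpha_bar)` that returns the predicted noise; the linear-beta schedule, waveform length, and step count are illustrative choices, not values from the paper.

```python
import numpy as np

def sample_waveform(model, phonemes, num_steps=50, wav_len=24000, seed=0):
    """Iterative refinement sketch: start from Gaussian noise and denoise
    for num_steps steps. Fewer steps -> faster inference, lower fidelity.
    `model` is a hypothetical callable returning the predicted noise; the
    schedule below is a generic linear-beta choice, not the paper's."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.05, num_steps)       # assumed noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    y = rng.standard_normal(wav_len)                 # y_N ~ N(0, I)
    for n in reversed(range(num_steps)):
        eps_hat = model(y, phonemes, alpha_bars[n])  # predicted noise (scaled score)
        # DDPM posterior mean: strip the predicted noise component.
        y = (y - betas[n] / np.sqrt(1.0 - alpha_bars[n]) * eps_hat) / np.sqrt(alphas[n])
        if n > 0:                                    # no extra noise at the final step
            y += np.sqrt(betas[n]) * rng.standard_normal(wav_len)
    return y                                         # refined waveform y_0
```

Dropping `num_steps` from 50 to a handful reuses the same trained network unchanged, which is exactly the speed/quality knob the abstract describes.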
Related papers
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching; a sampling sketch follows this entry.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
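As context for the rectified-flow approach: flow matching of this kind learns a velocity field along near-straight noise-to-data paths, so sampling reduces to Euler integration of an ODE. A minimal sketch under those assumptions; `velocity`, `video_feats`, and the step count are hypothetical, not taken from the paper.

```python
import numpy as np

def rectified_flow_sample(velocity, video_feats, shape, num_steps=25, seed=0):
    """Euler-integrate dx/dt = v(x, t, cond) from noise (t = 0) to data (t = 1).
    `velocity` is a hypothetical learned field; the straighter the flow,
    the fewer steps this loop needs."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                     # x_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt, video_feats)  # one Euler step along the flow
    return x                                           # x_1: generated audio
```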
- DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation [25.968115316199246]
This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform.
Our model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one, as illustrated in the sketch after this entry.
Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems.
arXiv Detail & Related papers (2023-10-02T17:42:22Z)
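One way to picture the overlapping-frame loop: each frame is produced by a full sampling pass conditioned on the tail of the previous frame, and the overlap region is blended. A minimal sketch; `frame_sampler`, the overlap length, and the linear cross-fade are illustrative assumptions, not the paper's exact conditioning mechanism.

```python
import numpy as np

def generate_overlapping(frame_sampler, num_frames, overlap=600):
    """Autoregressive frame loop sketch: frame k is sampled conditioned on
    the last `overlap` samples of frame k-1, then blended in the overlap.
    `frame_sampler(context)` (hypothetical) runs one diffusion sampling
    pass and returns a fixed-length frame as a 1-D array."""
    audio = frame_sampler(context=None)                  # first frame, no history
    fade = np.linspace(0.0, 1.0, overlap)                # cross-fade ramp (assumed)
    for _ in range(num_frames - 1):
        frame = frame_sampler(context=audio[-overlap:])  # condition on previous tail
        head, tail = frame[:overlap], frame[overlap:]
        audio[-overlap:] = (1.0 - fade) * audio[-overlap:] + fade * head
        audio = np.concatenate([audio, tail])
    return audio
```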
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis [77.06890315052563]
We propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audio from unconstrained talking videos with low latency.
Experiments show that our model achieves a $19.76\times$ speedup for audio generation compared with the current autoregressive model on input sequences of 3 seconds.
arXiv Detail & Related papers (2022-07-08T10:10:39Z)
- RAVE: A variational autoencoder for fast and high-quality neural audio synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48 kHz audio signals while simultaneously running 20 times faster than real time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z)
- A Generative Model for Raw Audio Using Transformer Architectures [4.594159253008448]
This paper proposes a novel way of doing audio synthesis at the waveform level using Transformer architectures.
We propose a deep neural network for generating waveforms, similar to WaveNet (van den Oord et al., 2016).
Our approach outperforms a widely used WaveNet architecture by up to 9% on a similar dataset for predicting the next step.
arXiv Detail & Related papers (2021-06-30T13:05:31Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis [25.234945748885348]
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.
The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop.
Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2020-11-06T19:30:07Z)
- WaveGrad: Estimating Gradients for Waveform Generation [55.405580817560754]
WaveGrad is a conditional model for waveform generation which estimates gradients of the data density; the objective is written out after this entry.
It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram.
We find that it can generate high fidelity audio samples using as few as six iterations.
arXiv Detail & Related papers (2020-09-02T17:44:10Z)
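For reference, the quantity estimated here (and by WaveGrad 2 above, with phoneme rather than mel-spectrogram conditioning) is the score of the noised waveform distribution. A sketch of the standard denoising objective in this model family, with notation assumed rather than copied from the papers:

```latex
% \tilde{y}: noised waveform, x: conditioning signal (mel-spectrogram or
% phonemes), \bar\alpha: noise level. The noise predictor \epsilon_\theta
% is, up to scale, the score \nabla_{\tilde{y}} \log p(\tilde{y} \mid x).
\tilde{y} = \sqrt{\bar\alpha}\, y_0 + \sqrt{1 - \bar\alpha}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I),
\qquad
\mathcal{L}(\theta) = \mathbb{E}_{y_0,\, \epsilon,\, \bar\alpha}
\big\| \epsilon_\theta(\tilde{y},\, x,\, \sqrt{\bar\alpha}) - \epsilon \big\|_1
```

Sampling then runs the same kind of refinement loop sketched after the main abstract; per the summary above, as few as six iterations can already give high fidelity.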