DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform
Generation
- URL: http://arxiv.org/abs/2310.01381v3
- Date: Sun, 10 Mar 2024 22:31:45 GMT
- Title: DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform
Generation
- Authors: Roi Benita, Michael Elad, Joseph Keshet
- Abstract summary: This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform.
Our model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one.
Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems.
- Score: 25.968115316199246
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models have recently been shown to be relevant for high-quality
speech generation. Most work has focused on generating spectrograms, and such
approaches therefore require a subsequent model to convert the spectrogram into
a waveform (i.e., a vocoder). This work proposes a diffusion probabilistic
end-to-end model for generating a raw speech waveform. The proposed model is
autoregressive, generating overlapping frames sequentially, where each frame is
conditioned on a portion of the previously generated one. Hence, our model can
synthesize speech of effectively unlimited duration while preserving high
fidelity and temporal coherence. We implemented the proposed
model for unconditional and conditional speech generation, where the latter can
be driven by an input sequence of phonemes, amplitudes, and pitch values.
Working on the waveform directly has some empirical advantages. Specifically,
it allows the creation of local acoustic behaviors, like vocal fry, which makes
the overall waveform sound more natural. Furthermore, the proposed diffusion
model is stochastic rather than deterministic; therefore, each inference generates
a slightly different waveform variation, enabling an abundance of valid
realizations. Experiments show that the proposed model generates speech with
superior quality compared with other state-of-the-art neural speech generation
systems.
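The autoregressive overlapping-frame scheme described in the abstract can be summarized in code. Below is a minimal, hypothetical sketch of the inference loop (not the authors' implementation): each frame is sampled with standard DDPM-style reverse diffusion while the denoiser is conditioned on the tail of the previously generated frame. The frame length, overlap, noise schedule, and the dummy noise predictor are all illustrative assumptions.

```python
import numpy as np

FRAME_LEN = 16000   # samples per frame (illustrative)
OVERLAP = 4000      # samples shared with the previous frame (illustrative)
NUM_STEPS = 200     # reverse-diffusion steps (illustrative)

# Standard DDPM linear noise schedule (an assumption, not the paper's exact one).
BETAS = np.linspace(1e-4, 0.02, NUM_STEPS)
ALPHAS = 1.0 - BETAS
ALPHA_BARS = np.cumprod(ALPHAS)

def predict_noise(x_t, t, cond):
    """Stand-in for the trained denoiser eps_theta(x_t, t, cond). In DiffAR this
    would be a neural network conditioned on the previous frame's tail (and, in
    the conditional variant, on phonemes, amplitudes, and pitch values)."""
    return np.zeros_like(x_t)  # dummy prediction so the sketch runs end to end

def denoise_step(x_t, t, cond):
    # One ancestral DDPM sampling step p(x_{t-1} | x_t).
    eps = predict_noise(x_t, t, cond)
    mean = (x_t - BETAS[t] / np.sqrt(1.0 - ALPHA_BARS[t]) * eps) / np.sqrt(ALPHAS[t])
    if t == 0:
        return mean
    return mean + np.sqrt(BETAS[t]) * np.random.randn(*x_t.shape)

def sample_frame(prev_tail):
    # Start from Gaussian noise and iteratively denoise, conditioning every
    # step on the overlap with the previously generated frame.
    x = np.random.randn(FRAME_LEN)
    for t in reversed(range(NUM_STEPS)):
        x = denoise_step(x, t, prev_tail)
    return x

def generate(num_frames):
    waveform = []
    prev_tail = np.zeros(OVERLAP)  # silence bootstraps the first frame (assumed)
    for i in range(num_frames):
        frame = sample_frame(prev_tail)
        # Emit only the new samples; the overlap region was already emitted.
        waveform.append(frame if i == 0 else frame[OVERLAP:])
        prev_tail = frame[-OVERLAP:]
    return np.concatenate(waveform)
```

Because sampling starts from fresh noise for every frame, each call to generate yields a slightly different but valid realization, which is the stochasticity the abstract highlights.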
Related papers
- PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation [37.35829410807451]
We propose PeriodWave, a novel universal waveform generation model.
We introduce a period-aware flow matching estimator that can capture the periodic features of the waveform signal.
We also propose a single period-conditional universal estimator that supports parallel feed-forward inference through period-wise batching.
arXiv Detail & Related papers (2024-08-14T13:36:17Z) - From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
However, these models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z) - SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers [50.90457644954857]
In this work, we apply diffusion models to approach sequence-to-sequence text generation.
We propose SeqDiffuSeq, a text diffusion model for sequence-to-sequence generation.
Experimental results demonstrate strong performance on sequence-to-sequence generation in terms of both text quality and inference time.
arXiv Detail & Related papers (2022-12-20T15:16:24Z) - Period VITS: Variational Inference with Explicit Pitch Modeling for
End-to-end Emotional Speech Synthesis [19.422230767803246]
We propose Period VITS, a novel end-to-end text-to-speech model that incorporates an explicit periodicity generator.
In the proposed method, we introduce a frame pitch predictor that predicts prosodic features, such as pitch and voicing flags, from the input text.
From these features, the proposed periodicity generator produces a sample-level sinusoidal source that enables the waveform decoder to accurately reproduce the pitch (a schematic sinusoidal source generator is sketched after the related papers below).
arXiv Detail & Related papers (2022-10-28T07:52:30Z) - Conditional Diffusion Probabilistic Model for Speech Enhancement [101.4893074984667]
We propose a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes.
In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models.
arXiv Detail & Related papers (2022-02-10T18:58:01Z) - A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose DiffuSE, a diffusion probabilistic model-based speech enhancement method that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z) - WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis [80.60577805727624]
WaveGrad 2 is a non-autoregressive generative model for text-to-speech synthesis.
It can generate high-fidelity audio, approaching the performance of a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2021-06-17T17:09:21Z) - PeriodNet: A non-autoregressive waveform generation model with a
structure separating periodic and aperiodic components [32.3009716052971]
We propose a non-autoregressive (non-AR) waveform generation model with a new model structure for modeling periodic and aperiodic components in speech waveforms.
Non-AR waveform generation models can generate speech waveforms in parallel and can be used as speech vocoders by conditioning on acoustic features.
arXiv Detail & Related papers (2021-02-15T19:00:08Z) - Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis [25.234945748885348]
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.
The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop.
Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2020-11-06T19:30:07Z) - DiffWave: A Versatile Diffusion Model for Audio Synthesis [35.406438835268816]
DiffWave is a versatile diffusion probabilistic model for conditional and unconditional waveform generation.
It produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on a mel spectrogram.
It significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task.
arXiv Detail & Related papers (2020-09-21T11:20:38Z) - WaveGrad: Estimating Gradients for Waveform Generation [55.405580817560754]
WaveGrad is a conditional model for waveform generation which estimates gradients of the data density.
It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram.
We find that it can generate high-fidelity audio samples using as few as six iterations.
arXiv Detail & Related papers (2020-09-02T17:44:10Z)
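As a concrete illustration of one recurring ingredient above, here is a minimal, hypothetical sketch of the sample-level sinusoidal source described in the Period VITS summary: frame-level pitch (F0) and voicing flags are upsampled to the sample rate, and the instantaneous frequency is integrated into a phase. The names, hop size, and noise treatment of unvoiced regions are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

SAMPLE_RATE = 22050  # Hz (assumed)
HOP = 256            # samples per frame (assumed)

def sinusoidal_source(f0, voiced):
    """f0: per-frame pitch in Hz; voiced: per-frame 0/1 voicing flags."""
    # Upsample frame-level features to sample level by repetition
    # (linear interpolation would be smoother; repetition keeps this short).
    f0_up = np.repeat(f0, HOP)
    v_up = np.repeat(voiced, HOP).astype(float)
    # Integrate instantaneous frequency into phase, then take the sine.
    phase = 2.0 * np.pi * np.cumsum(f0_up / SAMPLE_RATE)
    source = np.sin(phase) * v_up
    # Fill unvoiced regions with low-level noise, a common choice in
    # source-excitation vocoders (an assumption here).
    source += 0.03 * np.random.randn(source.size) * (1.0 - v_up)
    return source

# Example: 100 frames of a constant 220 Hz pitch, all voiced.
excitation = sinusoidal_source(np.full(100, 220.0), np.ones(100))
```

A waveform decoder conditioned on such a source does not have to infer periodicity on its own, which is why explicit pitch modeling helps reproduce F0 accurately.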