Continuous Wavelet Vocoder-based Decomposition of Parametric Speech
Waveform Synthesis
- URL: http://arxiv.org/abs/2106.06863v1
- Date: Sat, 12 Jun 2021 20:55:44 GMT
- Title: Continuous Wavelet Vocoder-based Decomposition of Parametric Speech
Waveform Synthesis
- Authors: Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Csaba Zainkó, Géza Németh
- Abstract summary: Speech technology systems have adopted the vocoder approach to synthesizing the speech waveform.
WaveNet is one of the best models, producing output that closely resembles the human voice.
- Score: 2.6572330982240935
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To date, various speech technology systems have adopted the vocoder approach,
a method of synthesizing the speech waveform that plays a major role in the
performance of statistical parametric speech synthesis. WaveNet, one of the best
models at approximating the human voice, has to generate the waveform sequentially
in a time-consuming manner, through an extremely complex neural network structure.
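For a concrete sense of the continuous wavelet analysis suggested by the title, the sketch below decomposes a synthetic log-F0 contour into multiple scales with PyWavelets. The frame rate, dyadic scales, and Mexican-hat mother wavelet are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of continuous-wavelet decomposition of an F0 contour,
# assuming the PyWavelets library; all settings here are illustrative.
import numpy as np
import pywt

fs = 200                                        # frames per second of the F0 track (assumed)
t = np.arange(0, 2, 1 / fs)                     # 2 seconds of frames
f0 = 120 + 20 * np.sin(2 * np.pi * 0.5 * t)     # synthetic F0 contour in Hz

log_f0 = np.log(f0)                             # vocoders typically model log-F0
scales = 2 ** np.arange(1, 11)                  # dyadic scales (assumption)
coeffs, freqs = pywt.cwt(log_f0, scales, "mexh")  # Mexican-hat mother wavelet

print(coeffs.shape)                             # (n_scales, n_frames): one band per scale
```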
Related papers
- DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation [25.968115316199246]
This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform.
Our model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one.
Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems.
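A minimal sketch of the overlapping-frame autoregressive loop this describes, with a placeholder `denoise` standing in for the actual reverse-diffusion sampler; the frame length, overlap, and linear cross-fade are illustrative assumptions.

```python
# Overlapping-frame autoregression: each frame is conditioned on the tail of
# the previously generated one, then cross-faded into the output buffer.
import numpy as np

frame_len, overlap = 1600, 400          # samples (assumed)
hop = frame_len - overlap

def denoise(noise: np.ndarray, context: np.ndarray) -> np.ndarray:
    """Placeholder for the reverse-diffusion sampler conditioned on the
    tail of the previous frame (here just a toy smoothing)."""
    return 0.1 * noise + 0.9 * np.pad(context, (0, frame_len - len(context)))

rng = np.random.default_rng(0)
n_frames = 10
out = np.zeros(hop * n_frames + overlap)
context = np.zeros(overlap)
fade_in = np.linspace(0.0, 1.0, overlap)

for i in range(n_frames):
    frame = denoise(rng.standard_normal(frame_len), context)
    start = i * hop
    # Cross-fade the new frame into the overlap region of the existing output.
    frame[:overlap] = fade_in * frame[:overlap] + (1 - fade_in) * out[start:start + overlap]
    out[start:start + frame_len] = frame
    context = frame[-overlap:]          # condition the next frame on this tail
```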
arXiv Detail & Related papers (2023-10-02T17:42:22Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS which can retain high speech quality and acquire high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
It is also 28% faster than WaveGAN on a single core.
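To illustrate the deterministic-plus-stochastic multiband idea, the sketch below mixes a pulse train and noise per frequency band with fixed voicing weights; in NeuralDPS those per-band gains come from the network, and the band edges and weights here are assumptions.

```python
# Deterministic-plus-stochastic multiband excitation: per band, blend a
# periodic pulse train (deterministic) with filtered noise (stochastic).
import numpy as np
from scipy.signal import butter, sosfilt

fs, f0, dur = 16000, 120, 1.0
n = int(fs * dur)

phase = np.cumsum(np.full(n, f0 / fs))
pulses = (phase % 1.0 < f0 / fs).astype(float)       # one pulse per period
noise = np.random.default_rng(0).standard_normal(n)

bands = [(50, 1000), (1000, 4000), (4000, 7900)]     # Hz (assumed split)
voicing = [0.9, 0.5, 0.1]                            # deterministic share per band

excitation = np.zeros(n)
for (lo, hi), v in zip(bands, voicing):
    sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
    excitation += v * sosfilt(sos, pulses) + (1 - v) * sosfilt(sos, noise)
```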
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- Differentiable Wavetable Synthesis [7.585969077788285]
Differentiable Wavetable Synthesis (DWTS) is a technique for neural audio synthesis which learns a dictionary of one-period waveforms.
We achieve high-fidelity audio synthesis with as little as 10 to 20 wavetables.
We show audio manipulations, such as high quality pitch-shifting, using only a few seconds of input audio.
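A minimal sketch of the wavetable read-out being described: a weighted mix of one-period tables, scanned at a phase driven by F0, with linear interpolation keeping the read-out differentiable. The table count, length, and mixing weights are fixed illustrative stand-ins for what DWTS learns end to end.

```python
# Wavetable synthesis from a small dictionary of one-period waveforms.
import numpy as np

K, L, fs, f0 = 4, 512, 16000, 220                   # tables, table length (assumed)
idx = np.arange(L)
tables = np.stack([np.sin(2 * np.pi * (k + 1) * idx / L) for k in range(K)])
weights = np.array([0.6, 0.25, 0.1, 0.05])          # mixing weights (assumed)

n = fs                                              # one second of audio
phase = np.cumsum(np.full(n, f0 / fs)) % 1.0        # normalized phase in [0, 1)
pos = phase * L
lo = np.floor(pos).astype(int)
frac = pos - lo

table = weights @ tables                            # mix tables into one cycle
# Linear interpolation keeps the read-out differentiable w.r.t. the tables.
audio = table[lo % L] * (1 - frac) + table[(lo + 1) % L] * frac
```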
arXiv Detail & Related papers (2021-11-19T01:42:42Z)
- WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis [80.60577805727624]
WaveGrad 2 is a non-autoregressive generative model for text-to-speech synthesis.
It can generate high fidelity audio, approaching the performance of a state-of-the-art neural TTS system.
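The iterative refinement referred to here is DDPM-style sampling; a minimal sketch follows, with `predict_noise` standing in for the trained, text-conditioned network and a 50-step linear beta schedule chosen purely for illustration.

```python
# DDPM-style iterative refinement: start from noise and repeatedly denoise.
import numpy as np

T = 50
betas = np.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def predict_noise(y: np.ndarray, step: int) -> np.ndarray:
    """Placeholder for the conditioned noise-prediction network."""
    return np.zeros_like(y)

rng = np.random.default_rng(0)
y = rng.standard_normal(16000)                      # start from pure noise
for step in reversed(range(T)):
    eps = predict_noise(y, step)
    coef = betas[step] / np.sqrt(1.0 - alpha_bar[step])
    y = (y - coef * eps) / np.sqrt(alphas[step])    # posterior mean update
    if step > 0:                                    # add noise except at the end
        y += np.sqrt(betas[step]) * rng.standard_normal(y.shape)
```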
arXiv Detail & Related papers (2021-06-17T17:09:21Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs)
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
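A minimal sketch of the generator's encoder-decoder shape, assuming PyTorch: per-frame visual features, temporal modeling, then transposed convolutions upsampling to waveform rate. Layer sizes and the roughly 64-samples-per-frame upsampling factor are illustrative, not the paper's.

```python
# Encoder-decoder generator: raw video frames in, raw waveform out.
import torch
import torch.nn as nn

class VideoToSpeech(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: per-frame 2D conv features pooled over space.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.rnn = nn.GRU(64, 128, batch_first=True)        # temporal modeling
        # Decoder: upsample frame-rate features toward sample rate.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(128, 64, 16, stride=8, padding=4), nn.ReLU(),
            nn.ConvTranspose1d(64, 1, 16, stride=8, padding=4), nn.Tanh(),
        )

    def forward(self, video):                               # (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.encoder(video.flatten(0, 1)).view(b, t, 64)
        feats, _ = self.rnn(feats)
        return self.decoder(feats.transpose(1, 2))          # (B, 1, ~T*64)

wav = VideoToSpeech()(torch.randn(2, 25, 3, 48, 48))        # -> (2, 1, 1600)
```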
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- Towards Multi-Scale Style Control for Expressive Speech Synthesis [60.08928435252417]
The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech.
During training time, the multi-scale style model could be jointly trained with the speech synthesis model in an end-to-end fashion.
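A minimal sketch of a two-scale reference encoder in PyTorch: a pooled, utterance-level style vector alongside a downsampled, quasi-phoneme-rate style sequence. All dimensions and the 8x pooling factor are assumptions.

```python
# Multi-scale reference encoder: global (utterance) and local (quasi-phoneme)
# style features extracted from a mel spectrogram of the target speech.
import torch
import torch.nn as nn

class MultiScaleStyleEncoder(nn.Module):
    def __init__(self, n_mels=80, d_style=128):
        super().__init__()
        self.frame_net = nn.Conv1d(n_mels, d_style, 5, padding=2)
        self.local_pool = nn.AvgPool1d(8)        # ~8x toward phoneme rate

    def forward(self, mel):                      # mel: (B, n_mels, T)
        h = torch.relu(self.frame_net(mel))      # (B, d_style, T)
        local = self.local_pool(h)               # (B, d_style, T // 8)
        global_ = h.mean(dim=2)                  # (B, d_style)
        return global_, local

g, l = MultiScaleStyleEncoder()(torch.randn(2, 80, 96))
print(g.shape, l.shape)                          # (2, 128) and (2, 128, 12)
```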
arXiv Detail & Related papers (2021-04-08T05:50:09Z)
- Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis [47.30453049606897]
We find that fine-tuning a multi-speaker model from found audiobook data can improve the naturalness of synthetic speech and its similarity to unseen target speakers.
We also find that listeners can distinguish between 16 kHz and 24 kHz sampling rates, and that WaveRNN produces output waveforms of a quality comparable to WaveNet's.
arXiv Detail & Related papers (2020-11-10T00:19:04Z)
- Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis [25.234945748885348]
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.
The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop.
Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system.
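A minimal sketch of the idea of emitting a waveform block per decoder step by sampling through a normalizing flow; the single conditioned affine map below stands in for the real stacked invertible layers, and the block size is an assumption.

```python
# Per-step flow sampling inside an autoregressive decoder loop.
import numpy as np

block = 960                 # waveform samples emitted per decoder step (assumed)

def flow(z: np.ndarray, cond: np.ndarray) -> np.ndarray:
    """Invertible affine map z -> x conditioned on the decoder state; the
    scale/shift here are toy functions of `cond`, not network outputs."""
    scale = np.exp(0.1 * np.tanh(cond))
    shift = 0.05 * cond
    return scale * z + shift

rng = np.random.default_rng(0)
state = np.zeros(block)
blocks = []
for step in range(5):                   # autoregressive decoder loop
    z = rng.standard_normal(block)      # sample from the flow's base density
    x = flow(z, state)                  # map through the flow to get audio
    blocks.append(x)
    state = x                           # condition the next step on the output
audio = np.concatenate(blocks)
```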
arXiv Detail & Related papers (2020-11-06T19:30:07Z)
- DiffWave: A Versatile Diffusion Model for Audio Synthesis [35.406438835268816]
DiffWave is a versatile diffusion probabilistic model for conditional and unconditional waveform generation.
It produces high-fidelity audios in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram.
It significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task.
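Complementing the sampling loop sketched for WaveGrad 2 above, this is the matching diffusion training objective: noise a clean waveform to a random step and regress the injected noise. `model` is a placeholder for DiffWave's dilated-convolution network, and the schedule is illustrative.

```python
# Diffusion training step: epsilon-prediction MSE at a random noise level.
import numpy as np

T = 50
betas = np.linspace(1e-4, 0.05, T)
alpha_bar = np.cumprod(1.0 - betas)

def model(noisy: np.ndarray, step: int, mel=None) -> np.ndarray:
    """Placeholder noise predictor, optionally mel-conditioned for vocoding."""
    return np.zeros_like(noisy)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16000)                     # stand-in clean waveform
step = rng.integers(T)
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bar[step]) * x0 + np.sqrt(1 - alpha_bar[step]) * eps
loss = np.mean((model(x_t, step) - eps) ** 2)       # regress the injected noise
```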
arXiv Detail & Related papers (2020-09-21T11:20:38Z)
- VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
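A minimal sketch of a pitch-conditioned CVAE over a parametric frame, assuming PyTorch; the frame dimensionality and the choice of raw pitch in Hz as the conditioning are illustrative assumptions.

```python
# Conditional VAE: encode a parametric frame with its pitch, decode with the
# same pitch, so generation can be steered by changing the condition.
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, d_param=60, d_cond=1, d_z=16):
        super().__init__()
        self.enc = nn.Linear(d_param + d_cond, 2 * d_z)   # -> mean, log-variance
        self.dec = nn.Linear(d_z + d_cond, d_param)

    def forward(self, x, pitch):
        mu, logvar = self.enc(torch.cat([x, pitch], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(torch.cat([z, pitch], -1)), mu, logvar

model = CVAE()
x, pitch = torch.randn(8, 60), torch.full((8, 1), 440.0)
recon, mu, logvar = model(x, pitch)
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), -1).mean()
loss = nn.functional.mse_loss(recon, x) + kl          # ELBO with unit-weight KL
```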
arXiv Detail & Related papers (2020-03-30T16:05:47Z)