Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band
Generation and Inverse Short-Time Fourier Transform
- URL: http://arxiv.org/abs/2210.15975v1
- Date: Fri, 28 Oct 2022 08:15:05 GMT
- Title: Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band
Generation and Inverse Short-Time Fourier Transform
- Authors: Masaya Kawamura, Yuma Shirahata, Ryuichi Yamamoto, Kentaro Tachibana
- Abstract summary: We propose a lightweight end-to-end text-to-speech model using multi-band generation and inverse short-time Fourier transform.
Experimental results show that our model synthesized speech as natural as that synthesized by VITS.
A smaller version of the model significantly outperformed a lightweight baseline model with respect to both naturalness and inference speed.
- Score: 9.606821628015933
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a lightweight end-to-end text-to-speech model using multi-band
generation and inverse short-time Fourier transform. Our model is based on
VITS, a high-quality end-to-end text-to-speech model, but adopts two changes
for more efficient inference: 1) the most computationally expensive component
is partially replaced with a simple inverse short-time Fourier transform, and
2) multi-band generation, with fixed or trainable synthesis filters, is used to
generate waveforms. Unlike conventional lightweight models, which employ
optimization or knowledge distillation separately to train two cascaded
components, our method enjoys the full benefits of end-to-end optimization.
Experimental results show that our model synthesized speech as natural as that
synthesized by VITS, while achieving a real-time factor of 0.066 on an Intel
Core i7 CPU, 4.1 times faster than VITS. Moreover, a smaller version of the
model significantly outperformed a lightweight baseline model with respect to
both naturalness and inference speed. Code and audio samples are available from
https://github.com/MasayaKawamura/MB-iSTFT-VITS.
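The two efficiency ideas are easy to picture in code. Below is a minimal PyTorch sketch of multi-band iSTFT synthesis, assuming a decoder that already emits per-sub-band magnitude and phase spectrograms; every name, shape, and the filter bank here is illustrative rather than the authors' actual API (the repository above has the real implementation).

```python
import torch
import torch.nn.functional as F

def multiband_istft_synthesis(mag, phase, synth_filter, n_fft=16, hop=4):
    """mag, phase: (batch, n_bands, n_fft // 2 + 1, frames) sub-band spectra.
    synth_filter: (1, n_bands, taps), a fixed or trainable filter bank."""
    b, n_bands, n_bins, frames = mag.shape
    # Recombine magnitude and phase into complex sub-band STFTs.
    spec = (mag * torch.exp(1j * phase)).reshape(b * n_bands, n_bins, frames)
    # The expensive upsampling stack is replaced by a cheap inverse STFT.
    sub = torch.istft(spec, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft))
    sub = sub.reshape(b, n_bands, -1)                    # (b, n_bands, T_sub)
    # Zero-insertion upsampling by the band count, then filter-and-sum:
    # a conv1d with in_channels=n_bands and out_channels=1 merges the bands.
    up = torch.zeros(b, n_bands, sub.size(-1) * n_bands)
    up[:, :, ::n_bands] = sub
    return F.conv1d(up, synth_filter, padding=synth_filter.size(-1) // 2)

# Toy usage with random spectra and a random stand-in for a PQMF-style bank.
mag = torch.rand(2, 4, 9, 32)
phase = torch.rand(2, 4, 9, 32) * 6.2832
wav = multiband_istft_synthesis(mag, phase, torch.randn(1, 4, 63))  # (2, 1, 496)
```

Because every step above is differentiable, the sub-band decoder and, optionally, the synthesis filters can be trained jointly with the rest of the model, which is the end-to-end benefit the abstract contrasts with cascaded lightweight systems.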
Related papers
- Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models.
Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information.
Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
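As a rough illustration of how an FFT can be folded into prompt embeddings, the sketch below mixes a frequency-domain view of learnable prompt tokens back into the tokens before prepending them to the input sequence; the layer sizes and the simple real-part mixing are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FourierPrompt(nn.Module):
    def __init__(self, n_prompts=8, dim=768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)

    def forward(self, tokens):                  # tokens: (batch, seq, dim)
        # Fast Fourier Transform over the embedding dimension; keeping the
        # real part leaves the prompt a real-valued tensor.
        freq = torch.fft.fft(self.prompt, dim=-1).real
        p = (self.prompt + freq).unsqueeze(0).expand(tokens.size(0), -1, -1)
        return torch.cat([p, tokens], dim=1)    # prepend prompts to the sequence
```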
arXiv Detail & Related papers (2024-11-02T18:18:35Z)
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
- ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech [37.29193613404699]
Denoising diffusion probabilistic models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability to generate high-fidelity samples.
Previous works have explored speeding up inference speed by minimizing the number of inference steps but at the cost of sample quality.
We propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model.
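The refinement idea reduces to a short inference loop: sample a residual with a few reverse-diffusion steps, conditioned on the existing model's output, and add it back. The sketch below is conceptual; `tts_model`, `denoiser`, and the step count are placeholders, and the noise schedule is omitted.

```python
import torch

@torch.no_grad()
def resgrad_infer(text, tts_model, denoiser, n_steps=4):
    coarse = tts_model(text)                 # (batch, mel_bins, frames)
    residual = torch.randn_like(coarse)      # start the residual from noise
    for t in reversed(range(n_steps)):       # few-step reverse diffusion
        # The denoiser improves the residual estimate, conditioned on the
        # coarse spectrogram and the current step index.
        residual = denoiser(residual, coarse, t)
    return coarse + residual                 # refined spectrogram

# Toy stand-ins so the sketch runs end to end:
tts = lambda text: torch.zeros(1, 80, 100)
den = lambda r, c, t: 0.5 * r
mel = resgrad_infer("hello", tts, den)       # (1, 80, 100)
```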
arXiv Detail & Related papers (2022-12-30T02:31:35Z)
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis [90.3069686272524]
This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis.
FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies.
Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms.
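A location-variable convolution differs from an ordinary convolution in that the kernels are predicted per segment from local conditioning rather than shared globally. The stripped-down sketch below shows only that core mechanic; the real FastDiff adds diffusion-step awareness and diverse dilation patterns.

```python
import torch
import torch.nn.functional as F

def location_variable_conv(x, kernels):
    """x: (batch, 1, T); kernels: (batch, n_segments, k), one kernel per
    segment, e.g. predicted by a small network from mel conditioning."""
    b, _, t = x.shape
    n_seg, k = kernels.shape[1], kernels.shape[2]
    x = F.pad(x, (k // 2, k // 2))
    # Sliding windows over time, grouped by the segment they fall in.
    win = x.unfold(-1, k, 1)[..., :t, :].reshape(b, n_seg, t // n_seg, k)
    # Each segment's windows are filtered by that segment's own kernel.
    return torch.einsum('bnsk,bnk->bns', win, kernels).reshape(b, 1, t)

y = location_variable_conv(torch.randn(2, 1, 80), torch.randn(2, 8, 5))
```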
arXiv Detail & Related papers (2022-04-21T07:49:09Z)
- Differentiable Duration Modeling for End-to-End Text-to-Speech [6.571447892202893]
Parallel text-to-speech (TTS) models have recently enabled fast and highly natural speech synthesis.
We propose a differentiable duration method for learning monotonic alignments between input and output sequences.
Our model learns to perform high-fidelity synthesis through a combination of adversarial training and matching the total ground-truth duration.
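One standard way to make duration-driven expansion differentiable is Gaussian upsampling, where each output frame softly and monotonically attends to tokens by distance from their duration-derived centers. The sketch below shows that generic idea, not necessarily this paper's exact formulation.

```python
import torch

def gaussian_upsample(h, durations, sigma=1.0):
    """h: (batch, n_tokens, dim); durations: (batch, n_tokens), positive."""
    # Token centers implied by cumulative durations.
    centers = torch.cumsum(durations, dim=1) - 0.5 * durations
    t_len = int(durations.sum(dim=1).max().item())
    t = torch.arange(t_len, device=h.device).view(1, -1, 1)
    # Soft monotonic alignment: frames weight tokens by squared distance.
    logits = -((t - centers.unsqueeze(1)) ** 2) / (2 * sigma ** 2)
    w = torch.softmax(logits, dim=-1)          # (batch, t_len, n_tokens)
    return w @ h                               # (batch, t_len, dim)

out = gaussian_upsample(torch.randn(1, 5, 16),
                        torch.tensor([[2., 3., 1., 4., 2.]]))  # (1, 12, 16)
```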
arXiv Detail & Related papers (2022-03-21T15:14:44Z)
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder, NeuralDPS, that retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
It is also 28% faster than WaveGAN on a single CPU core.
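The "deterministic plus stochastic" split with multi-band excitation can be pictured as per-band mixing of a periodic pulse train with noise under voicing weights. The sketch below is a schematic reading of the summary; shapes, names, and the filter bank are all illustrative.

```python
import torch
import torch.nn.functional as F

def dps_excitation(pulse, noise, voicing, band_filters):
    """pulse, noise: (batch, 1, T); voicing: (batch, n_bands, 1) in [0, 1];
    band_filters: (n_bands, 1, taps) band-pass filters."""
    pad = band_filters.size(-1) // 2
    p = F.conv1d(pulse, band_filters, padding=pad)  # deterministic part, per band
    n = F.conv1d(noise, band_filters, padding=pad)  # stochastic part, per band
    # Voiced bands lean on the pulse train, unvoiced bands on the noise;
    # varying `voicing` is what gives the noise controllability.
    return (voicing * p + (1 - voicing) * n).sum(dim=1, keepdim=True)
```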
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
- Fast-Slow Transformer for Visually Grounding Speech [15.68151998164009]
We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS.
FaST-VGS is a Transformer-based model for learning the associations between raw speech waveforms and visual images.
arXiv Detail & Related papers (2021-09-16T18:45:45Z)
- Neural Waveshaping Synthesis [0.0]
We present a novel, lightweight, fully causal approach to neural audio synthesis.
The Neural Waveshaping Unit (NEWT) operates directly in the waveform domain.
It produces complex timbral evolutions by simple affine transformations of its input and output signals.
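Read literally, that is a small learned waveshaper wrapped in time-varying affine transforms of its input and output. The sketch below captures that shape; the layer sizes and the control pathway that would produce the affine parameters are assumptions.

```python
import torch
import torch.nn as nn

class WaveshapingUnit(nn.Module):
    def __init__(self, hidden=8):
        super().__init__()
        # The shaper: a tiny MLP applied samplewise in the waveform domain.
        self.shaper = nn.Sequential(
            nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, x, a_in, b_in, a_out, b_out):
        """x: (batch, T) exciter; a_*, b_*: (batch, T) time-varying params."""
        shaped = self.shaper((a_in * x + b_in).unsqueeze(-1)).squeeze(-1)
        return a_out * shaped + b_out  # affine transform of the output signal
```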
arXiv Detail & Related papers (2021-07-11T13:50:59Z)
- Synthesizer: Rethinking Self-Attention in Transformer Models [93.08171885200922]
Dot product self-attention is held to be central and indispensable to state-of-the-art Transformer models.
This paper investigates the true importance and contribution of the dot-product-based self-attention mechanism to the performance of Transformer models.
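The question can be made concrete in code: the paper's "dense synthesizer" variant predicts the attention map from each token alone, with no query-key dot product at all. A minimal single-head sketch of that variant, with the fixed maximum length the dense formulation requires:

```python
import torch
import torch.nn as nn

class DenseSynthesizerAttention(nn.Module):
    def __init__(self, dim, max_len):
        super().__init__()
        # Attention logits synthesized from each token independently.
        self.synth = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, max_len))
        self.value = nn.Linear(dim, dim)

    def forward(self, x):                     # x: (batch, T, dim), T <= max_len
        t = x.size(1)
        logits = self.synth(x)[:, :, :t]      # no Q·K^T anywhere
        return torch.softmax(logits, dim=-1) @ self.value(x)
```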
arXiv Detail & Related papers (2020-05-02T08:16:19Z)
- VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
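A generic conditional VAE skeleton matching that description: encode a parametric frame conditioned on pitch, decode under the same condition, and vary the pitch input at generation time for control. Dimensions and layers here are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, x_dim=60, c_dim=1, z_dim=16, h=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, h), nn.ReLU())
        self.mu, self.logvar = nn.Linear(h, z_dim), nn.Linear(h, z_dim)
        self.dec = nn.Sequential(
            nn.Linear(z_dim + c_dim, h), nn.ReLU(), nn.Linear(h, x_dim))

    def forward(self, x, pitch):               # x: parametric frame, pitch: cond
        hid = self.enc(torch.cat([x, pitch], dim=-1))
        mu, logvar = self.mu(hid), self.logvar(hid)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.dec(torch.cat([z, pitch], dim=-1)), mu, logvar
```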
arXiv Detail & Related papers (2020-03-30T16:05:47Z)
- Efficient Trainable Front-Ends for Neural Speech Enhancement [22.313111311130665]
We present an efficient, trainable front-end based on the butterfly mechanism to compute the Fast Fourier Transform.
We show its accuracy and efficiency benefits for low-compute neural speech enhancement models.
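The butterfly mechanism exploits the fact that an N-point FFT factors into log2(N) sparse stages of 2x2 mixes, so a front-end can stack learnable butterfly stages with FFT-style wiring and train them end to end. The sketch below shows only that wiring with random initialization; a faithful version would initialize from FFT twiddle factors and handle complex values.

```python
import torch
import torch.nn as nn

class ButterflyStage(nn.Module):
    def __init__(self, n, stride):
        super().__init__()
        self.stride = stride
        # One trainable 2x2 mix per butterfly pair (n // 2 pairs per stage).
        self.w = nn.Parameter(torch.randn(n // 2, 2, 2) * 0.5)

    def forward(self, x):                      # x: (batch, n)
        b, n = x.shape
        # Pair elements `stride` apart, FFT style, then mix each pair.
        x = x.view(b, n // (2 * self.stride), 2, self.stride)
        pair = x.permute(0, 1, 3, 2).reshape(b, n // 2, 2)
        mixed = torch.einsum('bpi,pij->bpj', pair, self.w)
        x = mixed.reshape(b, n // (2 * self.stride), self.stride, 2)
        return x.permute(0, 1, 3, 2).reshape(b, n)

class ButterflyTransform(nn.Module):
    def __init__(self, n=8):                   # n must be a power of two
        super().__init__()
        self.stages = nn.Sequential(*[
            ButterflyStage(n, n // 2 ** (i + 1))
            for i in range(n.bit_length() - 1)])

    def forward(self, x):
        return self.stages(x)

y = ButterflyTransform(8)(torch.randn(2, 8))   # log2(8) = 3 sparse stages
```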
arXiv Detail & Related papers (2020-02-20T01:51:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.