iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating
Inverse Short-Time Fourier Transform
- URL: http://arxiv.org/abs/2203.02395v1
- Date: Fri, 4 Mar 2022 16:05:48 GMT
- Title: iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating
Inverse Short-Time Fourier Transform
- Authors: Takuhiro Kaneko, Kou Tanaka, Hirokazu Kameoka, Shogo Seki
- Abstract summary: A mel-spectrogram vocoder must solve three inverse problems: recovery of the original-scale magnitude spectrogram, phase reconstruction, and frequency-to-time conversion.
A typical convolutional mel-spectrogram vocoder solves these problems jointly and implicitly using a convolutional neural network.
We propose iSTFTNet, which replaces some output-side layers of the mel-spectrogram vocoder with the inverse short-time Fourier transform.
- Score: 38.271530231451834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent text-to-speech synthesis and voice conversion systems, a
mel-spectrogram is commonly applied as an intermediate representation, and the
necessity for a mel-spectrogram vocoder is increasing. A mel-spectrogram
vocoder must solve three inverse problems: recovery of the original-scale
magnitude spectrogram, phase reconstruction, and frequency-to-time conversion.
A typical convolutional mel-spectrogram vocoder solves these problems jointly
and implicitly using a convolutional neural network that includes temporal
upsampling layers and directly computes a raw waveform. Such an approach
allows skipping redundant processes during waveform synthesis (e.g., the direct
reconstruction of high-dimensional original-scale spectrograms). However, this
approach solves all the problems as a black box and cannot effectively
exploit the time-frequency structures present in a mel-spectrogram. We thus
propose
iSTFTNet, which replaces some output-side layers of the mel-spectrogram vocoder
with the inverse short-time Fourier transform (iSTFT) after sufficiently
reducing the frequency dimension using upsampling layers, thereby reducing the
computational cost of black-box modeling and avoiding redundant estimation of
high-dimensional spectrograms. In our experiments, we applied our ideas to
three HiFi-GAN variants and made the models faster and more lightweight while
maintaining reasonable speech quality. Audio samples are available at
https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/istftnet/.
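To make the core idea concrete, the following is a minimal, hypothetical PyTorch sketch of an iSTFT-based output head in the spirit of iSTFTNet: once the generator's upsampling layers have sufficiently reduced the frequency dimension, a small convolution predicts a low-resolution magnitude and phase spectrogram, and torch.istft converts them into a waveform. The module name, layer sizes, and FFT/hop settings below are illustrative assumptions rather than the paper's exact configuration.

```python
# A hypothetical iSTFT output head (names and sizes are assumptions, not the
# paper's exact configuration): predict magnitude and phase at a reduced
# frequency resolution, then let the inverse STFT perform the final
# frequency-to-time conversion.
import torch
import torch.nn as nn


class ISTFTHead(nn.Module):
    def __init__(self, channels: int = 128, n_fft: int = 16, hop_length: int = 4):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        n_bins = n_fft // 2 + 1
        # One projection produces both log-magnitude and phase channels.
        self.proj = nn.Conv1d(channels, 2 * n_bins, kernel_size=7, padding=3)
        self.register_buffer("window", torch.hann_window(n_fft))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames) features already upsampled in time by
        # the convolutional generator, so only a short FFT/hop remains.
        log_mag, phase = self.proj(x).chunk(2, dim=1)
        spec = torch.polar(log_mag.exp(), phase)   # complex spectrogram (B, bins, frames)
        return torch.istft(spec, n_fft=self.n_fft, hop_length=self.hop_length,
                           window=self.window)      # waveform (B, samples)


# Example: 100 feature frames yield roughly 100 * hop_length waveform samples.
head = ISTFTHead()
waveform = head(torch.randn(1, 128, 100))
print(waveform.shape)
```

Because the iSTFT only has to cover a small FFT size and hop length, the convolutional stack needs fewer and cheaper upsampling stages than a vocoder that computes the raw waveform directly, which is where the speed and size savings come from.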
Related papers
- Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis [1.4277428617774877]
We present Vocos, a new model that directly generates Fourier spectral coefficients.
It substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches.
arXiv Detail & Related papers (2023-06-01T15:40:32Z)
- Transform Once: Efficient Operator Learning in Frequency Domain [69.74509540521397]
We study deep neural networks designed to harness the structure in frequency domain for efficient learning of long-range correlations in space or time.
This work introduces a blueprint for frequency domain learning through a single transform: transform once (T1).
arXiv Detail & Related papers (2022-11-26T01:56:05Z)
- Defects of Convolutional Decoder Networks in Frequency Representation [34.70224140460288]
We prove the representation defects of a cascaded convolutional decoder network.
We apply the discrete Fourier transform to each channel of the feature map in an intermediate layer of the decoder network (see the sketch after this list).
arXiv Detail & Related papers (2022-10-17T12:42:29Z)
- NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction [79.13750275141139]
This paper proposes a novel and fast self-supervised solution for sparse-view CBCT reconstruction.
The desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network.
A learning-based encoder with hash coding is adopted to help the network capture high-frequency details.
arXiv Detail & Related papers (2022-09-29T04:06:00Z)
- Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder, with the help of the VQ-VAE, to map the text features extracted by the text encoder to a mel-spectrogram, and then uses the vocoder to transform the generated mel-spectrogram into a waveform.
arXiv Detail & Related papers (2022-07-20T15:41:47Z)
- FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis [77.06890315052563]
We propose FastLTS, a non-autoregressive end-to-end model that can directly synthesize high-quality speech audio from unconstrained talking videos with low latency.
Experiments show that our model achieves a $19.76\times$ speedup for audio generation compared with the current autoregressive model on input sequences of 3 seconds.
arXiv Detail & Related papers (2022-07-08T10:10:39Z)
- Learning Wave Propagation with Attention-Based Convolutional Recurrent Autoencoder Net [0.0]
We present an end-to-end attention-based convolutional recurrent autoencoder (AB-CRAN) network for data-driven modeling of wave propagation phenomena.
We employ a denoising-based convolutional autoencoder to encode the full-order snapshots given by time-dependent hyperbolic partial differential equations for wave propagation.
The attention-based sequence-to-sequence network increases the time-horizon of prediction by five times compared to the plain RNN-LSTM.
arXiv Detail & Related papers (2022-01-17T20:51:59Z)
- Hierarchical Timbre-Painting and Articulation Generation [92.59388372914265]
We present a fast and high-fidelity method for music generation, based on specified f0 and loudness.
The synthesized audio mimics the timbre and articulation of a target instrument.
arXiv Detail & Related papers (2020-08-30T05:27:39Z)
- Unsupervised Cross-Domain Speech-to-Speech Conversion with Time-Frequency Consistency [14.062850439230111]
We propose a condition encouraging spectrogram consistency during the adversarial training procedure.
Our experimental results on the Librispeech corpus show that the model trained with TF consistency yields perceptually better speech-to-speech conversion quality.
arXiv Detail & Related papers (2020-05-15T22:27:07Z)
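As referenced in the entry on decoder-network frequency defects above, applying the discrete Fourier transform to each channel of a feature map can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch example; the function name and tensor shapes are assumptions for illustration, not taken from that paper.

```python
# Hypothetical per-channel 2D DFT of an intermediate decoder feature map,
# illustrating the frequency analysis described in "Defects of Convolutional
# Decoder Networks in Frequency Representation". Names/shapes are assumptions.
import torch


def per_channel_spectrum(feature_map: torch.Tensor) -> torch.Tensor:
    """Magnitude spectrum of each channel of a (batch, channels, H, W) map."""
    # torch.fft.fft2 transforms the last two dimensions, i.e. each (H, W)
    # channel slice independently, which is a per-channel 2D DFT.
    return torch.fft.fft2(feature_map).abs()


# Example: energy at the lowest (DC) frequency of every channel.
feats = torch.randn(2, 8, 32, 32)            # a stand-in intermediate feature map
dc_energy = per_channel_spectrum(feats)[..., 0, 0]
print(dc_energy.shape)                       # torch.Size([2, 8])
```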
This list is automatically generated from the titles and abstracts of the papers on this site.