Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
- URL: http://arxiv.org/abs/2306.00814v3
- Date: Wed, 29 May 2024 14:21:47 GMT
- Title: Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
- Authors: Hubert Siuzdak
- Abstract summary: We present Vocos, a new model that directly generates Fourier spectral coefficients.
It substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches.
- Score: 1.4277428617774877
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. A Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception, and benefiting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state-of-the-art in audio quality, as demonstrated in our evaluations, but it also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced at https://github.com/gemelo-ai/vocos.
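To make the abstract's central claim concrete, here is a minimal sketch of the idea in PyTorch: the network predicts per-frame log-magnitudes and phases of the STFT, and a single inverse STFT turns them into a waveform, replacing the stack of transposed-convolution upsampling layers used by time-domain GAN vocoders. The ISTFTHead module, its sizes, and the exponential magnitude parameterization are illustrative assumptions, not the released gemelo-ai/vocos implementation.

```python
# Minimal sketch (assumes PyTorch); not the released gemelo-ai/vocos code.
# Idea: predict per-frame log-magnitude and phase, reconstruct audio with one ISTFT.
import torch
import torch.nn as nn

class ISTFTHead(nn.Module):
    """Hypothetical head mapping frame-rate backbone features to a waveform."""
    def __init__(self, dim: int, n_fft: int = 1024, hop_length: int = 256):
        super().__init__()
        self.n_fft, self.hop_length = n_fft, hop_length
        # (n_fft // 2 + 1) log-magnitudes plus the same number of phases
        self.proj = nn.Linear(dim, n_fft + 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) -> complex STFT coefficients -> waveform
        log_mag, phase = self.proj(x).chunk(2, dim=-1)
        spec = torch.exp(log_mag) * torch.exp(1j * phase)  # complex spectrogram
        spec = spec.transpose(1, 2)                        # (batch, freq_bins, frames)
        window = torch.hann_window(self.n_fft, device=x.device)
        return torch.istft(spec, n_fft=self.n_fft,
                           hop_length=self.hop_length, window=window)

features = torch.randn(1, 200, 512)      # 200 frames of backbone output (toy values)
waveform = ISTFTHead(dim=512)(features)  # (1, ~51k samples) with hop_length = 256
print(waveform.shape)
```

Because the model only ever operates at frame rate and the waveform is produced by a single fast inverse transform, the per-sample upsampling cost of time-domain vocoders disappears, which is the source of the efficiency gain the abstract reports.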
Related papers
- Resonate-and-Fire Spiking Neurons for Target Detection and Hand Gesture Recognition: A Hybrid Approach [0.8802544215891168]
Hand gesture recognition using radar often relies on computationally expensive fast Fourier transforms.
This paper proposes an alternative approach that bypasses fast Fourier transforms using resonate-and-fire neurons.
The proposed approach demonstrates competitive performance with reduced complexity compared to traditional methods.
arXiv Detail & Related papers (2024-05-22T14:40:02Z) - Transform Once: Efficient Operator Learning in Frequency Domain [69.74509540521397]
We study deep neural networks designed to harness the structure in frequency domain for efficient learning of long-range correlations in space or time.
This work introduces a blueprint for frequency domain learning through a single transform: transform once (T1).
arXiv Detail & Related papers (2022-11-26T01:56:05Z) - Neural Fourier Shift for Binaural Speech Rendering [16.957415282256758]
We present a neural network for rendering speech from given monaural audio, position, and orientation of the source.
We propose Neural Fourier Shift (NFS), a novel network architecture that enables speech rendering in the Fourier space.
arXiv Detail & Related papers (2022-11-02T04:55:09Z) - NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction [79.13750275141139]
This paper proposes a novel and fast self-supervised solution for sparse-view CBCT reconstruction.
The desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network.
A learning-based encoder entailing hash coding is adopted to help the network capture high-frequency details.
arXiv Detail & Related papers (2022-09-29T04:06:00Z) - SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
The noise shaping is applied in the time-frequency domain, keeping the computational cost almost the same as that of conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z) - Fourier Disentangled Space-Time Attention for Aerial Video Recognition [54.80846279175762]
We present an algorithm, Fourier Activity Recognition (FAR), for UAV video activity recognition.
Our formulation uses a novel Fourier object disentanglement method to innately separate out the human agent from the background.
We have evaluated our approach on multiple UAV datasets including UAV Human RGB, UAV Human Night, Drone Action, and NEC Drone.
arXiv Detail & Related papers (2022-03-21T01:24:53Z) - Functional Regularization for Reinforcement Learning via Learned Fourier Features [98.90474131452588]
We propose a simple architecture for deep reinforcement learning by embedding inputs into a learned Fourier basis (a generic sketch of this embedding appears after this list).
We show that it improves the sample efficiency of both state-based and image-based RL.
arXiv Detail & Related papers (2021-12-06T18:59:52Z) - Learning Frequency Domain Approximation for Binary Neural Networks [68.79904499480025]
We propose to estimate the gradient of sign function in the Fourier frequency domain using the combination of sine functions for training BNNs.
Experiments on several benchmark datasets and neural architectures show that binary networks learned with our method achieve state-of-the-art accuracy.
arXiv Detail & Related papers (2021-03-01T08:25:26Z) - DeepPhaseCut: Deep Relaxation in Phase for Unsupervised Fourier Phase Retrieval [31.380061715549584]
We propose a novel, unsupervised, feed-forward neural network for Fourier phase retrieval.
Unlike existing deep learning approaches that use a neural network as a regularization term or as an end-to-end black-box model for supervised training, our algorithm is a feed-forward neural network implementation of the PhaseCut algorithm in an unsupervised learning framework.
Our network is composed of two generators: one for phase estimation using the PhaseCut loss, followed by another for image reconstruction, all trained simultaneously in a CycleGAN framework without matched data.
arXiv Detail & Related papers (2020-11-20T16:10:08Z) - Frequency Gating: Improved Convolutional Neural Networks for Speech Enhancement in the Time-Frequency Domain [37.722450363816144]
We introduce a method, which we call Frequency Gating, to compute multiplicative weights for the kernels of the CNN.
Experiments with an autoencoder neural network with skip connections show that both local and frequency-wise gating outperform the baseline.
A loss function based on the extended short-time objective intelligibility score (ESTOI) is introduced, which we show to outperform the standard mean squared error (MSE) loss function.
arXiv Detail & Related papers (2020-11-08T22:04:00Z)
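The "Functional Regularization for Reinforcement Learning via Learned Fourier Features" entry above rests on one concrete mechanism: embedding the input into a learned Fourier basis before it reaches the policy and value networks. Below is a generic sketch of such an embedding; the module name, projection size, and scale are assumptions rather than the paper's released code.

```python
# Generic learned Fourier-feature embedding (assumes PyTorch); illustrative only.
import torch
import torch.nn as nn

class LearnedFourierFeatures(nn.Module):
    """Project inputs with a learnable matrix B, then take sines and cosines."""
    def __init__(self, in_dim: int, n_features: int, scale: float = 1.0):
        super().__init__()
        self.B = nn.Parameter(scale * torch.randn(in_dim, n_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        proj = 2 * torch.pi * (x @ self.B)  # (batch, n_features)
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

enc = LearnedFourierFeatures(in_dim=8, n_features=64)
print(enc(torch.randn(32, 8)).shape)  # (32, 128) features for the downstream RL network
```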
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.