BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network
- URL: http://arxiv.org/abs/2309.02836v2
- Date: Mon, 25 Mar 2024 03:17:30 GMT
- Title: BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network
- Authors: Takashi Shibuya, Yuhta Takida, Yuki Mitsufuji
- Abstract summary: Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time.
Most GANs fail to obtain the optimal projection for discriminating between real and fake data in the feature space.
We propose a scheme to modify least-squares GAN, which most GAN-based vocoders adopt, so that their loss functions satisfy the requirements of SAN.
- Score: 16.986061375767488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between real and fake data in the feature space. In the literature, it has been demonstrated that slicing adversarial network (SAN), an improved GAN training framework that can find the optimal projection, is effective in the image generation task. In this paper, we investigate the effectiveness of SAN in the vocoding task. For this purpose, we propose a scheme to modify least-squares GAN, which most GAN-based vocoders adopt, so that their loss functions satisfy the requirements of SAN. Through our experiments, we demonstrate that SAN can improve the performance of GAN-based vocoders, including BigVGAN, with small modifications. Our code is available at https://github.com/sony/bigvsan.
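For context, the least-squares GAN objective that the abstract says most GAN-based vocoders adopt can be sketched as below. This is the standard least-squares formulation only, not the SAN-modified loss proposed in the paper; the function names are hypothetical, and a real vocoder would compute these terms on tensors from several discriminators and combine them with other losses.

```python
from statistics import fmean


def lsgan_discriminator_loss(real_scores, fake_scores):
    """Standard least-squares GAN discriminator loss (illustrative sketch).

    Scores on real audio are pushed toward 1 and scores on generated
    audio toward 0, each with a squared-error penalty.
    """
    real_term = fmean([(s - 1.0) ** 2 for s in real_scores])
    fake_term = fmean([s ** 2 for s in fake_scores])
    return real_term + fake_term


def lsgan_generator_loss(fake_scores):
    """Standard least-squares GAN generator loss (illustrative sketch).

    The generator is rewarded when the discriminator scores its
    output close to 1, i.e. as if it were real.
    """
    return fmean([(s - 1.0) ** 2 for s in fake_scores])


if __name__ == "__main__":
    # Toy discriminator outputs; the values are made up for illustration.
    real = [0.9, 1.1, 1.0]
    fake = [0.2, -0.1, 0.0]
    print("discriminator loss:", lsgan_discriminator_loss(real, fake))
    print("generator loss:", lsgan_generator_loss(fake))
```

Both losses reach zero only when the discriminator outputs exactly the target values, which is what distinguishes the least-squares form from the original cross-entropy GAN loss.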
Related papers
- VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders [14.222389985736422]
VNet is a GAN-based neural vocoder network that incorporates full-band spectral information.
We demonstrate that the VNet model is capable of generating high-fidelity speech.
arXiv Detail & Related papers (2024-08-13T14:00:02Z)
- HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform [21.896817015593122]
We introduce an extension to iSTFTNet, termed HiFTNet, which incorporates a harmonic-plus-noise source filter in the time-frequency domain.
Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN.
Our work sets a new benchmark for efficient, high-quality neural vocoding, paving the way for real-time applications.
arXiv Detail & Related papers (2023-09-18T05:30:15Z)
- Complexity Matters: Rethinking the Latent Space for Generative Modeling [65.64763873078114]
In generative modeling, numerous successful approaches leverage a low-dimensional latent space, e.g., Stable Diffusion.
In this study, we aim to shed light on this under-explored topic by rethinking the latent space from the perspective of model complexity.
arXiv Detail & Related papers (2023-07-17T07:12:29Z)
- LD-GAN: Low-Dimensional Generative Adversarial Network for Spectral Image Generation with Variance Regularization [72.4394510913927]
Deep learning methods are state-of-the-art for spectral image (SI) computational tasks.
GANs enable diverse augmentation by learning and sampling from the data distribution.
GAN-based SI generation is challenging because the high dimensionality of this kind of data hinders the convergence of GAN training, yielding suboptimal generation.
We propose a statistical regularization to control the low-dimensional representation variance for the autoencoder training and to achieve high diversity of samples generated with the GAN.
arXiv Detail & Related papers (2023-04-29T00:25:02Z)
- SAN: Inducing Metrizability of GAN with Discriminative Normalized Linear Layer [20.667910240515482]
Generative adversarial networks (GANs) learn a target probability distribution by optimizing a generator and a discriminator with minimax objectives.
This paper addresses the question of whether such optimization actually provides the generator with gradients that make its distribution close to the target distribution.
We propose a novel GAN training scheme, called slicing adversarial network (SAN).
arXiv Detail & Related papers (2023-01-30T12:03:44Z)
- WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis [4.689359813220365]
We propose an effective and lightweight neural vocoder called WOLONet.
In this paper, we develop a novel lightweight block that uses a location-variable, channel-independent, and depthwise dynamic convolutional kernel with sinusoidally activated dynamic kernel weights.
The results show that our WOLONet achieves the best generation quality while requiring fewer parameters than the two neural SOTA vocoders, HiFiGAN and UnivNet.
arXiv Detail & Related papers (2022-06-20T17:58:52Z)
- BigVGAN: A Universal Neural Vocoder with Large-Scale Training [49.16254684584935]
We present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in a zero-shot setting.
We introduce periodic nonlinearities and anti-aliased representation into the generator, which brings the desired inductive bias for waveform.
We train our GAN vocoder at the largest scale up to 112M parameters, which is unprecedented in the literature.
arXiv Detail & Related papers (2022-06-09T17:56:10Z)
- Adversarial Neural Networks for Error Correcting Codes [76.70040964453638]
We introduce a general framework to boost the performance and applicability of machine learning (ML) models.
We propose to combine ML decoders with a competing discriminator network that tries to distinguish between codewords and noisy words.
Our framework is game-theoretic, motivated by generative adversarial networks (GANs).
arXiv Detail & Related papers (2021-12-21T19:14:44Z)
- RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses [15.599745604729842]
We propose RefineGAN, a high-fidelity neural vocoder with faster-than-real-time generation capability.
We employ a pitch-guided refine architecture with a multi-scale spectrogram-based loss function to help stabilize the training process.
We show that the fidelity is even improved during the waveform reconstruction by eliminating defects produced by the speaker.
arXiv Detail & Related papers (2021-11-01T14:12:54Z)
- Dynamic Neural Representational Decoders for High-Resolution Semantic Segmentation [98.05643473345474]
We propose a novel decoder, termed dynamic neural representational decoder (NRD).
As each location on the encoder's output corresponds to a local patch of the semantic labels, in this work, we represent these local patches of labels with compact neural networks.
This neural representation enables our decoder to leverage the smoothness prior in the semantic label space, and thus makes our decoder more efficient.
arXiv Detail & Related papers (2021-07-30T04:50:56Z)
- Variational Autoencoders: A Harmonic Perspective [79.49579654743341]
We study Variational Autoencoders (VAEs) from the perspective of harmonic analysis.
We show that the encoder variance of a VAE controls the frequency content of the functions parameterised by the VAE encoder and decoder neural networks.
arXiv Detail & Related papers (2021-05-31T10:39:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.