Related papers: A Generative Model for Raw Audio Using Transformer Architectures

URL: http://arxiv.org/abs/2106.16036v1
Date: Wed, 30 Jun 2021 13:05:31 GMT
Title: A Generative Model for Raw Audio Using Transformer Architectures
Authors: Prateek Verma, Chris Chafe
Abstract summary: This paper proposes a novel way of doing audio synthesis at the waveform level using Transformer architectures. We propose a deep neural network for generating waveforms, similar to wavenet \cite{oord2016wavenet}. Our approach outperforms a widely used wavenet architecture by up to 9% on a similar dataset for predicting the next step.
Score: 4.594159253008448
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper proposes a novel way of doing audio synthesis at the waveform level using Transformer architectures. We propose a deep neural network for generating waveforms, similar to wavenet \cite{oord2016wavenet}. This is fully probabilistic, auto-regressive, and causal, i.e. each sample generated depends only on the previously observed samples. Our approach outperforms a widely used wavenet architecture by up to 9\% on a similar dataset for predicting the next step. Using the attention mechanism, we enable the architecture to learn which audio samples are important for the prediction of the future sample. We show how causal transformer generative models can be used for raw waveform synthesis. We also show that this performance can be improved by another 2\% by conditioning samples over a wider context. The flexibility of the current model to synthesize audio from latent representations suggests a large number of potential applications. The novel approach of using generative transformer architectures for raw audio synthesis is, however, still far away from generating any meaningful music, without using latent codes/meta-data to aid the generation process.
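
The causal, auto-regressive property described in the abstract (each output depends only on the current and earlier samples) can be illustrated with a minimal single-head attention sketch. This is not the authors' implementation: the function name `causal_attention`, the plain-list representation, and the toy inputs are assumptions made for illustration only, and a real model would add learned projections, multiple heads, and a sampling step.

```python
import math


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Position t attends only to positions 0..t, so the output at t never
    depends on later samples. Inputs are lists of equal-length vectors.
    """
    dim = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Scores against the current and earlier positions only (causal mask).
        scores = [
            sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible positions.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * values[s][d] for s, w in enumerate(weights))
            for d in range(len(values[0]))
        ])
    return outputs


# Toy check: editing the last position must not change earlier outputs.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edited = [[1.0, 0.0], [0.0, 1.0], [5.0, -3.0]]
out_a = causal_attention(frames, frames, frames)
out_b = causal_attention(edited, edited, edited)
```

In this sketch the attention weights play the role the abstract describes: they indicate which earlier positions matter for the prediction at position t, while the mask guarantees that no future position contributes.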

Related papers

DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation [25.968115316199246]
This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform. Our model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one. Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems.
arXiv Detail & Related papers (2023-10-02T17:42:22Z)
From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations. These models are prone to generate audible artifacts when the conditioning is flawed or imperfect. We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
RAVE: A variational autoencoder for fast and high-quality neural audio synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis. We show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z)
WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis [80.60577805727624]
WaveGrad 2 is a non-autoregressive generative model for text-to-speech synthesis. It can generate high fidelity audio, approaching the performance of a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2021-06-17T17:09:21Z)
Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions [6.370905925442655]
We propose applying Transformer-based architectures without convolutional layers to raw audio signals. Our model outperforms convolutional models to produce state-of-the-art results. We further improve the performance of Transformer architectures by using techniques such as pooling, inspired by convolutional networks.
arXiv Detail & Related papers (2021-05-01T19:38:30Z)
End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs). Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech. We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
DiffWave: A Versatile Diffusion Model for Audio Synthesis [35.406438835268816]
DiffWave is a versatile diffusion probabilistic model for conditional and unconditional waveform generation. It produces high-fidelity audios in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram. It significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task.
arXiv Detail & Related papers (2020-09-21T11:20:38Z)
WaveGrad: Estimating Gradients for Waveform Generation [55.405580817560754]
WaveGrad is a conditional model for waveform generation which estimates gradients of the data density. It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram. We find that it can generate high fidelity audio samples using as few as six iterations.
arXiv Detail & Related papers (2020-09-02T17:44:10Z)
Unconditional Audio Generation with Generative Adversarial Networks and Cycle Regularization [48.55126268721948]
We present a generative adversarial network (GAN)-based model for unconditional generation of the mel-spectrograms of singing voices. We employ a hierarchical architecture in the generator to induce some structure in the temporal dimension. We evaluate the performance of the new model not only for generating singing voices, but also for generating speech voices.
arXiv Detail & Related papers (2020-05-18T08:35:16Z)
VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation. We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.