RAVE: A variational autoencoder for fast and high-quality neural audio synthesis
- URL: http://arxiv.org/abs/2111.05011v1
- Date: Tue, 9 Nov 2021 09:07:30 GMT
- Title: RAVE: A variational autoencoder for fast and high-quality neural audio synthesis
- Authors: Antoine Caillon and Philippe Esling
- Abstract summary: We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU.
- Score: 2.28438857884398
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep generative models applied to audio have improved the state of the
art in many speech- and music-related tasks by a large margin. However, as raw
waveform modelling remains an inherently difficult task, audio generative
models are either computationally intensive, rely on low sampling rates, are
complicated to control, or restrict the nature of possible signals. Among those
models, Variational AutoEncoders (VAE) give control over the generation by
exposing latent variables, although they usually suffer from low synthesis
quality. In this paper, we introduce a Realtime Audio Variational autoEncoder
(RAVE) allowing both fast and high-quality audio waveform synthesis. We
introduce a novel two-stage training procedure, namely representation learning
followed by adversarial fine-tuning. We show that a post-training analysis of the
latent space allows direct control over the trade-off between reconstruction
fidelity and representation compactness. By leveraging a multi-band decomposition
of the raw waveform, we show that our model is the first able to generate 48kHz
audio signals, while simultaneously running 20 times faster than real-time on a
standard laptop CPU. We evaluate synthesis quality using both quantitative and
qualitative subjective experiments and show the superiority of our approach
compared to existing models. Finally, we present applications of our model for
timbre transfer and signal compression. All of our source code and audio
examples are publicly available.
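
The abstract names three technical ingredients: a two-stage training schedule (representation learning, then adversarial fine-tuning), a post-training analysis of the latent space, and a multi-band decomposition of the waveform. As a rough illustration of the first ingredient only, here is a minimal PyTorch sketch of such a schedule; the modules, loss weights, and the toy spectral distance are illustrative stand-ins rather than the paper's actual architecture, and the multi-band (PQMF-style) decomposition is omitted.

```python
import torch
import torch.nn.functional as F

def spectral_distance(x, y, fft_sizes=(512, 1024, 2048)):
    # Toy multiscale log-magnitude spectral distance between waveforms (B, T).
    loss = 0.0
    for n_fft in fft_sizes:
        win = torch.hann_window(n_fft, device=x.device)
        X = torch.stft(x, n_fft, window=win, return_complex=True).abs()
        Y = torch.stft(y, n_fft, window=win, return_complex=True).abs()
        loss = loss + F.l1_loss(torch.log1p(X), torch.log1p(Y))
    return loss

def train_step(encoder, decoder, discriminator, x, stage, opt):
    mean, logvar = encoder(x)                                # q(z|x)
    z = mean + torch.randn_like(mean) * (0.5 * logvar).exp()
    x_hat = decoder(z)
    if stage == 1:   # stage 1: representation learning (VAE objective)
        kl = -0.5 * (1 + logvar - mean ** 2 - logvar.exp()).mean()
        loss = spectral_distance(x, x_hat) + 0.1 * kl        # 0.1 is arbitrary
    else:            # stage 2: adversarial fine-tuning of the decoder only
        # The encoder is frozen before stage 2, e.g. encoder.requires_grad_(False);
        # the discriminator's own update step is omitted here.
        loss = spectral_distance(x, x_hat) - discriminator(x_hat).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss)
```

Freezing the encoder in stage 2 keeps the learned representation fixed while the adversarial loss sharpens synthesis quality, which is what lets the same latent space serve both reconstruction and generation.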
Related papers
- Autoregressive Diffusion Transformer for Text-to-Speech Synthesis [39.32761051774537]
We propose encoding audio as vector sequences in continuous space $\mathbb{R}^d$ and autoregressively generating these sequences.
High-bitrate continuous speech representation enables almost flawless reconstruction, allowing our model to achieve nearly perfect speech editing.
arXiv Detail & Related papers (2024-06-08T18:57:13Z)
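
Since this summary hinges on autoregressive generation over continuous vectors rather than discrete tokens, here is a deliberately simplified sketch of that idea: a causal Transformer trained with teacher forcing and a plain MSE regression head. The paper itself decodes each vector with a diffusion-based head, so the head, dimensions, and architecture here are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContinuousAR(nn.Module):
    def __init__(self, d=64, width=256, layers=4, heads=4):
        super().__init__()
        self.proj_in = nn.Linear(d, width)
        block = nn.TransformerEncoderLayer(width, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.proj_out = nn.Linear(width, d)

    def forward(self, x):                         # x: (B, T, d) latent vectors
        T = x.size(1)
        # Additive causal mask: -inf above the diagonal blocks future positions.
        causal = torch.full((T, T), float("-inf"), device=x.device).triu(1)
        h = self.backbone(self.proj_in(x), mask=causal)
        return self.proj_out(h)                   # prediction of the next vector

model = ContinuousAR()
seq = torch.randn(2, 16, 64)                      # e.g. autoencoder latents
loss = nn.functional.mse_loss(model(seq[:, :-1]), seq[:, 1:])  # teacher forcing
```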
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
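
Rectified flow matching, which the Frieren summary relies on, reduces to a simple regression objective: interpolate linearly between noise and data and regress the network onto the constant velocity of that straight-line path. The sketch below shows the loss and a basic Euler sampler; velocity_net and the video conditioning cond are placeholders, not the paper's components.

```python
import torch

def rectified_flow_loss(velocity_net, x1, cond):
    x0 = torch.randn_like(x1)                     # noise endpoint
    t = torch.rand(x1.size(0), *[1] * (x1.dim() - 1), device=x1.device)
    xt = (1 - t) * x0 + t * x1                    # straight-line interpolant
    target = x1 - x0                              # constant velocity along it
    pred = velocity_net(xt, t.flatten(), cond)    # model predicts velocity
    return torch.nn.functional.mse_loss(pred, target)

@torch.no_grad()
def sample(velocity_net, cond, shape, steps=25, device="cpu"):
    x = torch.randn(shape, device=device)
    for i in range(steps):                        # simple Euler ODE solver
        t = torch.full((shape[0],), i / steps, device=device)
        x = x + velocity_net(x, t, cond) / steps
    return x
```

Because the target paths are straight, few Euler steps already give usable samples, which is where the efficiency claim comes from.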
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
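
As a loose illustration of the multi-band diffusion idea above, the sketch below applies a standard DDPM noise-prediction loss independently per frequency band, conditioned on the discrete tokens. The band decomposition itself (bands is assumed already decomposed), the noise schedule, and the one-model-per-band split are simplifying assumptions rather than the paper's exact design.

```python
import torch

def multiband_diffusion_loss(eps_nets, bands, tokens, alphas_cumprod):
    """bands: (B, n_bands, T) band-decomposed waveform; tokens: conditioning."""
    losses = []
    for b, net in enumerate(eps_nets):            # independent model per band
        x0 = bands[:, b]
        t = torch.randint(0, len(alphas_cumprod), (x0.size(0),))
        a = alphas_cumprod[t].unsqueeze(-1)       # (B, 1) cumulative noise level
        eps = torch.randn_like(x0)
        xt = a.sqrt() * x0 + (1 - a).sqrt() * eps # forward noising process
        pred = net(xt, t, tokens)                 # predict the injected noise
        losses.append(torch.nn.functional.mse_loss(pred, eps))
    return sum(losses)
```

Treating bands separately stops artifacts in one frequency range from corrupting the denoising of the others, which is the motivation the title points at.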
- Conditional variational autoencoder to improve neural audio synthesis for polyphonic music sound [4.002298833349517]
The realtime audio variational autoencoder (RAVE) method was developed for high-quality audio waveform synthesis.
We propose an enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer.
The proposed model shows significant performance and stability improvements over the conventional RAVE model.
arXiv Detail & Related papers (2022-11-16T07:11:56Z)
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed up training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
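
The codec summary above mentions a quantized latent space trained end-to-end; a common realization in this line of work is residual vector quantization (RVQ), where each stage quantizes the residual left by the previous one. A small sketch, with illustrative codebook handling and a straight-through gradient estimator:

```python
import torch

def rvq(z, codebooks):
    """z: (B, D) latents; codebooks: list of (K, D) tensors."""
    residual, quantized, indices = z, torch.zeros_like(z), []
    for cb in codebooks:
        d = torch.cdist(residual, cb)             # (B, K) pairwise distances
        idx = d.argmin(dim=1)
        q = cb[idx]                               # nearest code per vector
        quantized = quantized + q
        residual = residual - q                   # next stage sees the residual
        indices.append(idx)
    # Straight-through: gradients flow to z as if quantization were identity.
    quantized = z + (quantized - z).detach()
    return quantized, indices
```

Dropping trailing codebooks at inference time lowers the bitrate gracefully, which is one reason this kind of quantizer suits variable-bitrate streaming codecs.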
- FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis [77.06890315052563]
We propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audio from unconstrained talking videos with low latency.
Experiments show that our model achieves a $19.76\times$ speedup for audio generation compared with the current autoregressive model on input sequences of 3 seconds.
arXiv Detail & Related papers (2022-07-08T10:10:39Z)
- Streamable Neural Audio Synthesis With Non-Causal Convolutions [1.8275108630751844]
We introduce a new method that allows non-causal convolutional models to be made streamable.
This makes any convolutional model compatible with real-time buffer-based processing.
We show how our method can be adapted to fit complex architectures with parallel branches.
arXiv Detail & Related papers (2022-04-14T16:00:32Z)
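
The streaming entry above turns on one mechanism: replacing zero padding with a cache of the previous buffer's tail, so that buffer-by-buffer processing matches offline processing. A minimal sketch for a single convolution; stride handling and the parallel branches the paper addresses are omitted.

```python
import torch
import torch.nn as nn

class StreamingConv1d(nn.Module):
    def __init__(self, cin, cout, kernel_size):
        super().__init__()
        self.conv = nn.Conv1d(cin, cout, kernel_size)   # no internal padding
        self.pad = kernel_size - 1                      # left context needed
        self.cache = None

    def forward(self, x):                       # x: (B, C, T), one audio buffer
        if self.cache is None:                  # first buffer: zero left context
            self.cache = x.new_zeros(x.size(0), x.size(1), self.pad)
        x = torch.cat([self.cache, x], dim=-1)
        self.cache = x[..., -self.pad:].detach()  # keep tail for the next call
        return self.conv(x)                     # output length equals input length

conv = StreamingConv1d(1, 8, kernel_size=5)
chunks = [conv(torch.randn(1, 1, 512)) for _ in range(3)]  # buffer-by-buffer
```

Concatenating the chunk outputs reproduces what a causal offline convolution would produce on the concatenated input, which is the property that makes buffer-based real-time processing possible.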
- WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis [80.60577805727624]
WaveGrad 2 is a non-autoregressive generative model for text-to-speech synthesis.
It can generate high fidelity audio, approaching the performance of a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2021-06-17T17:09:21Z)
- VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)
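
Both CVAE-based entries above (the polyphonic RAVE variant and VaPar Synth) share the same skeleton: the encoder and decoder are conditioned on a control variable such as pitch, which can then be set freely at generation time. A minimal sketch, assuming a one-hot pitch condition and a plain feature vector in place of VaPar Synth's parametric representation:

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, feat=128, n_pitch=88, z=16, h=256):
        super().__init__()
        # Both networks see the condition, so the latent is free to model
        # everything about the sound except pitch.
        self.enc = nn.Sequential(nn.Linear(feat + n_pitch, h), nn.ReLU(),
                                 nn.Linear(h, 2 * z))
        self.dec = nn.Sequential(nn.Linear(z + n_pitch, h), nn.ReLU(),
                                 nn.Linear(h, feat))

    def forward(self, x, pitch_onehot):
        mean, logvar = self.enc(torch.cat([x, pitch_onehot], -1)).chunk(2, -1)
        zs = mean + torch.randn_like(mean) * (0.5 * logvar).exp()
        x_hat = self.dec(torch.cat([zs, pitch_onehot], -1))
        kl = -0.5 * (1 + logvar - mean ** 2 - logvar.exp()).mean()
        return x_hat, kl  # train with reconstruction loss + weighted KL
```

At generation time, sampling zs from the prior and swapping pitch_onehot gives the flexible pitch control the VaPar Synth summary describes.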