Conditional variational autoencoder to improve neural audio synthesis
for polyphonic music sound
- URL: http://arxiv.org/abs/2211.08715v1
- Date: Wed, 16 Nov 2022 07:11:56 GMT
- Authors: Seokjin Lee, Minhan Kim, Seunghyeon Shin, Daeho Lee, Inseon Jang, and
Wootaek Lim
- Abstract summary: The realtime audio variational autoencoder (RAVE) method was developed for high-quality audio waveform synthesis.
We propose an enhanced RAVE model with a conditional variational autoencoder structure and an additional fully-connected layer.
The proposed model achieves greater improvements in performance and stability than the conventional RAVE model.
- Score: 4.002298833349517
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Deep generative models for audio synthesis have recently been significantly
improved. However, modeling raw waveforms remains a difficult problem,
especially for audio waveforms and music signals. Recently, the realtime audio
variational autoencoder (RAVE) method was developed for high-quality audio
waveform synthesis. The RAVE method is based on the variational autoencoder and
utilizes a two-stage training strategy. Unfortunately, the RAVE model is limited
in reproducing wide-pitch polyphonic music sound. Therefore, to enhance the
reconstruction performance, we adopt pitch activation data as auxiliary
information for the RAVE model. To handle the auxiliary information, we propose
an enhanced RAVE model with a conditional variational autoencoder structure and
an additional fully-connected layer. To evaluate the proposed structure, we
conducted a listening experiment based on multiple stimulus tests with hidden
references and an anchor (MUSHRA) on the MAESTRO dataset. The obtained results
indicate that the proposed model achieves greater improvements in performance
and stability than the conventional RAVE model.
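The abstract describes conditioning a variational autoencoder on pitch activation data through an additional fully-connected layer. A minimal numpy sketch of that conditioning path is shown below; the dimensions, weight shapes, and the single tanh FC layer are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not taken from the paper):
# latent size, pitch-activation bins (e.g. 88 piano keys), frame length.
LATENT, PITCH_BINS, FRAME = 16, 88, 256

def linear(x, w, b):
    """Fully-connected layer: x @ w + b."""
    return x @ w + b

# Randomly initialized weights stand in for trained parameters.
enc_w, enc_b = rng.normal(size=(FRAME, 2 * LATENT)) * 0.01, np.zeros(2 * LATENT)
cond_w, cond_b = rng.normal(size=(LATENT + PITCH_BINS, LATENT)) * 0.01, np.zeros(LATENT)
dec_w, dec_b = rng.normal(size=(LATENT, FRAME)) * 0.01, np.zeros(FRAME)

def cvae_forward(frame, pitch_activation):
    # Encoder: map the audio frame to posterior mean and log-variance.
    stats = linear(frame, enc_w, enc_b)
    mu, logvar = stats[:LATENT], stats[LATENT:]
    # Reparameterization trick: sample z from N(mu, sigma^2).
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=LATENT)
    # Conditioning: concatenate the pitch activation with z and pass the
    # result through the additional fully-connected layer.
    z_cond = np.tanh(linear(np.concatenate([z, pitch_activation]), cond_w, cond_b))
    # Decoder: reconstruct the frame from the conditioned latent.
    return linear(z_cond, dec_w, dec_b)

frame = rng.normal(size=FRAME)
pitch = np.zeros(PITCH_BINS)
pitch[40] = 1.0  # one active pitch
recon = cvae_forward(frame, pitch)
print(recon.shape)  # (256,)
```

The point of the sketch is only the data flow: the pitch activation enters after the latent sample, so the decoder always sees both the latent code and the pitch information.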
Related papers
- Model and Deep learning based Dynamic Range Compression Inversion [12.002024727237837]
Inverting DRC can help to restore the original dynamics to produce new mixes and/or to improve the overall quality of the audio signal.
We propose a model-based approach with neural networks for DRC inversion.
Our results show the effectiveness and robustness of the proposed method in comparison to several state-of-the-art methods.
arXiv Detail & Related papers (2024-11-07T00:33:07Z)
- SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- An investigation of the reconstruction capacity of stacked convolutional autoencoders for log-mel-spectrograms [2.3204178451683264]
In audio processing applications, the generation of expressive sounds from high-level representations is in high demand.
Modern algorithms, such as neural networks, have inspired the development of expressive synthesizers based on musical instrument compression.
This study investigates the use of stacked convolutional autoencoders for the compression of time-frequency audio representations for a variety of instruments for a single pitch.
arXiv Detail & Related papers (2023-01-18T17:19:04Z)
- Synthetic Wave-Geometric Impulse Responses for Improved Speech Dereverberation [69.1351513309953]
We show that accurately simulating the low-frequency components of Room Impulse Responses (RIRs) is important to achieving good dereverberation.
We demonstrate that speech dereverberation models trained on hybrid synthetic RIRs outperform models trained on RIRs generated by prior geometric ray tracing methods.
arXiv Detail & Related papers (2022-12-10T20:15:23Z)
- RAVE: A variational autoencoder for fast and high-quality neural audio synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48 kHz audio signals while running 20 times faster than real-time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement (DiffuSE) model that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis [80.60577805727624]
WaveGrad 2 is a non-autoregressive generative model for text-to-speech synthesis.
It can generate high fidelity audio, approaching the performance of a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2021-06-17T17:09:21Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
- VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.