End-To-End Dilated Variational Autoencoder with Bottleneck
Discriminative Loss for Sound Morphing -- A Preliminary Study
- URL: http://arxiv.org/abs/2011.09744v1
- Date: Thu, 19 Nov 2020 09:47:13 GMT
- Title: End-To-End Dilated Variational Autoencoder with Bottleneck
Discriminative Loss for Sound Morphing -- A Preliminary Study
- Authors: Matteo Lionello and Hendrik Purwins
- Abstract summary: We present a preliminary study on an end-to-end variational autoencoder (VAE) for sound morphing.
Two VAE variants are compared: a VAE with dilation layers (DC-VAE) and a VAE with only regular convolutional layers (CC-VAE).
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a preliminary study on an end-to-end variational autoencoder (VAE)
for sound morphing. Two VAE variants are compared: a VAE with dilation layers
(DC-VAE) and a VAE with only regular convolutional layers (CC-VAE). We combine
the following loss functions: 1) the time-domain mean-squared error for
reconstructing the input signal, 2) the Kullback-Leibler divergence to the
standard normal distribution in the bottleneck layer, and 3) the classification
loss calculated from the bottleneck representation. On a database of spoken
digits, we use 1-nearest neighbor classification to show that the sound classes
separate in the bottleneck layer. We introduce the Mel-frequency cepstrum
coefficient dynamic time warping (MFCC-DTW) deviation as a measure of how well
the VAE decoder projects the class center in the latent (bottleneck) layer to
the center of the sounds of that class in the audio domain. In terms of
MFCC-DTW deviation and 1-NN classification, DC-VAE outperforms CC-VAE. These
results for our parametrization and our dataset indicate that DC-VAE is more
suitable for sound morphing than CC-VAE, since the DC-VAE decoder better
preserves the topology when mapping from the latent space to the audio domain.
Examples are given both for morphing spoken digits and drum sounds.
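To make the architecture and the combined objective concrete, here is a minimal PyTorch sketch, not the authors' code: a 1-D convolutional VAE on raw waveforms whose encoder uses dilation layers (the DC-VAE flavor; setting every dilation to 1 gives the CC-VAE counterpart), trained with the three losses listed above. All layer sizes, the 16-dimensional latent, and the loss weights beta and gamma are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedVAE(nn.Module):
    """Sketch of a DC-VAE-style model: dilated 1-D conv encoder,
    Gaussian bottleneck, transposed-conv decoder, class head."""
    def __init__(self, latent_dim=16, n_classes=10):
        super().__init__()
        # Encoder: growing dilation widens the receptive field over raw audio.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 32, 9, stride=4, dilation=1, padding=4), nn.ReLU(),
            nn.Conv1d(32, 64, 9, stride=4, dilation=2, padding=8), nn.ReLU(),
            nn.Conv1d(64, 64, 9, stride=4, dilation=4, padding=16), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.to_mu = nn.Linear(64, latent_dim)
        self.to_logvar = nn.Linear(64, latent_dim)
        # Classification head on the bottleneck (loss term 3).
        self.classifier = nn.Linear(latent_dim, n_classes)
        # Decoder: latent code back to a 2048-sample waveform.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 32),
            nn.Unflatten(1, (64, 32)),
            nn.ConvTranspose1d(64, 64, 8, stride=4, padding=2), nn.ReLU(),
            nn.ConvTranspose1d(64, 32, 8, stride=4, padding=2), nn.ReLU(),
            nn.ConvTranspose1d(32, 1, 8, stride=4, padding=2),
        )

    def forward(self, x):                      # x: (batch, 1, 2048)
        h = self.encoder(x).squeeze(-1)        # (batch, 64)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.decoder(z), mu, logvar, self.classifier(mu)

def vae_loss(x, x_hat, mu, logvar, logits, labels, beta=1.0, gamma=1.0):
    rec = F.mse_loss(x_hat, x)                                     # 1) time-domain MSE
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # 2) KL to N(0, I)
    cls = F.cross_entropy(logits, labels)                          # 3) bottleneck classification
    return rec + beta * kl + gamma * cls

model = DilatedVAE()
x = torch.randn(8, 1, 2048)                    # batch of dummy waveforms
y = torch.randint(0, 10, (8,))                 # dummy digit labels
x_hat, mu, logvar, logits = model(x)
print(vae_loss(x, x_hat, mu, logvar, logits, y))
```

Morphing then amounts to decoding a linear interpolation (1 - a) * z1 + a * z2 of two clips' bottleneck codes; the classification term pushes the classes apart in that space, which is what the 1-NN probe below checks.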
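The 1-nearest-neighbor probe of class separation in the bottleneck is easy to reproduce. A minimal scikit-learn sketch follows, with random placeholder arrays standing in for the encoder means and the digit labels of the train and test clips:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholders: in practice, z_* are the bottleneck means of each clip.
rng = np.random.default_rng(0)
z_train, y_train = rng.normal(size=(900, 16)), rng.integers(0, 10, 900)
z_test, y_test = rng.normal(size=(100, 16)), rng.integers(0, 10, 100)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(z_train, y_train)
print("1-NN bottleneck accuracy:", knn.score(z_test, y_test))
```

High accuracy here indicates that same-class clips cluster together in the latent space.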
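The paper's exact formula for the MFCC-DTW deviation is not reproduced here; the librosa sketch below is one plausible reading of the description above: decode the class center from the latent layer, extract MFCCs for the decoded signal and for each real clip of the class, align each pair by DTW, and average the path-normalized alignment cost. The 8 kHz sample rate, 13 coefficients, and per-step normalization are assumptions.

```python
import numpy as np
import librosa

def mfcc_dtw_cost(y_a, y_b, sr=8000, n_mfcc=13):
    """DTW alignment cost between the MFCC sequences of two signals,
    normalized by the length of the warping path."""
    m_a = librosa.feature.mfcc(y=y_a, sr=sr, n_mfcc=n_mfcc)
    m_b = librosa.feature.mfcc(y=y_b, sr=sr, n_mfcc=n_mfcc)
    D, wp = librosa.sequence.dtw(X=m_a, Y=m_b, metric="euclidean")
    return D[-1, -1] / len(wp)

def mfcc_dtw_deviation(decoded_center, class_clips, sr=8000):
    """Average MFCC-DTW cost from the decoded latent class center
    to every real audio clip of that class."""
    return float(np.mean([mfcc_dtw_cost(decoded_center, y, sr) for y in class_clips]))

# Demo with noise stand-ins for the decoded center and two class clips.
rng = np.random.default_rng(0)
center = rng.standard_normal(8000).astype(np.float32)
clips = [rng.standard_normal(8000).astype(np.float32) for _ in range(2)]
print(mfcc_dtw_deviation(center, clips))
```

A lower deviation means the decoded latent center lands closer, in MFCC-DTW terms, to the actual sounds of the class, which is the sense in which DC-VAE outperforms CC-VAE above.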
Related papers
- Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion [93.32354378820648]
We introduce MVSD, a mutual learning framework based on diffusion models.
MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks.
Our framework can improve the performance of the reverberator and dereverberator.
arXiv Detail & Related papers (2024-07-15T00:47:56Z)
- Audio classification with Dilated Convolution with Learnable Spacings [10.89964981012741]
Dilated convolution with learnable spacings (DCLS) is a recent convolution method in which the positions of the kernel elements are learned throughout training by backpropagation.
Here we show that DCLS is also useful for audio tagging using the AudioSet classification benchmark.
arXiv Detail & Related papers (2023-09-25T09:09:54Z)
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed-up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
- Chain-based Discriminative Autoencoders for Speech Recognition [16.21321835306968]
We propose three new versions of a discriminative autoencoder (DcAE) for speech recognition.
First, we use a new objective function that considers both categorical cross-entropy and mutual information between the ground-truth and predicted triphone-state sequences.
For robust speech recognition, we extend the chain-based model (c-DcAE) to hierarchical and parallel structures, resulting in hc-DcAE and pc-DcAE.
arXiv Detail & Related papers (2022-03-25T14:51:48Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Conditional Deep Hierarchical Variational Autoencoder for Voice Conversion [5.538544897623972]
Variational autoencoder-based voice conversion (VAE-VC) has the advantage of requiring only speech samples and speaker labels for training.
This paper investigates the benefits and impacts of increasing model expressiveness on VAE-VC.
arXiv Detail & Related papers (2021-12-06T05:54:11Z)
- Consistency Regularization for Variational Auto-Encoders [14.423556966548544]
Variational auto-encoders (VAEs) are a powerful approach to unsupervised learning.
We propose a regularization method to enforce consistency in VAEs.
arXiv Detail & Related papers (2021-05-31T10:26:32Z)
- Diffusion-Based Representation Learning [65.55681678004038]
We augment the denoising score matching framework to enable representation learning without any supervised signal.
The proposed diffusion-based representation learning relies on a new formulation of the denoising score matching objective.
Using the same approach, we propose to learn an infinite-dimensional latent code that achieves improvements of state-of-the-art models on semi-supervised image classification.
arXiv Detail & Related papers (2021-05-29T09:26:02Z)
- Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement [26.596930749375474]
We introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time.
We derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal.
arXiv Detail & Related papers (2021-02-08T11:45:02Z)
- Autoencoding Variational Autoencoder [56.05008520271406]
We study the implications of a VAE failing to consistently encode samples generated by its own decoder, for both the learned representations and the consequences of fixing this behaviour by introducing a notion of self-consistency.
We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
arXiv Detail & Related papers (2020-12-07T14:16:14Z)
- Conditioning Trick for Training Stable GANs [70.15099665710336]
We propose a conditioning trick, called difference departure from normality, applied on the generator network in response to instability issues during GAN training.
We force the generator to get closer to the departure-from-normality function of real samples, computed in the spectral domain of the Schur decomposition.
arXiv Detail & Related papers (2020-10-12T16:50:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.