A weighted-variance variational autoencoder model for speech enhancement
- URL: http://arxiv.org/abs/2211.00990v2
- Date: Thu, 26 Oct 2023 11:47:25 GMT
- Title: A weighted-variance variational autoencoder model for speech enhancement
- Authors: Ali Golmakani (MULTISPEECH), Mostafa Sadeghi (MULTISPEECH), Xavier
Alameda-Pineda (ROBOTLEARN), Romain Serizel (MULTISPEECH)
- Abstract summary: We propose a weighted-variance generative model, where the contribution of each spectrogram time frame to parameter learning is weighted.
We develop efficient training and speech enhancement algorithms based on the proposed generative model.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address speech enhancement based on variational autoencoders, which
involves learning a speech prior distribution in the time-frequency (TF)
domain. A zero-mean complex-valued Gaussian distribution is usually assumed for
the generative model, where the speech information is encoded in the variance
as a function of a latent variable. In contrast to this commonly used approach,
we propose a weighted-variance generative model, where the contribution of each
spectrogram time frame to parameter learning is weighted. We impose a Gamma
prior distribution on the weights, which effectively leads to a Student's
t-distribution instead of a Gaussian for speech generative modeling. We develop
efficient training and speech enhancement algorithms based on the proposed
generative model. Our experimental results on spectrogram auto-encoding and
speech enhancement demonstrate the effectiveness and robustness of the proposed
approach compared to the standard unweighted variance model.
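To make the modeling assumption concrete: a Gamma-distributed weight acting on a Gaussian's precision is the classic scale mixture whose marginal is a Student's t. Under one natural reading of the abstract (the notation below is ours, with sigma^2 standing for the decoder variance of a single TF bin and s for the complex spectrogram coefficient), the marginalization works out as

```latex
\int_0^\infty \mathcal{N}_c\!\left(s;\, 0,\, \frac{\sigma^2}{w}\right)
\mathcal{G}(w;\, \alpha, \beta)\,\mathrm{d}w
= \frac{\alpha\,\beta^{\alpha}}{\pi\,\sigma^{2}}
\left(\beta + \frac{|s|^{2}}{\sigma^{2}}\right)^{-(\alpha+1)},
```

a (complex) Student's t density with 2*alpha degrees of freedom, whose heavier tails make the model less sensitive to outlier frames than the plain Gaussian. A minimal sketch of the resulting negative log-likelihood, with hypothetical names and shapes and with `var` standing in for the decoder output sigma_f^2(z_t), might look like:

```python
import numpy as np

# Minimal sketch (not the authors' code): negative log-likelihood of the
# Gamma-weighted ("Student's t") spectrogram model derived above, summed over
# all TF bins. Names, shapes, and hyperparameters are illustrative assumptions.
def weighted_variance_nll(spec, var, alpha=2.0, beta=2.0):
    """spec: complex STFT coefficients, shape (F, T); var: positive variances, (F, T)."""
    power = np.abs(spec) ** 2
    # -log of the marginal density: (alpha+1)*log(beta + |s|^2/sigma^2)
    # + log(pi*sigma^2) - log(alpha) - alpha*log(beta), per bin.
    return np.sum(
        (alpha + 1.0) * np.log(beta + power / var)
        + np.log(np.pi * var)
        - np.log(alpha) - alpha * np.log(beta)
    )

# Toy usage with a random spectrogram and random decoder variances.
rng = np.random.default_rng(0)
spec = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
var = np.exp(rng.standard_normal((257, 100)))
print(weighted_variance_nll(spec, var))
```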
Related papers
- Disentanglement with Factor Quantized Variational Autoencoders [11.086500036180222]
We propose a discrete variational autoencoder (VAE)-based model in which ground-truth information about the generative factors is not provided to the model.
We demonstrate the advantages of learning discrete representations over learning continuous representations in facilitating disentanglement.
Our method, called FactorQVAE, is the first to combine optimization-based disentanglement approaches with discrete representation learning.
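For background, the discrete-representation step that such models build on is plain vector quantization: each encoder output is snapped to its nearest codebook entry. The sketch below is a generic illustration with hypothetical names, not FactorQVAE itself.

```python
import numpy as np

# Generic vector-quantization step (background illustration only): map each
# continuous encoder output to the index of its nearest codebook entry.
def quantize(z_e, codebook):
    """z_e: (N, D) encoder outputs; codebook: (K, D). Returns indices and quantized z."""
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K) sq. distances
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

rng = np.random.default_rng(0)
codebook = rng.standard_normal((8, 4))   # K=8 discrete codes of dimension D=4
z_e = rng.standard_normal((5, 4))
idx, z_q = quantize(z_e, codebook)
print(idx)
```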
arXiv Detail & Related papers (2024-09-23T09:33:53Z)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
- Unsupervised speech enhancement with diffusion-based generative models [0.0]
We introduce an alternative approach that operates in an unsupervised manner, leveraging the generative power of diffusion models.
We develop a posterior sampling methodology for speech enhancement by combining the learnt clean speech prior with a noise model for speech signal inference.
We show promising results compared to a recent variational auto-encoder (VAE)-based unsupervised approach and a state-of-the-art diffusion-based supervised method.
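As a rough illustration of posterior sampling with a generative prior (not the paper's actual algorithm), the sketch below runs Langevin dynamics whose drift combines a prior score with the gradient of a Gaussian noise-model likelihood; the standard-normal prior and additive noise model are toy stand-ins for the learned models.

```python
import numpy as np

# Toy posterior-sampling sketch (illustrative, not the paper's method): Langevin
# dynamics targeting p(x | y), whose drift adds a prior score to a likelihood score.
# The clean-speech prior is replaced by a standard normal (score = -x), and the
# noise model is y = x + n with n ~ N(0, noise_var * I).
def langevin_posterior_sample(y, noise_var=0.1, step=1e-2, n_steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(y.shape)        # random initialization
    for _ in range(n_steps):
        prior_score = -x                    # grad_x log N(x; 0, I)
        lik_score = (y - x) / noise_var     # grad_x log N(y; x, noise_var * I)
        x = x + step * (prior_score + lik_score) \
              + np.sqrt(2.0 * step) * rng.standard_normal(y.shape)
    return x

y = np.random.default_rng(1).standard_normal(16)   # toy "noisy observation"
print(langevin_posterior_sample(y)[:4])
```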
arXiv Detail & Related papers (2023-09-19T09:11:31Z)
- Dior-CVAE: Pre-trained Language Models and Diffusion Priors for Variational Dialog Generation [70.2283756542824]
Dior-CVAE is a hierarchical conditional variational autoencoder (CVAE) that uses diffusion priors.
We employ a diffusion model to make the prior distribution more expressive and more compatible with the distributions produced by a pretrained language model (PLM).
Experiments across two commonly used open-domain dialog datasets show that our method can generate more diverse responses without large-scale dialog pre-training.
arXiv Detail & Related papers (2023-05-24T11:06:52Z)
- Fast and efficient speech enhancement with variational autoencoders [0.0]
Unsupervised speech enhancement based on variational autoencoders has shown promising performance compared with the commonly used supervised methods.
We propose a new approach based on Langevin dynamics that generates multiple sequences of samples and uses a total-variation regularization to capture the temporal correlations of the latent vectors.
Our experiments demonstrate that the developed framework makes an effective compromise between computational efficiency and enhancement quality, and outperforms existing methods.
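A minimal sketch of Langevin sampling over a latent sequence with a total-variation (TV) penalty is given below; it is an illustration under our own assumptions, with a standard-normal prior score standing in for the VAE's learned latent model.

```python
import numpy as np

# Illustrative sketch (not the authors' code): Langevin dynamics over a sequence of
# latent vectors z_1..z_T with a TV penalty coupling adjacent frames.
def tv_subgrad(z):
    """Subgradient of sum_t ||z_t - z_{t-1}||_1 with respect to the sequence z (T, D)."""
    s = np.sign(z[1:] - z[:-1])
    g = np.zeros_like(z)
    g[1:] += s
    g[:-1] -= s
    return g

def langevin_tv(z0, lam=0.1, step=1e-2, n_steps=500, seed=0):
    rng = np.random.default_rng(seed)
    z = z0.copy()
    for _ in range(n_steps):
        score = -z - lam * tv_subgrad(z)    # grad log N(z; 0, I) minus TV penalty
        z = z + step * score + np.sqrt(2.0 * step) * rng.standard_normal(z.shape)
    return z

z0 = np.random.default_rng(1).standard_normal((100, 16))  # T=100 frames, D=16 latents
print(np.abs(np.diff(langevin_tv(z0), axis=0)).mean())    # mean |z_t - z_{t-1}| after sampling
```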
arXiv Detail & Related papers (2022-11-02T09:52:13Z)
- An Energy-Based Prior for Generative Saliency [62.79775297611203]
We propose a novel generative saliency prediction framework that adopts an informative energy-based model as a prior distribution.
With the generative saliency model, we can obtain a pixel-wise uncertainty map from an image, indicating model confidence in the saliency prediction.
Experimental results show that our generative saliency model with an energy-based prior can achieve not only accurate saliency predictions but also reliable uncertainty maps consistent with human perception.
arXiv Detail & Related papers (2022-04-19T10:51:00Z)
- A Sparsity-promoting Dictionary Model for Variational Autoencoders [16.61511959679188]
Structuring the latent space in deep generative models is important to yield more expressive models and interpretable representations.
We propose a simple yet effective methodology to structure the latent space via a sparsity-promoting dictionary model.
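One way to picture a sparsity-promoting dictionary model (an illustration under our own assumptions, not the paper's exact formulation): a latent vector z is represented as a dictionary times a sparse code, with the code found by soft-thresholding iterations (ISTA).

```python
import numpy as np

# Illustrative sketch: represent a latent vector z as D @ a with a sparse code a,
# found by ISTA (proximal gradient on 0.5*||z - D a||^2 + lam*||a||_1).
def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista_sparse_code(z, D, lam=0.1, n_iters=200):
    step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1/L, with L the Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iters):
        a = soft_threshold(a - step * D.T @ (D @ a - z), step * lam)
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((16, 64))                            # overcomplete dictionary
z = D @ (rng.standard_normal(64) * (rng.random(64) < 0.1))   # latent with a sparse code
a = ista_sparse_code(z, D)
print(np.count_nonzero(np.abs(a) > 1e-6), "active atoms")
```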
arXiv Detail & Related papers (2022-03-29T17:13:11Z)
- Conditional Diffusion Probabilistic Model for Speech Enhancement [101.4893074984667]
We propose a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes.
In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models.
arXiv Detail & Related papers (2022-02-10T18:58:01Z)
- Deep Variational Generative Models for Audio-visual Speech Separation [33.227204390773316]
We propose an unsupervised technique based on audio-visual generative modeling of clean speech.
To better utilize the visual information, the posteriors of the latent variables are inferred from mixed speech.
Our experiments show that the proposed unsupervised VAE-based method yields better separation performance than NMF-based approaches.
arXiv Detail & Related papers (2020-08-17T10:12:33Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
When paired with a strong auto-regressive decoder, VAEs tend to ignore the latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
- Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior [53.69310441063162]
This paper proposes a sequential prior in a discrete latent space, which can generate more natural-sounding samples.
We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes.
arXiv Detail & Related papers (2020-02-06T12:35:50Z)