Unsupervised Speech Enhancement using Dynamical Variational
Auto-Encoders
- URL: http://arxiv.org/abs/2106.12271v1
- Date: Wed, 23 Jun 2021 09:48:38 GMT
- Title: Unsupervised Speech Enhancement using Dynamical Variational
Auto-Encoders
- Authors: Xiaoyu Bie, Simon Leglaive, Xavier Alameda-Pineda, Laurent Girin
- Abstract summary: Dynamical variational auto-encoders (DVAEs) are a class of deep generative models with latent variables.
We propose an unsupervised speech enhancement algorithm based on the most general form of DVAEs.
We derive a variational expectation-maximization algorithm to perform speech enhancement.
- Score: 29.796695365217893
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dynamical variational auto-encoders (DVAEs) are a class of deep generative
models with latent variables, dedicated to time series data modeling. DVAEs can
be considered as extensions of the variational autoencoder (VAE) that include
the modeling of temporal dependencies between successive observed and/or latent
vectors in data sequences. Previous work has demonstrated the value of DVAEs
and their superior performance over the VAE for speech signal (spectrogram)
modeling. Independently, the VAE has been successfully applied to speech
enhancement in noise, in an unsupervised noise-agnostic set-up that does not
require the use of a parallel dataset of clean and noisy speech samples for
training, but only requires clean speech signals. In this paper, we extend
those works to DVAE-based single-channel unsupervised speech enhancement,
thereby exploiting both the unsupervised representation learning and the
dynamics modeling of speech signals. We propose an unsupervised speech
enhancement algorithm based on the most general form of DVAE, which we then
adapt to three specific DVAE models to illustrate the versatility of the
framework. More precisely, we
combine DVAE-based speech priors with a noise model based on nonnegative matrix
factorization, and we derive a variational expectation-maximization (VEM)
algorithm to perform speech enhancement. Experimental results show that the
proposed approach based on DVAEs outperforms its VAE counterpart and a
supervised speech enhancement baseline.
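To make the modeling assumptions above concrete, here is a minimal
point-estimate sketch of the VAE/NMF-style enhancement scheme the abstract
describes, working on the short-time Fourier transform (STFT) of the noisy
mixture. It is not the paper's VEM algorithm: the actual method infers the
DVAE latent variables with a variational posterior, whereas this sketch
assumes the clean-speech variance v_s (e.g., the output of a pretrained
(D)VAE decoder) is given and fixed, and only re-estimates the NMF noise
parameters with multiplicative updates. All function and variable names are
illustrative, not taken from the paper.

```python
# A simplified sketch of (D)VAE speech prior + NMF noise model enhancement.
# NOT the authors' VEM algorithm: v_s is treated as fixed here, and only the
# NMF noise parameters W, H are re-estimated. Names are illustrative.
import numpy as np

def enhance(X, v_s, n_components=8, n_iter=50, eps=1e-8, seed=0):
    """X: complex STFT of the noisy mixture, shape (F, N).
    v_s: clean-speech PSD from a (D)VAE speech prior, shape (F, N)."""
    rng = np.random.default_rng(seed)
    F, N = X.shape
    V_x = np.abs(X) ** 2                     # observed noisy power spectrogram
    W = rng.random((F, n_components)) + eps  # NMF noise spectral patterns
    H = rng.random((n_components, N)) + eps  # NMF noise activations

    for _ in range(n_iter):
        V = v_s + W @ H                      # total variance of the mixture
        # Multiplicative updates for the Itakura-Saito-style objective
        # sum(V_x / V + log V), keeping the speech variance v_s fixed.
        W *= ((V_x / V**2) @ H.T) / ((1.0 / V) @ H.T + eps)
        V = v_s + W @ H
        H *= (W.T @ (V_x / V**2)) / (W.T @ (1.0 / V) + eps)

    # Wiener-like posterior mean of the clean speech STFT coefficients.
    V_b = W @ H
    return (v_s / (v_s + V_b + eps)) * X

# Toy usage with random data standing in for a real STFT and speech prior.
F, N = 257, 100
X = (np.random.randn(F, N) + 1j * np.random.randn(F, N)) / np.sqrt(2)
v_s = np.abs(np.random.randn(F, N)) + 1e-3  # stand-in for a DVAE decoder output
S_hat = enhance(X, v_s)
print(S_hat.shape)  # (257, 100)
```

In the full VEM algorithm, the E-step instead updates an approximate posterior
over the DVAE latent sequence (so v_s itself is re-estimated each iteration),
and the M-step updates W and H using posterior statistics rather than the
point estimates used above.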
Related papers
- Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction [88.65168366064061]
We introduce Discrete Denoising Posterior Prediction (DDPP), a novel framework that casts the task of steering pre-trained MDMs as a problem of probabilistic inference.
Our framework leads to a family of three novel objectives that are all simulation-free, and thus scalable.
We substantiate our designs via wet-lab validation, where we observe transient expression of reward-optimized protein sequences.
arXiv Detail & Related papers (2024-10-10T17:18:30Z)
- Unsupervised speech enhancement with deep dynamical generative speech and noise models [26.051535142743166]
This work builds on a previous work on unsupervised speech enhancement using a dynamical variational autoencoder (DVAE) as the clean speech model and non-negative matrix factorization (NMF) as the noise model.
We propose to replace the NMF noise model with a deep dynamical generative model (DDGM) depending either on the DVAE latent variables, or on the noisy observations, or on both.
arXiv Detail & Related papers (2023-06-13T14:52:35Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information while simultaneously performing lightweight domain adaptation.
We show that these components can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- Speech Modeling with a Hierarchical Transformer Dynamical VAE [23.847366888695266]
We propose to model speech signals with the Hierarchical Transformer DVAE (HiT-DVAE).
We show that HiT-DVAE outperforms several other DVAEs for speech spectrogram modeling, while enabling a simpler training procedure.
arXiv Detail & Related papers (2023-03-07T13:35:45Z)
- Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition [66.94463981654216]
We propose prompt tuning methods for Deep Neural Networks (DNNs) to achieve speaker-adaptive Visual Speech Recognition (VSR).
We finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters.
The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
arXiv Detail & Related papers (2023-02-16T06:01:31Z)
- Audio-visual speech enhancement with a deep Kalman filter generative model [0.0]
We present an audiovisual deep Kalman filter (AV-DKF) generative model which assumes a first-order Markov chain model for the latent variables.
We develop an efficient inference methodology to estimate speech signals at test time.
arXiv Detail & Related papers (2022-11-02T09:50:08Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement [26.596930749375474]
We introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time.
We derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal.
arXiv Detail & Related papers (2021-02-08T11:45:02Z)
- Deep Variational Generative Models for Audio-visual Speech Separation [33.227204390773316]
We propose an unsupervised technique based on audio-visual generative modeling of clean speech.
To better utilize the visual information, the posteriors of the latent variables are inferred from mixed speech.
Our experiments show that the proposed unsupervised VAE-based method yields better separation performance than NMF-based approaches.
arXiv Detail & Related papers (2020-08-17T10:12:33Z)
- Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior [53.69310441063162]
This paper proposes a sequential prior in a discrete latent space which can generate more naturally sounding samples.
We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes.
arXiv Detail & Related papers (2020-02-06T12:35:50Z)