Disentanglement Learning for Variational Autoencoders Applied to
Audio-Visual Speech Enhancement
- URL: http://arxiv.org/abs/2105.08970v1
- Date: Wed, 19 May 2021 07:42:14 GMT
- Title: Disentanglement Learning for Variational Autoencoders Applied to
Audio-Visual Speech Enhancement
- Authors: Guillaume Carbajal, Julius Richter, Timo Gerkmann
- Abstract summary: We propose an adversarial training scheme for variational autoencoders to disentangle the label from the other latent variables.
We show the benefit of the proposed disentanglement learning when a voice activity label, estimated from visual data, is used for speech enhancement.
- Score: 20.28217079480463
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the standard variational autoencoder has been successfully used to
learn a probabilistic prior over speech signals, which is then used to perform
speech enhancement. Variational autoencoders have then been conditioned on a
label describing a high-level speech attribute (e.g. speech activity) that
allows for a more explicit control of speech generation. However, the label is
not guaranteed to be disentangled from the other latent variables, which
results in limited performance improvements compared to the standard
variational autoencoder. In this work, we propose to use an adversarial
training scheme for variational autoencoders to disentangle the label from the
other latent variables. At training, we use a discriminator that competes with
the encoder of the variational autoencoder. Simultaneously, we also use an
additional encoder that estimates the label for the decoder of the variational
autoencoder, which proves to be crucial to learn disentanglement. We show the
benefit of the proposed disentanglement learning when a voice activity label,
estimated from visual data, is used for speech enhancement.
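The adversarial scheme described in the abstract can be illustrated with a minimal sketch. This is a generic adversarial disentanglement objective, not necessarily the authors' exact formulation: it assumes a binary label (e.g. voice activity), a diagonal Gaussian posterior with a standard normal prior, and a hypothetical `weight` parameter balancing the two terms. All function names are illustrative. The additional encoder that estimates the label for the decoder is not modelled here.

```python
import math


def binary_cross_entropy(p, y, eps=1e-7):
    """Cross-entropy between a predicted probability p and a binary label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def gaussian_kl(mu, log_var):
    """KL(N(mu, exp(log_var)) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def discriminator_loss(p_label_from_z, y):
    """The discriminator tries to recover the label y from the latent z."""
    return binary_cross_entropy(p_label_from_z, y)


def encoder_decoder_loss(recon_error, mu, log_var, p_label_from_z, y, weight=1.0):
    """VAE loss minus the weighted discriminator loss (assumed form).

    The encoder is rewarded when the discriminator fails, which pushes
    label information out of the latent variables.
    """
    vae_loss = recon_error + gaussian_kl(mu, log_var)
    return vae_loss - weight * discriminator_loss(p_label_from_z, y)


# Same reconstruction error and posterior; only the discriminator's success differs.
mu, log_var = [0.0, 0.0], [0.0, 0.0]
at_chance = encoder_decoder_loss(1.0, mu, log_var, p_label_from_z=0.5, y=1)
leaky = encoder_decoder_loss(1.0, mu, log_var, p_label_from_z=0.9, y=1)
print(at_chance < leaky)
```

In this sketch the encoder's loss is lower when the discriminator can only guess the label at chance level than when it predicts the label confidently from the latent variables, which is the competition between encoder and discriminator that the abstract refers to.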
Related papers
- Hold Me Tight: Stable Encoder-Decoder Design for Speech Enhancement [1.4037575966075835]
1-D filters on raw audio are hard to train and often suffer from instabilities.
We address these problems with hybrid solutions, combining theory-driven and data-driven approaches.
arXiv Detail & Related papers (2024-08-30T15:49:31Z)
- Towards General-Purpose Text-Instruction-Guided Voice Conversion [84.78206348045428]
This paper introduces a novel voice conversion model, guided by text instructions such as "articulate slowly with a deep tone" or "speak in a cheerful boyish voice".
The proposed VC model is a neural language model which processes a sequence of discrete codes, resulting in the code sequence of converted speech.
arXiv Detail & Related papers (2023-09-25T17:52:09Z)
- RepCodec: A Speech Representation Codec for Speech Tokenization [21.60885344868044]
RepCodec is a novel representation for semantic speech tokenization.
We show that RepCodec significantly outperforms the widely used k-means clustering approach in both speech understanding and generation.
arXiv Detail & Related papers (2023-08-31T23:26:10Z)
- Disentangled Feature Learning for Real-Time Neural Speech Coding [24.751813940000993]
In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding.
We find that the learned disentangled features show comparable performance on any-to-any voice conversion with modern self-supervised speech representation learning models.
arXiv Detail & Related papers (2022-11-22T02:50:12Z)
- Introducing Semantics into Speech Encoders [91.37001512418111]
We propose an unsupervised way of incorporating semantic information from large language models into self-supervised speech encoders without labeled audio transcriptions.
Our approach achieves performance similar to that of supervised methods trained on over 100 hours of labeled audio transcripts.
arXiv Detail & Related papers (2022-11-15T18:44:28Z)
- Linguistic-Enhanced Transformer with CTC Embedding for Speech Recognition [29.1423215212174]
The recent emergence of the joint CTC-Attention model has brought significant improvements in automatic speech recognition (ASR).
We propose a linguistic-enhanced transformer, which introduces refined CTC information to the decoder during training.
Experiments on the AISHELL-1 speech corpus show a relative character error rate (CER) reduction of up to 7%.
arXiv Detail & Related papers (2022-10-25T08:12:59Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can reduce the word error rate (WER) by a relative 19.2% compared with the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Adversarial Neural Networks for Error Correcting Codes [76.70040964453638]
We introduce a general framework to boost the performance and applicability of machine learning (ML) models.
We propose to combine ML decoders with a competing discriminator network that tries to distinguish between codewords and noisy words.
Our framework is game-theoretic, motivated by generative adversarial networks (GANs).
arXiv Detail & Related papers (2021-12-21T19:14:44Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
- Guided Variational Autoencoder for Speech Enhancement With a Supervised Classifier [20.28217079480463]
We propose to guide the variational autoencoder with a supervised classifier separately trained on noisy speech.
The estimated label is a high-level categorical variable describing the speech signal.
We evaluate our method with different types of labels on real recordings of different noisy environments.
arXiv Detail & Related papers (2021-02-12T11:32:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.