A Comparison of Discrete Latent Variable Models for Speech
Representation Learning
- URL: http://arxiv.org/abs/2010.14230v1
- Date: Sat, 24 Oct 2020 01:22:14 GMT
- Title: A Comparison of Discrete Latent Variable Models for Speech
Representation Learning
- Authors: Henry Zhou, Alexei Baevski and Michael Auli
- Abstract summary: This paper presents a comparison of two different approaches which are broadly based on predicting future time-steps or auto-encoding the input signal.
Results show that future time-step prediction with vq-wav2vec achieves better performance.
- Score: 46.52258734975676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural latent variable models enable the discovery of interesting structure
in speech audio data. This paper presents a comparison of two different
approaches which are broadly based on predicting future time-steps or
auto-encoding the input signal. Our study compares the representations learned
by vq-vae and vq-wav2vec in terms of sub-word unit discovery and phoneme
recognition performance. Results show that future time-step prediction with
vq-wav2vec achieves better performance. The best system achieves an error rate
of 13.22 on the ZeroSpeech 2019 ABX phoneme discrimination challenge.
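Both systems compared here discretize encoder features through a codebook lookup. Below is a minimal, illustrative PyTorch sketch of such a vector-quantization bottleneck (nearest codebook entry plus a straight-through gradient); it is not the authors' implementation, and the codebook size, feature dimensionality, and commitment weight are assumed values.

```python
# Hedged sketch of a VQ bottleneck in the spirit of vq-vae / the k-means variant
# of vq-wav2vec: each frame's continuous feature is replaced by its nearest
# codebook entry; a straight-through estimator lets gradients reach the encoder.
# Hyperparameters below are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 320, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)  # inventory of discrete units
        self.beta = beta                              # weight of the commitment term

    def forward(self, z):                      # z: (batch, frames, dim)
        flat = z.reshape(-1, z.size(-1))       # (batch*frames, dim)
        # squared Euclidean distance to every codebook entry
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)                  # one discrete unit per frame
        q = self.codebook(idx).view_as(z)      # quantized features
        # VQ-VAE style losses: pull codes toward encoder outputs and vice versa
        loss = ((q - z.detach()).pow(2).mean()
                + self.beta * (z - q.detach()).pow(2).mean())
        q = z + (q - z).detach()               # straight-through estimator
        return q, idx.view(z.shape[:-1]), loss


# Usage: quantize a batch of 100-frame encoder outputs
vq = VectorQuantizer()
q, units, loss = vq(torch.randn(8, 100, 256))
print(units.shape, loss.item())                # discrete units: (8, 100)
```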
Related papers
- Learning Phone Recognition from Unpaired Audio and Phone Sequences Based
on Generative Adversarial Network [58.82343017711883]
This paper investigates how to learn directly from unpaired phone sequences and speech utterances.
GAN training is adopted in the first stage to find the mapping between unpaired speech and phone sequences.
In the second stage, an HMM is introduced and trained on the generator's output, which further boosts performance.
arXiv Detail & Related papers (2022-07-29T09:29:28Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Self-supervised Learning with Random-projection Quantizer for Speech Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals, in the form of discrete labels.
It achieves word error rates similar to those of previous self-supervised learning work using non-streaming models.
arXiv Detail & Related papers (2022-02-03T21:29:04Z)
- DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021 [31.750875486806184]
This paper describes the Microsoft end-to-end neural text to speech (TTS) system: DelightfulTTS for Blizzard Challenge 2021.
The goal of this challenge is to synthesize natural and high-quality speech from text, and we approach this goal from two perspectives.
arXiv Detail & Related papers (2021-10-25T02:47:59Z)
- Voice2Series: Reprogramming Acoustic Models for Time Series Classification [65.94154001167608]
Voice2Series is a novel end-to-end approach that reprograms acoustic models for time series classification.
We show that V2S either outperforms or is tied with state-of-the-art methods on 20 tasks, and improves their average accuracy by 1.84%.
arXiv Detail & Related papers (2021-06-17T07:59:15Z)
- Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement [26.596930749375474]
We introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time.
We derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal.
arXiv Detail & Related papers (2021-02-08T11:45:02Z)
- Deep Variational Generative Models for Audio-visual Speech Separation [33.227204390773316]
We propose an unsupervised technique based on audio-visual generative modeling of clean speech.
To better utilize the visual information, the posteriors of the latent variables are inferred from mixed speech.
Our experiments show that the proposed unsupervised VAE-based method yields better separation performance than NMF-based approaches.
arXiv Detail & Related papers (2020-08-17T10:12:33Z)
- Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge [26.114011076658237]
We propose two neural models to tackle the problem of learning discrete representations of speech.
The first model is a type of vector-quantized variational autoencoder (VQ-VAE).
The second model combines vector quantization with contrastive predictive coding (VQ-CPC).
We evaluate the models on English and Indonesian data for the ZeroSpeech 2020 challenge.
arXiv Detail & Related papers (2020-05-19T13:06:17Z)
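As a rough illustration of the "future time-step prediction" objective behind vq-wav2vec (and the VQ-CPC model in the entry above), the sketch below scores each context vector against the true quantized frame a few steps ahead and against randomly sampled distractor frames, an InfoNCE-style contrastive loss. It is a simplified, assumed formulation rather than code from any of the papers listed; the shapes, prediction offset, and number of negatives are illustrative.

```python
# Hedged sketch (PyTorch, illustrative only) of future time-step prediction:
# a context vector at time t is trained to pick out the true quantized frame
# k steps ahead from a set of distractors. Hyperparameters are assumptions.
import torch
import torch.nn.functional as F


def cpc_loss(context, quantized, k: int = 2, num_negatives: int = 10):
    """context, quantized: (batch, frames, dim); predict quantized[t+k] from context[t]."""
    b, t, d = context.shape
    c = context[:, : t - k]        # predictions made at each usable time step
    pos = quantized[:, k:]         # true future frames (positives)
    # sample distractor frames uniformly from the same utterance
    # (may occasionally hit the positive; acceptable for a sketch)
    neg_idx = torch.randint(0, t, (b, t - k, num_negatives))
    neg = torch.gather(
        quantized.unsqueeze(2).expand(b, t, num_negatives, d),
        1,
        neg_idx.unsqueeze(-1).expand(-1, -1, -1, d),
    )                                                       # (b, t-k, negatives, d)
    cand = torch.cat([pos.unsqueeze(2), neg], dim=2)        # positive sits at index 0
    logits = (c.unsqueeze(2) * cand).sum(-1)                # dot-product scores
    target = torch.zeros(b, t - k, dtype=torch.long)        # correct class is index 0
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target.reshape(-1))


# Usage with random stand-ins for context vectors and quantized future frames
loss = cpc_loss(torch.randn(4, 50, 256), torch.randn(4, 50, 256))
print(loss.item())
```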