Vector-quantized neural networks for acoustic unit discovery in the
ZeroSpeech 2020 challenge
- URL: http://arxiv.org/abs/2005.09409v2
- Date: Wed, 19 Aug 2020 12:41:55 GMT
- Title: Vector-quantized neural networks for acoustic unit discovery in the
ZeroSpeech 2020 challenge
- Authors: Benjamin van Niekerk, Leanne Nortje, Herman Kamper
- Abstract summary: We propose two neural models to tackle the problem of learning discrete representations of speech.
The first model is a type of vector-quantized variational autoencoder (VQ-VAE).
The second model combines vector quantization with contrastive predictive coding (VQ-CPC).
We evaluate the models on English and Indonesian data for the ZeroSpeech 2020 challenge.
- Score: 26.114011076658237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we explore vector quantization for acoustic unit discovery.
Leveraging unlabelled data, we aim to learn discrete representations of speech
that separate phonetic content from speaker-specific details. We propose two
neural models to tackle this challenge - both use vector quantization to map
continuous features to a finite set of codes. The first model is a type of
vector-quantized variational autoencoder (VQ-VAE). The VQ-VAE encodes speech
into a sequence of discrete units before reconstructing the audio waveform. Our
second model combines vector quantization with contrastive predictive coding
(VQ-CPC). The idea is to learn a representation of speech by predicting future
acoustic units. We evaluate the models on English and Indonesian data for the
ZeroSpeech 2020 challenge. In ABX phone discrimination tests, both models
outperform all submissions to the 2019 and 2020 challenges, with a relative
improvement of more than 30%. The models also perform competitively on a
downstream voice conversion task. Of the two, VQ-CPC performs slightly better
in general and is simpler and faster to train. Finally, probing experiments
show that vector quantization is an effective bottleneck, forcing the models to
discard speaker information.
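Both models hinge on the same discretization step: continuous encoder frames are snapped to the nearest entry of a learned codebook, and only the code indices survive as discrete acoustic units. The sketch below shows one common way such a vector-quantization bottleneck is implemented; it is a minimal illustration, not the authors' released code, and the class name, codebook size (512), feature dimension (64), and loss weighting are assumed for the example.

```python
# Illustrative VQ bottleneck (assumed hyperparameters, not the paper's exact setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=64, commitment_cost=0.25):
        super().__init__()
        # Learnable codebook: one embedding vector per discrete acoustic unit.
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.commitment_cost = commitment_cost

    def forward(self, z_e):
        # z_e: (batch, time, code_dim) continuous encoder output.
        flat = z_e.reshape(-1, z_e.size(-1))
        # Squared Euclidean distance from every frame to every code.
        distances = (flat.pow(2).sum(1, keepdim=True)
                     - 2 * flat @ self.codebook.weight.t()
                     + self.codebook.weight.pow(2).sum(1))
        indices = distances.argmin(dim=-1).view(z_e.shape[:-1])   # discrete units
        z_q = self.codebook(indices)                               # quantized frames

        # Pull codes toward encoder outputs and keep the encoder committed to them.
        vq_loss = (F.mse_loss(z_q, z_e.detach())
                   + self.commitment_cost * F.mse_loss(z_e, z_q.detach()))

        # Straight-through estimator: gradients skip the non-differentiable argmin.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices, vq_loss

# Example: 8 utterances, 100 frames each, 64-dimensional features.
quantizer = VectorQuantizer()
z_q, units, loss = quantizer(torch.randn(8, 100, 64))
print(units.shape)   # torch.Size([8, 100]) -> one discrete unit per frame
```

In a VQ-VAE these quantized frames would feed a waveform decoder, while in VQ-CPC the code sequence becomes the target of a contrastive future-prediction objective; either way, downstream components see only the finite set of codes.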
Related papers
- WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [65.30937248905958]
A crucial component of language models is the tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens.
We introduce WavTokenizer, which offers several advantages over previous SOTA acoustic models in the audio domain.
WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
arXiv Detail & Related papers (2024-08-29T13:43:36Z)
- Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech [50.95292368372455]
We propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized variational autoencoder (VQ-VAE); a rough sketch of this quantization-error idea appears after this list.
The training of VQ-VAE relies on clean speech; hence, large quantization errors can be expected when the speech is distorted.
We found that the vector quantization mechanism could also be used for self-supervised speech enhancement (SE) model training.
arXiv Detail & Related papers (2024-02-26T06:01:38Z)
- Self-supervised Learning with Random-projection Quantizer for Speech Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals, in the form of discrete labels.
It achieves word error rates similar to those of previous self-supervised learning work with non-streaming models.
arXiv Detail & Related papers (2022-02-03T21:29:04Z)
- A Comparison of Discrete Latent Variable Models for Speech Representation Learning [46.52258734975676]
This paper presents a comparison of two different approaches which are broadly based on predicting future time-steps or auto-encoding the input signal.
Results show that future time-step prediction with vq-wav2vec achieves better performance.
arXiv Detail & Related papers (2020-10-24T01:22:14Z)
- Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations [49.55361944105796]
We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence framework.
A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker.
arXiv Detail & Related papers (2020-10-23T08:34:52Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
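To make the quantization-error idea referenced in the VQScore entry above concrete, the snippet below scores speech frames by their distance to the nearest entry of a codebook trained on clean speech; a larger average error suggests the input lies further from the clean-speech manifold. The function name and the generic codebook tensor are assumptions for illustration, not the VQScore authors' implementation.

```python
# Hypothetical quality proxy based on VQ quantization error (illustrative only).
import torch

def quantization_error(frames: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """frames: (batch, time, dim); codebook: (num_codes, dim) trained on clean speech."""
    flat = frames.reshape(-1, frames.size(-1))
    distances = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ codebook.t()
                 + codebook.pow(2).sum(1))
    # Distance to the nearest code, averaged over all frames of the utterance;
    # higher values are expected for distorted speech.
    return distances.min(dim=-1).values.mean()
```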