Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies
- URL: http://arxiv.org/abs/2011.00406v1
- Date: Sun, 1 Nov 2020 02:48:37 GMT
- Title: Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies
- Authors: Alexander H. Liu, Yu-An Chung, James Glass
- Abstract summary: We propose Non-Autoregressive Predictive Coding (NPC), a self-supervised method to learn a speech representation in a non-autoregressive manner.
NPC has a conceptually simple objective and can be implemented easily with the introduced Masked Convolution Blocks.
We show that the NPC representation is comparable to other methods in speech experiments on phonetic and speaker classification while being more efficient.
- Score: 91.92060221982064
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised speech representations have been shown to be effective in a
variety of speech applications. However, existing representation learning
methods generally rely on the autoregressive model and/or observed global
dependencies while generating the representation. In this work, we propose
Non-Autoregressive Predictive Coding (NPC), a self-supervised method, to learn
a speech representation in a non-autoregressive manner by relying only on local
dependencies of speech. NPC has a conceptually simple objective and can be
implemented easily with the introduced Masked Convolution Blocks. NPC offers a
significant speedup for inference since it is parallelizable in time and has a
fixed inference time for each time step regardless of the input sequence
length. We discuss and verify the effectiveness of NPC by theoretically and
empirically comparing it with other methods. We show that the NPC
representation is comparable to other methods in speech experiments on phonetic
and speaker classification while being more efficient.
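The abstract does not give implementation details beyond naming the Masked Convolution Block, so the following is only a minimal PyTorch sketch of how such a block and a frame-local, non-autoregressive objective could look; the kernel size, mask width, frame-wise predictor, and L1 reconstruction loss are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv1d(nn.Conv1d):
    """1-D convolution whose central taps are zeroed out, so the output at time t
    sees only nearby frames and never the frame it is asked to predict.
    Hypothetical sketch of a 'Masked Convolution Block', not the authors' exact layer."""

    def __init__(self, channels, kernel_size=15, mask_size=5):
        super().__init__(channels, channels, kernel_size, padding=kernel_size // 2)
        mask = torch.ones_like(self.weight[:1, :1])           # shape (1, 1, kernel_size)
        center, half = kernel_size // 2, mask_size // 2
        mask[..., center - half:center + half + 1] = 0.0      # hide frame t and its closest neighbours
        self.register_buffer("mask", mask)

    def forward(self, x):                                     # x: (batch, channels, time)
        return F.conv1d(x, self.weight * self.mask, self.bias, padding=self.padding)

# Assumed NPC-style objective: reconstruct every input frame from its masked local
# context with an L1 loss; all time steps are handled independently, so they can be
# computed in parallel with a fixed cost per step.
feat_dim, batch_size, n_frames = 80, 4, 200
encoder = MaskedConv1d(feat_dim)
predictor = nn.Conv1d(feat_dim, feat_dim, kernel_size=1)      # frame-wise projection back to the feature space
x = torch.randn(batch_size, feat_dim, n_frames)               # e.g. a batch of log-mel features
loss = (predictor(torch.relu(encoder(x))) - x).abs().mean()
loss.backward()
```

Because each representation depends only on a fixed-size local window rather than on all previous outputs, the per-step cost stays constant regardless of sequence length, which is the source of the inference speedup claimed in the abstract.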
Related papers
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
- dMel: Speech Tokenization made Simple [19.169460770473908]
We show that discretizing mel-filterbank channels into discrete intensity bins produces a simple representation (dMel); a minimal sketch of this binning idea appears after this list.
Our results demonstrate the effectiveness of dMel in achieving high performance on both tasks within a unified framework.
arXiv Detail & Related papers (2024-07-22T17:51:53Z)
- DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning [59.4644086610381]
We propose a novel denoising objective that approaches the problem from another perspective, i.e., the intra-sentence perspective.
By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form.
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
arXiv Detail & Related papers (2024-01-24T17:48:45Z)
- DPL: Decoupled Prompt Learning for Vision-Language Models [41.90997623029582]
We propose a new method, Decoupled Prompt Learning, which reformulates the attention in prompt learning to alleviate this problem.
Our approach is flexible for both visual and textual modalities, making it easily extendable to multi-modal prompt learning.
arXiv Detail & Related papers (2023-08-19T15:48:38Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Autoregressive Co-Training for Learning Discrete Speech Representations [19.400428010647573]
We consider a generative model with discrete latent variables that learns a discrete representation for speech.
We find that the proposed approach learns discrete representation that is highly correlated with phonetic units.
arXiv Detail & Related papers (2022-03-29T18:17:18Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces during training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- DirectProbe: Studying Representations without Classifiers [21.23284793831221]
DirectProbe studies the geometry of a representation by building upon the notion of a version space for a task.
Experiments with several linguistic tasks and contextualized embeddings show that, even without training classifiers, DirectProbe can shine light into how an embedding space represents labels.
arXiv Detail & Related papers (2021-04-13T02:40:26Z)
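As flagged in the dMel entry above, here is a minimal NumPy sketch of quantizing log-mel features into equally spaced intensity bins; the function name, bin count, and value range are assumptions for illustration, not the paper's actual settings.

```python
import numpy as np

def discretize_mel(log_mel, n_bins=16, low=None, high=None):
    """Map each value of a (time, n_mels) log-mel spectrogram to one of n_bins
    equally spaced intensity bins and return integer bin indices.
    Hypothetical sketch; the bin count and value range are assumptions."""
    low = log_mel.min() if low is None else low
    high = log_mel.max() if high is None else high
    edges = np.linspace(low, high, n_bins + 1)[1:-1]   # interior bin edges
    return np.digitize(log_mel, edges)                  # integers in 0 .. n_bins - 1

# Example: 100 frames of 80-channel log-mel features become a 100 x 80 grid of
# discrete intensity tokens that a sequence model can consume directly.
tokens = discretize_mel(np.random.randn(100, 80), n_bins=16)
```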
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.