Non-Contrastive Self-supervised Learning for Utterance-Level Information
Extraction from Speech
- URL: http://arxiv.org/abs/2208.05445v1
- Date: Wed, 10 Aug 2022 16:56:39 GMT
- Title: Non-Contrastive Self-supervised Learning for Utterance-Level Information
Extraction from Speech
- Authors: Jaejin Cho, Jesús Villalba, Laureano Moro-Velazquez, Najim Dehak
- Abstract summary: Self-supervised learning (SSL) of utterance-level speech representation can be used in speech applications.
This paper applies a non-contrastive self-supervised method to learn utterance-level embeddings.
- Score: 32.058337220323146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent studies, self-supervised pre-trained models tend to outperform
supervised pre-trained models in transfer learning. In particular,
self-supervised learning (SSL) of utterance-level speech representation can be
used in speech applications that require discriminative representation of
consistent attributes within an utterance: speaker, language, emotion, and age.
Existing frame-level self-supervised speech representations, e.g., wav2vec, can
be pooled into utterance-level representations, but the models are usually
large. There are also SSL techniques to learn utterance-level representations.
One of the most successful is a contrastive method, which requires negative
sampling: selecting alternative samples to contrast with the current sample
(anchor). However, without labels, this does not ensure that all the negative
samples belong to classes different from the anchor's class. This
paper applies a non-contrastive self-supervised method to learn utterance-level
embeddings. We adapted DIstillation with NO labels (DINO) from computer vision
to speech. Unlike contrastive methods, DINO does not require negative sampling.
We compared DINO to x-vector trained in a supervised manner. When transferred
to downstream tasks (speaker verification, speech emotion recognition (SER),
and Alzheimer's disease detection), DINO outperformed x-vector. We studied the
influence of several aspects of transfer learning, such as dividing the
fine-tuning process into steps, the chunk length, and augmentation. During
fine-tuning, tuning the last affine layers first and then the whole network
surpassed fine-tuning everything at once. Using shorter chunk lengths, although
they generate more diverse inputs, did not necessarily improve performance,
implying that speech segments of at least a certain length are needed for good
performance, with the required length depending on the application.
Augmentation was helpful in SER.
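Below is a minimal sketch of a DINO-style non-contrastive objective, assuming
PyTorch; the function names, temperatures, and dimensions are illustrative and
not taken from the authors' implementation. The student's output for one chunk
of an utterance is trained to match a centered, temperature-sharpened teacher
distribution computed on another chunk, so no negative sampling is required.
```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tps=0.1, tpt=0.04):
    # Teacher distribution: centered and sharpened with a low temperature,
    # detached so gradients flow only through the student branch.
    t = F.softmax((teacher_out - center) / tpt, dim=-1).detach()
    log_s = F.log_softmax(student_out / tps, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

torch.manual_seed(0)
dim = 256                         # output (prototype) dimension, assumed
center = torch.zeros(dim)         # running center of teacher outputs
s1, s2 = torch.randn(8, dim), torch.randn(8, dim)  # student outputs, two chunks
t1, t2 = torch.randn(8, dim), torch.randn(8, dim)  # teacher outputs, same chunks

# Cross-view objective: the student on one chunk predicts the teacher on the other.
loss = 0.5 * (dino_loss(s1, t2, center) + dino_loss(s2, t1, center))
print(float(loss))

# The teacher is not trained by backpropagation; its weights track the student
# via an exponential moving average (shown here for a single tensor).
def ema_update(teacher_param, student_param, momentum=0.996):
    return momentum * teacher_param + (1.0 - momentum) * student_param
```
Centering and sharpening the teacher outputs is how DINO avoids representation
collapse, which is the role negative samples play in contrastive methods.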
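The two-step fine-tuning schedule reported as superior (last affine layers
first, then the whole network) can be sketched as follows; the module and
optimizer choices are hypothetical and only illustrate the freeze-then-unfreeze
idea, not the authors' code.
```python
import torch.nn as nn
import torch.optim as optim

class UtteranceClassifier(nn.Module):
    """A pre-trained embedding extractor followed by a small affine head."""
    def __init__(self, backbone: nn.Module, emb_dim: int, n_classes: int):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.Linear(emb_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, n_classes)
        )

    def forward(self, x):
        return self.head(self.backbone(x))

def stage_one(model: UtteranceClassifier, lr: float = 1e-3):
    # Step 1: freeze the pre-trained backbone and update only the affine head.
    for p in model.backbone.parameters():
        p.requires_grad = False
    return optim.Adam(model.head.parameters(), lr=lr)

def stage_two(model: UtteranceClassifier, lr: float = 1e-4):
    # Step 2: unfreeze everything and fine-tune the whole network, typically
    # with a smaller learning rate.
    for p in model.parameters():
        p.requires_grad = True
    return optim.Adam(model.parameters(), lr=lr)
```
In practice, stage two would resume from the stage-one checkpoint with a
smaller learning rate so the pre-trained backbone is not disrupted early on.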
Related papers
- SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA in syllabic segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z)
- Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations [30.293081541301746]
Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition.
We argue that the problem is caused by the lack of disentangled representations and an utterance-level learning objective.
Our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of labeled data.
arXiv Detail & Related papers (2023-05-14T08:26:24Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Non-Contrastive Self-Supervised Learning of Utterance-Level Speech Representations [37.665503758394934]
This paper applies a non-contrastive self-supervised learning method on an unlabeled speech corpus to learn utterance-level embeddings.
Unlike the contrastive methods, DINO does not require negative sampling.
These embeddings were evaluated on speaker verification and emotion recognition.
arXiv Detail & Related papers (2022-08-10T16:04:23Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced for enhancing the unsupervised speaker information extraction.
Experiment results on SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale up the training dataset to 94 thousand hours of public audio data and achieve a further performance improvement.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
- Adversarial Training with Contrastive Learning in NLP [0.0]
We propose adversarial training with contrastive learning (ATCL) to adversarially train a language processing task.
The core idea is to make linear perturbations in the embedding space of the input via fast gradient methods (FGM) and train the model to keep the original and perturbed representations close via contrastive learning.
The results show not only an improvement in the quantitative scores (perplexity and BLEU) compared to the baselines, but also good qualitative results at the semantic level for both tasks.
arXiv Detail & Related papers (2021-09-19T07:23:45Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces in the training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
- Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies [91.92060221982064]
We propose Non-Autoregressive Predictive Coding (NPC), a self-supervised method to learn a speech representation in a non-autoregressive manner.
NPC has a conceptually simple objective and can be implemented easily with the introduced Masked Convolution Blocks.
We show that the NPC representation is comparable to other methods in speech experiments on phonetic and speaker classification while being more efficient.
arXiv Detail & Related papers (2020-11-01T02:48:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.