Non-Contrastive Self-Supervised Learning of Utterance-Level Speech
Representations
- URL: http://arxiv.org/abs/2208.05413v1
- Date: Wed, 10 Aug 2022 16:04:23 GMT
- Title: Non-Contrastive Self-Supervised Learning of Utterance-Level Speech
Representations
- Authors: Jaejin Cho, Raghavendra Pappagari, Piotr Żelasko, Laureano
Moro-Velazquez, Jesús Villalba, Najim Dehak
- Abstract summary: This paper applies a non-contrastive self-supervised learning method on an unlabeled speech corpus to learn utterance-level embeddings.
Unlike the contrastive methods, DINO does not require negative sampling.
These embeddings were evaluated on speaker verification and emotion recognition.
- Score: 37.665503758394934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Considering the abundance of unlabeled speech data and the high labeling
costs, unsupervised learning methods can be essential for better system
development. One of the most successful approaches is contrastive self-supervised
learning, which requires negative sampling: sampling alternative samples to
contrast with the current sample (anchor). However, it is hard to ensure that all
the negative samples belong to classes different from the anchor class without
labels. This paper applies a non-contrastive self-supervised learning method on
an unlabeled speech corpus to learn utterance-level embeddings. We used
DIstillation with NO labels (DINO), proposed in computer vision, and adapted it
to the speech domain. Unlike the contrastive methods, DINO does not require
negative sampling. These embeddings were evaluated on speaker verification and
emotion recognition. In speaker verification, the unsupervised DINO embedding
with cosine scoring provided 4.38% EER on the VoxCeleb1 test trial. This
outperforms the best contrastive self-supervised method by 40% relative in EER.
An iterative pseudo-labeling training pipeline, not requiring speaker labels,
further improved the EER to 1.89%. In emotion recognition, the DINO embedding
achieved micro-F1 scores of 60.87%, 79.21%, and 56.98% on IEMOCAP, Crema-D, and
MSP-Podcast, respectively. These results suggest that the DINO embedding
generalizes across different speech applications.
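
To make the non-contrastive objective concrete, here is a minimal PyTorch-style sketch of a DINO-like self-distillation setup: a student network and an EMA-updated teacher see different augmented views of the same utterance, and the student is trained to match the teacher's centered, sharpened output distribution, so no negative samples are needed. The class and function names, hyperparameter values, and the cosine-scoring helper are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F


class DINOLoss(torch.nn.Module):
    """Sketch of a DINO-style self-distillation loss (no negative sampling).

    The student matches the teacher's distribution over `out_dim` prototypes;
    the teacher output is centered (to discourage collapse onto one prototype)
    and sharpened with a lower temperature. All hyperparameters are assumed.
    """

    def __init__(self, out_dim=4096, student_temp=0.1, teacher_temp=0.04,
                 center_momentum=0.9):
        super().__init__()
        self.student_temp = student_temp
        self.teacher_temp = teacher_temp
        self.center_momentum = center_momentum
        self.register_buffer("center", torch.zeros(1, out_dim))

    def forward(self, student_out, teacher_out):
        # student_out, teacher_out: (batch, out_dim) logits from two views.
        s = F.log_softmax(student_out / self.student_temp, dim=-1)
        t = F.softmax((teacher_out - self.center) / self.teacher_temp, dim=-1)
        loss = -(t.detach() * s).sum(dim=-1).mean()  # cross-entropy H(teacher, student)
        self._update_center(teacher_out)
        return loss

    @torch.no_grad()
    def _update_center(self, teacher_out):
        # Running mean of teacher logits, used to center future teacher outputs.
        batch_center = teacher_out.mean(dim=0, keepdim=True)
        self.center = (self.center_momentum * self.center
                       + (1.0 - self.center_momentum) * batch_center)


@torch.no_grad()
def update_teacher(student, teacher, momentum=0.996):
    """Exponential-moving-average update of the teacher from the student."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)


def cosine_score(enroll_emb, test_emb):
    """Cosine scoring for speaker verification: higher means more likely same speaker."""
    return F.cosine_similarity(enroll_emb, test_emb, dim=-1)
```

The iterative pseudo-labeling stage mentioned in the abstract would sit on top of such embeddings (e.g., clustering them to obtain pseudo speaker labels and retraining a classifier) and is not shown here.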
Related papers
- DenoSent: A Denoising Objective for Self-Supervised Sentence
Representation Learning [59.4644086610381]
We propose a novel denoising objective that inherits from another perspective, i.e., the intra-sentence perspective.
By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form.
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
arXiv Detail & Related papers (2024-01-24T17:48:45Z)
- Adaptive Integration of Partial Label Learning and Negative Learning for
Enhanced Noisy Label Learning [23.847160480176697]
We propose a simple yet powerful idea called NPN, which revolutionizes noisy label learning by integrating partial label learning (PLL) and negative learning (NL).
We generate reliable complementary labels using all non-candidate labels for NL to enhance model robustness through indirect supervision.
Experiments conducted on both synthetically corrupted and real-world noisy datasets demonstrate the superiority of NPN compared to other state-of-the-art (SOTA) methods.
arXiv Detail & Related papers (2023-12-15T03:06:19Z)
- Robust Positive-Unlabeled Learning via Noise Negative Sample
Self-correction [48.929877651182885]
Learning from positive and unlabeled data is known as positive-unlabeled (PU) learning in the literature.
We propose a new robust PU learning method with a training strategy motivated by the nature of human learning.
arXiv Detail & Related papers (2023-08-01T04:34:52Z)
- Dist-PU: Positive-Unlabeled Learning from a Label Distribution
Perspective [89.5370481649529]
We propose a label distribution perspective for PU learning in this paper.
Motivated by this perspective, we pursue consistency between the predicted and ground-truth label distributions.
Experiments on three benchmark datasets validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-06T07:38:29Z)
- Non-Contrastive Self-supervised Learning for Utterance-Level Information
Extraction from Speech [32.058337220323146]
Self-supervised learning (SSL) of utterance-level speech representation can be used in speech applications.
This paper applies a non-contrastive self-supervised method to learn utterance-level embeddings.
arXiv Detail & Related papers (2022-08-10T16:56:39Z)
- Improving Contrastive Learning of Sentence Embeddings with
Case-Augmented Positives and Retrieved Negatives [17.90820242798732]
Unsupervised contrastive learning methods still lag far behind their supervised counterparts.
We propose switch-case augmentation to flip the case of the first letter of randomly selected words in a sentence.
For negative samples, we sample hard negatives from the whole dataset based on a pre-trained language model.
arXiv Detail & Related papers (2022-06-06T09:46:12Z)
- Context-based Virtual Adversarial Training for Text Classification with
Noisy Labels [1.9508698179748525]
We propose context-based virtual adversarial training (ConVAT) to prevent a text classifier from overfitting to noisy labels.
Unlike the previous works, the proposed method performs the adversarial training at the context level rather than the inputs.
We conduct extensive experiments on four text classification datasets with two types of label noises.
arXiv Detail & Related papers (2022-05-29T14:19:49Z)
- UNICON: Combating Label Noise Through Uniform Selection and Contrastive
Learning [89.56465237941013]
We propose UNICON, a simple yet effective sample selection method which is robust to high label noise.
We obtain an 11.4% improvement over the current state-of-the-art on the CIFAR100 dataset with a 90% noise rate.
arXiv Detail & Related papers (2022-03-28T07:36:36Z)
- Multi-Objective Interpolation Training for Robustness to Label Noise [17.264550056296915]
We show that standard supervised contrastive learning degrades in the presence of label noise.
We propose a novel label noise detection method that exploits the robust feature representations learned via contrastive learning.
Experiments on synthetic and real-world noise benchmarks demonstrate that MOIT/MOIT+ achieves state-of-the-art results.
arXiv Detail & Related papers (2020-12-08T15:01:54Z)
- Contrastive Learning with Hard Negative Samples [80.12117639845678]
We develop a new family of unsupervised sampling methods for selecting hard negative samples.
A limiting case of this sampling results in a representation that tightly clusters each class, and pushes different classes as far apart as possible.
The proposed method improves downstream performance across multiple modalities, requires only a few additional lines of code to implement, and introduces no computational overhead (a sketch of such a weighted contrastive loss follows this list).
arXiv Detail & Related papers (2020-10-09T14:18:53Z)
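
For contrast with the non-contrastive DINO objective above, the sketch below shows the kind of negative-sampling-based InfoNCE loss that the last entry refines: harder negatives (those most similar to the anchor) receive more weight. The specific reweighting used here (a softmax over negative similarities scaled by `beta`) is an assumed illustration of the idea, not the exact estimator from that paper.

```python
import torch
import torch.nn.functional as F


def infonce_with_hard_negatives(anchor, positive, negatives,
                                temperature=0.1, beta=1.0):
    """Illustrative InfoNCE loss that up-weights harder negatives.

    anchor, positive: (B, D) embeddings of two views of the same samples.
    negatives:        (N, D) pool of negative embeddings.
    beta controls how strongly negatives similar to the anchor are
    emphasized; beta = 0 gives uniform weights (plain InfoNCE behavior).
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_logit = (anchor * positive).sum(dim=-1) / temperature   # (B,)
    neg_logits = anchor @ negatives.T / temperature             # (B, N)

    # Importance weights: negatives closer to the anchor get more mass.
    weights = torch.softmax(beta * neg_logits, dim=-1)          # (B, N)
    neg_mass = negatives.size(0) * (weights * neg_logits.exp()).sum(dim=-1)

    loss = -pos_logit + torch.log(pos_logit.exp() + neg_mass)
    return loss.mean()
```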
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.