CCC-wav2vec 2.0: Clustering aided Cross Contrastive Self-supervised
learning of speech representations
- URL: http://arxiv.org/abs/2210.02592v3
- Date: Sat, 13 May 2023 21:42:19 GMT
- Title: CCC-wav2vec 2.0: Clustering aided Cross Contrastive Self-supervised
learning of speech representations
- Authors: Vasista Sai Lodagala and Sreyan Ghosh and S. Umesh
- Abstract summary: We present a new pre-training strategy named ccc-wav2vec 2.0, which uses clustering and an augmentation-based cross-contrastive loss as its self-supervised objective.
ccc-wav2vec 2.0 achieves up to 15.6% and 12.7% relative WER improvement over the baseline wav2vec 2.0 on the test-clean and test-other sets, respectively, of LibriSpeech, without the use of any language model.
- Score: 1.2031796234206138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Self-Supervised Learning has helped reap the benefits of scale from
the available unlabeled data, its learning paradigms are continually being
improved. We present a new pre-training strategy named ccc-wav2vec 2.0, which
uses clustering and an augmentation-based cross-contrastive loss as its
self-supervised objective. Through the clustering module, we scale down the
influence of those negative examples that are highly similar to the positive.
The Cross-Contrastive loss is computed between the encoder output of the
original sample and the quantizer output of its augmentation and vice versa,
bringing robustness to the pre-training strategy. ccc-wav2vec 2.0 achieves up
to 15.6% and 12.7% relative WER improvement over the baseline wav2vec 2.0 on
the test-clean and test-other sets, respectively, of LibriSpeech, without the
use of any language model. The proposed method also achieves up to 14.9%
relative WER improvement over the baseline wav2vec 2.0 when fine-tuned on
Switchboard data. We make all our code publicly available on GitHub.
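The abstract above describes two mechanisms: a cross-contrastive term that pairs the context-network output of an utterance with the quantizer output of its augmentation (and vice versa), and a clustering step that reduces the weight of negatives that are highly similar to the positive. The following is a minimal PyTorch sketch of that idea under stated assumptions; the tensor names, the InfoNCE-style formulation, and the log-space down-weighting of same-cluster negatives are illustrative choices, not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(context, quantized, negatives, neg_weights, temperature=0.1):
        # context, quantized: (B, T, D); negatives: (B, T, K, D); neg_weights: (B, T, K).
        # InfoNCE-style objective: each context vector must pick its quantized target
        # out of K distractors; neg_weights < 1 softens negatives that an (assumed)
        # clustering step judged too similar to the positive.
        pos = F.cosine_similarity(context, quantized, dim=-1) / temperature               # (B, T)
        neg = F.cosine_similarity(context.unsqueeze(2), negatives, dim=-1) / temperature  # (B, T, K)
        neg = neg + torch.log(neg_weights.clamp(min=1e-6))  # down-weight in log space
        logits = torch.cat([pos.unsqueeze(2), neg], dim=2)  # positive sits at index 0
        targets = torch.zeros(logits.shape[0] * logits.shape[1], dtype=torch.long)
        return F.cross_entropy(logits.flatten(0, 1), targets)

    def cross_contrastive_loss(ctx_orig, q_orig, ctx_aug, q_aug,
                               negs_orig, negs_aug, w_orig, w_aug):
        # Cross terms only: original context vs. quantized augmentation, and
        # augmented context vs. quantized original.
        return 0.5 * (contrastive_loss(ctx_orig, q_aug, negs_aug, w_aug)
                      + contrastive_loss(ctx_aug, q_orig, negs_orig, w_orig))

    # Toy shapes: batch 2, 50 masked steps, 256-dim features, 10 negatives per step.
    B, T, K, D = 2, 50, 10, 256
    ctx_o, ctx_a = torch.randn(B, T, D), torch.randn(B, T, D)
    q_o, q_a = torch.randn(B, T, D), torch.randn(B, T, D)
    negs_o, negs_a = torch.randn(B, T, K, D), torch.randn(B, T, K, D)
    # A real clustering module would assign weights below 1 to negatives that share
    # the positive's cluster; uniform weights stand in for it here.
    w_o, w_a = torch.ones(B, T, K), torch.ones(B, T, K)
    print(cross_contrastive_loss(ctx_o, q_o, ctx_a, q_a, negs_o, negs_a, w_o, w_a))

In practice such a cross term would be added to the standard wav2vec 2.0 objective; the exact combination, the augmentation pipeline, and the clustering details are those of the paper and its released code, not this sketch.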
Related papers
- Stuttering Detection Using Speaker Representations and Self-supervised
Contextual Embeddings [7.42741711946564]
We apply speech embeddings extracted from deep learning models pre-trained on large audio datasets for different tasks to stuttering detection.
Compared to standard stuttering detection (SD) systems trained only on the limited SEP-28k dataset, we obtain relative improvements of 12.08%, 28.71%, and 37.9% in unweighted average recall (UAR) over the baselines.
arXiv Detail & Related papers (2023-06-01T14:00:47Z) - Cluster-guided Contrastive Graph Clustering Network [53.16233290797777]
We propose a Cluster-guided Contrastive deep Graph Clustering network (CCGC).
We construct two views of the graph by designing special Siamese encoders whose weights are not shared between the sibling sub-networks.
To construct semantically meaningful negative sample pairs, we regard the centers of different high-confidence clusters as negative samples.
arXiv Detail & Related papers (2023-01-03T13:42:38Z) - GraphLearner: Graph Node Clustering with Fully Learnable Augmentation [76.63963385662426]
Contrastive deep graph clustering (CDGC) leverages the power of contrastive learning to group nodes into different clusters.
We propose Graph Node Clustering with Fully Learnable Augmentation, termed GraphLearner.
It introduces learnable augmentors to generate high-quality and task-specific augmented samples for CDGC.
arXiv Detail & Related papers (2022-12-07T10:19:39Z) - C3: Cross-instance guided Contrastive Clustering [8.953252452851862]
Clustering is the task of gathering similar data samples into clusters without using any predefined labels.
We propose a novel contrastive clustering method, Cross-instance guided Contrastive Clustering (C3).
Our proposed method can outperform state-of-the-art algorithms on benchmark computer vision datasets.
arXiv Detail & Related papers (2022-11-14T06:28:07Z) - data2vec-aqc: Search for the right Teaching Assistant in the
Teacher-Student training setup [1.2031796234206138]
We propose a new Self-Supervised Learning (SSL) algorithm called data2vec-aqc.
Our goal is to improve SSL for speech in domains where both unlabeled and labeled data are limited.
arXiv Detail & Related papers (2022-11-02T16:29:59Z) - Performance-Efficiency Trade-offs in Unsupervised Pre-training for
Speech Recognition [32.61769580342906]
We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency.
We introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions.
arXiv Detail & Related papers (2021-09-14T17:58:09Z) - Semi-supervised Contrastive Learning with Similarity Co-calibration [72.38187308270135]
We propose a novel training strategy, termed Semi-supervised Contrastive Learning (SsCL).
SsCL combines the well-known contrastive loss in self-supervised learning with the cross entropy loss in semi-supervised learning.
We show that SsCL produces more discriminative representations and is beneficial to few-shot learning.
arXiv Detail & Related papers (2021-05-16T09:13:56Z) - Improving Calibration for Long-Tailed Recognition [68.32848696795519]
We propose two methods to improve calibration and performance in such scenarios.
For dataset bias due to different samplers, we propose shifted batch normalization.
Our proposed methods set new records on multiple popular long-tailed recognition benchmark datasets.
arXiv Detail & Related papers (2021-04-01T13:55:21Z) - Doubly Contrastive Deep Clustering [135.7001508427597]
We present a novel Doubly Contrastive Deep Clustering (DCDC) framework, which constructs contrastive loss over both sample and class views.
Specifically, for the sample view, we set the class distribution of the original sample and its augmented version as positive sample pairs.
For the class view, we build the positive and negative pairs from the sample distribution of the class.
In this way, the two contrastive losses constrain the clustering results of mini-batch samples at both the sample and class levels.
arXiv Detail & Related papers (2021-03-09T15:15:32Z) - Dense Contrastive Learning for Self-Supervised Visual Pre-Training [102.15325936477362]
We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.
Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only 1% slower).
arXiv Detail & Related papers (2020-11-18T08:42:32Z) - Supervised Contrastive Learning [42.27949000093086]
We extend the self-supervised batch contrastive approach to the fully-supervised setting.
We analyze two possible versions of the supervised contrastive (SupCon) loss, identifying the best-performing formulation of the loss.
On ResNet-200, we achieve top-1 accuracy of 81.4% on the ImageNet dataset, which is 0.8% above the best number reported for this architecture.
arXiv Detail & Related papers (2020-04-23T17:58:56Z)
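The Supervised Contrastive Learning entry above treats every other same-class sample in the batch as a positive for the anchor. Below is a minimal PyTorch sketch of one common formulation of such a loss; the paper analyzes two variants, so this is an assumption-laden illustration rather than the authors' exact definition.

    import torch
    import torch.nn.functional as F

    def supcon_loss(embeddings, labels, temperature=0.1):
        # embeddings: (N, D) features; labels: (N,) integer class ids.
        # Every other sample with the same label is treated as a positive.
        z = F.normalize(embeddings, dim=1)
        sim = z @ z.t() / temperature                      # (N, N) similarity logits
        n = z.shape[0]
        self_mask = torch.eye(n, dtype=torch.bool)
        pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
        sim = sim.masked_fill(self_mask, float('-inf'))    # exclude the anchor itself
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        pos_counts = pos_mask.sum(1).clamp(min=1)
        per_anchor = -log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_counts
        return per_anchor[pos_mask.sum(1) > 0].mean()      # skip anchors with no positive

    # Toy usage: 8 samples, 16-dim features, 3 classes.
    feats = torch.randn(8, 16)
    class_ids = torch.tensor([0, 0, 1, 1, 2, 2, 0, 1])
    print(supcon_loss(feats, class_ids))

L2-normalized embeddings and the temperature are the usual practical knobs for this family of losses; both values here are placeholders.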
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.