DinoSR: Self-Distillation and Online Clustering for Self-supervised
Speech Representation Learning
- URL: http://arxiv.org/abs/2305.10005v2
- Date: Tue, 16 Jan 2024 05:43:20 GMT
- Title: DinoSR: Self-Distillation and Online Clustering for Self-supervised
Speech Representation Learning
- Authors: Alexander H. Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, James R.
Glass
- Abstract summary: We introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR)
DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network.
We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and provide a detailed analysis of the model and the learned discrete units.
- Score: 140.96990096377127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce self-distillation and online clustering for
self-supervised speech representation learning (DinoSR) which combines masked
language modeling, self-distillation, and online clustering. We show that these
concepts complement each other and result in a strong representation learning
model for speech. DinoSR first extracts contextualized embeddings from the
input audio with a teacher network, then runs an online clustering system on
the embeddings to yield a machine-discovered phone inventory, and finally uses
the discretized tokens to guide a student network. We show that DinoSR
surpasses previous state-of-the-art performance in several downstream tasks,
and provide a detailed analysis of the model and the learned discrete units.
Related papers
- A Probabilistic Model Behind Self-Supervised Learning [53.64989127914936]
In self-supervised learning (SSL), representations are learned via an auxiliary task without annotated labels.
We present a generative latent variable model for self-supervised learning.
We show that several families of discriminative SSL, including contrastive methods, induce a comparable distribution over representations.
arXiv Detail & Related papers (2024-02-02T13:31:17Z) - Improved Contextual Recognition In Automatic Speech Recognition Systems
By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution consists of using Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Networks (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z) - Self-supervised Fine-tuning for Improved Content Representations by
Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z) - Distilling Knowledge from Self-Supervised Teacher by Embedding Graph
Alignment [52.704331909850026]
We formulate a new knowledge distillation framework to transfer the knowledge from self-supervised pre-trained models to any other student network.
Inspired by the spirit of instance discrimination in self-supervised learning, we model the instance-instance relations by a graph formulation in the feature embedding space.
Our distillation scheme can be flexibly applied to transfer the self-supervised knowledge to enhance representation learning on various student networks.
arXiv Detail & Related papers (2022-11-23T19:27:48Z) - An Exploration of Self-Supervised Pretrained Representations for
End-to-End Speech Recognition [98.70304981174748]
We focus on the general applications of pretrained speech representations, on advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z) - Knowledge Distillation By Sparse Representation Matching [107.87219371697063]
We propose Sparse Representation Matching (SRM) to transfer intermediate knowledge from one Convolutional Network (CNN) to another by utilizing sparse representation.
We formulate as a neural processing block, which can be efficiently optimized using gradient descent and integrated into any CNN in a plug-and-play manner.
Our experiments demonstrate that is robust to architectural differences between the teacher and student networks, and outperforms other KD techniques across several datasets.
arXiv Detail & Related papers (2021-03-31T11:47:47Z) - Distilling Visual Priors from Self-Supervised Learning [24.79633121345066]
Convolutional Neural Networks (CNNs) are prone to overfit small training datasets.
We present a novel two-phase pipeline that leverages self-supervised learning and knowledge distillation to improve the generalization ability of CNN models for image classification under the data-deficient setting.
arXiv Detail & Related papers (2020-08-01T13:07:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.