Exploring Effective Distillation of Self-Supervised Speech Models for
Automatic Speech Recognition
- URL: http://arxiv.org/abs/2210.15631v3
- Date: Sun, 22 Oct 2023 13:03:09 GMT
- Title: Exploring Effective Distillation of Self-Supervised Speech Models for
Automatic Speech Recognition
- Authors: Yujin Wang, Changli Tang, Ziyang Ma, Zhisheng Zheng, Xie Chen and
Wei-Qiang Zhang
- Abstract summary: Miniaturization for SSL models has become an important research direction of practical value.
We explore the effective distillation of HuBERT-based SSL models for automatic speech recognition (ASR)
A discriminative loss is introduced for HuBERT to enhance the distillation performance, especially in low-resource scenarios.
- Score: 5.802425107635222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have witnessed great strides in self-supervised learning (SSL)
on the speech processing. The SSL model is normally pre-trained on a great
variety of unlabelled data and a large model size is preferred to increase the
modeling capacity. However, this might limit its potential applications due to
the expensive computation and memory costs introduced by the oversize model.
Miniaturization for SSL models has become an important research direction of
practical value. To this end, we explore the effective distillation of
HuBERT-based SSL models for automatic speech recognition (ASR). First, in order
to establish a strong baseline, a comprehensive study on different student
model structures is conducted. On top of this, as a supplement to the
regression loss widely adopted in previous works, a discriminative loss is
introduced for HuBERT to enhance the distillation performance, especially in
low-resource scenarios. In addition, we design a simple and effective algorithm
to distill the front-end input from waveform to Fbank feature, resulting in 17%
parameter reduction and doubling inference speed, at marginal performance
degradation.
Related papers
- Self-Supervised Radio Pre-training: Toward Foundational Models for Spectrogram Learning [6.1339395157466425]
Foundational deep learning (DL) models are general models, trained on diverse, diverse, and unlabelled datasets.
We introduce Masked Spectrogram Modeling, a novel self-supervised learning approach for pretraining foundational DL models on radio signals.
arXiv Detail & Related papers (2024-11-14T23:56:57Z) - SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation [52.6922833948127]
In this work, we investigate the importance of parameters in pre-trained diffusion models.
We propose a novel model fine-tuning method to make full use of these ineffective parameters.
Our method enhances the generative capabilities of pre-trained models in downstream applications.
arXiv Detail & Related papers (2024-09-10T16:44:47Z) - Efficient Training of Self-Supervised Speech Foundation Models on a
Compute Budget [57.807614181024114]
This paper investigates how to efficiently train speech foundation models with self-supervised learning (SSL) under a limited compute budget.
We examine critical factors in SSL that impact the budget, including model architecture, model size, and data size.
arXiv Detail & Related papers (2024-09-09T10:36:42Z) - Retrieval-based Knowledge Transfer: An Effective Approach for Extreme
Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks.
However, the massive size of these models poses huge challenges for their deployment in real-world applications.
We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z) - Pushing the Limits of Unsupervised Unit Discovery for SSL Speech
Representation [12.506633315768832]
HuBERT is a successful example that utilizes offline clustering to convert speech features into discrete units for a masked language modeling pretext task.
We present an unsupervised method to improve SSL targets.
Two models are proposed, MonoBERT and PolyBERT, which leverage context-independent and context-dependent phoneme-based units for pre-training.
arXiv Detail & Related papers (2023-06-15T07:45:12Z) - To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z) - Self-supervised Neural Factor Analysis for Disentangling Utterance-level
Speech Representations [30.293081541301746]
Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition.
We argue that the problem is caused by the lack of disentangled representations and an utterance-level learning objective.
Our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of labeled data.
arXiv Detail & Related papers (2023-05-14T08:26:24Z) - METRO: Efficient Denoising Pretraining of Large Scale Autoencoding
Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO)
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z) - Self-Supervised Learning for speech recognition with Intermediate layer
supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL)
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z) - Self-Feature Regularization: Self-Feature Distillation Without Teacher
Models [0.0]
Self-Feature Regularization(SFR) is proposed, which uses features in the deep layers to supervise feature learning in the shallow layers.
We firstly use generalization-l2 loss to match local features and a many-to-one approach to distill more intensively in the channel dimension.
arXiv Detail & Related papers (2021-03-12T15:29:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.