Exploiting Large-scale Teacher-Student Training for On-device Acoustic
Models
- URL: http://arxiv.org/abs/2106.06126v1
- Date: Fri, 11 Jun 2021 02:23:40 GMT
- Title: Exploiting Large-scale Teacher-Student Training for On-device Acoustic
Models
- Authors: Jing Liu, Rupak Vignesh Swaminathan, Sree Hari Krishnan Parthasarathi,
Chunchuan Lyu, Athanasios Mouchtaris, Siegfried Kunzmann
- Abstract summary: We present results from Alexa speech teams on semi-supervised learning (SSL) of acoustic models (AM).
We discuss SSL for AMs in a small footprint setting, showing that a smaller capacity model trained with 1 million hours of unsupervised data can outperform a baseline supervised system by 14.3% word error rate reduction (WERR).
We then switch to SSL using larger student models in low data regimes; while learning efficiency with unsupervised data is higher, student models may outperform teacher models in such a setting.
- Score: 15.237992590162593
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present results from Alexa speech teams on semi-supervised learning (SSL)
of acoustic models (AM) with experiments spanning over 3000 hours of GPU time,
making our study one of the largest of its kind. We discuss SSL for AMs in a
small footprint setting, showing that a smaller capacity model trained with 1
million hours of unsupervised data can outperform a baseline supervised system
by 14.3% word error rate reduction (WERR). When increasing the supervised data
to seven-fold, our gains diminish to 7.1% WERR; to improve SSL efficiency at
larger supervised data regimes, we employ a step-wise distillation into a
smaller model, obtaining a WERR of 14.4%. We then switch to SSL using larger
student models in low data regimes; while learning efficiency with unsupervised
data is higher, student models may outperform teacher models in such a setting.
We develop a theoretical sketch to explain this behavior.
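As a rough illustration of the teacher-student recipe above, the sketch below shows a generic soft-target distillation step and the WERR metric. The temperature, model interfaces, and training loop are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def werr(baseline_wer: float, new_wer: float) -> float:
    """Relative word error rate reduction (WERR), in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target loss: the student mimics the teacher's output distribution.

    Both tensors are (batch, frames, num_targets). The temperature is a common
    but illustrative choice, not a value prescribed by the paper.
    """
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * (t * t)

def train_step(student, teacher, features, optimizer):
    """One SSL step on unlabeled audio: the frozen teacher provides soft
    targets, and only the small-footprint student is updated."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(features)
    loss = distillation_loss(student(features), teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With these definitions, a 14.3% WERR corresponds, for example, to a baseline WER of 10.0% dropping to roughly 8.57%.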
Related papers
- MiniPLM: Knowledge Distillation for Pre-Training Language Models [109.83741809808483]
MiniPLM is a KD framework for pre-training student language models.
For efficiency, MiniPLM performs offline teacher LM inference, allowing KD for multiple student LMs without adding training-time costs.
For flexibility, MiniPLM operates solely on the training corpus, enabling KD across model families.
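A minimal sketch of the offline teacher inference idea described above, assuming a generic corpus of batches and a logits cache on disk; the function names and file layout are hypothetical, not MiniPLM's actual API.

```python
import numpy as np
import torch
import torch.nn.functional as F

def cache_teacher_logits(teacher, corpus_batches, path="teacher_logits.npy"):
    """Run the large teacher once over the corpus and store its logits on disk.
    This expensive pass is paid a single time, however many students follow."""
    teacher.eval()
    chunks = []
    with torch.no_grad():
        for batch in corpus_batches:
            chunks.append(teacher(batch).cpu().numpy())
    np.save(path, np.concatenate(chunks, axis=0))

def distill_from_cache(student, corpus_batches, optimizer, path="teacher_logits.npy"):
    """Train a student against the cached teacher distributions; the teacher
    itself never runs during student training."""
    cached = np.load(path)
    offset = 0
    for batch in corpus_batches:
        n = batch.shape[0]
        teacher_logits = torch.from_numpy(cached[offset:offset + n])
        offset += n
        loss = F.kl_div(F.log_softmax(student(batch), dim=-1),
                        F.softmax(teacher_logits, dim=-1),
                        reduction="batchmean")
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Because the cache is computed once, additional student models can be distilled without further teacher inference, which is the training-time saving the summary refers to.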
arXiv Detail & Related papers (2024-10-22T17:40:32Z) - Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
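A speculative sketch of the online-module idea stated above, assuming a frozen teacher backbone with a small trainable head that is updated in the same optimizer step as the student. This is one plausible reading of the one-line summary, not a verified reproduction of OKD; the optimizer is assumed to hold both the online head's and the student's parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptedTeacher(nn.Module):
    """Frozen teacher backbone plus a small trainable 'online' output head."""

    def __init__(self, teacher: nn.Module, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.teacher = teacher
        for p in self.teacher.parameters():
            p.requires_grad = False                 # the backbone stays frozen
        self.online_head = nn.Linear(hidden_dim, vocab_size)  # small, trainable

    def forward(self, x):
        with torch.no_grad():
            hidden = self.teacher(x)                # hidden states, no gradient
        return self.online_head(hidden)

def okd_step(adapted_teacher, student, inputs, labels, optimizer):
    """Joint update: the online head learns from the labels while the student
    distills from the adapted teacher's (detached) distribution."""
    teacher_logits = adapted_teacher(inputs)        # (batch, seq, vocab)
    student_logits = student(inputs)
    head_loss = F.cross_entropy(teacher_logits.flatten(0, 1), labels.flatten())
    kd_loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                       F.softmax(teacher_logits, dim=-1).detach(),
                       reduction="batchmean")
    loss = head_loss + kd_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```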
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Knowledge Distillation of LLM for Automatic Scoring of Science Education Assessments [4.541309099803903]
This study proposes a method for knowledge distillation (KD) of fine-tuned Large Language Models (LLMs)
We specifically target the challenge of deploying these models on resource-constrained devices.
arXiv Detail & Related papers (2023-12-26T01:24:25Z) - Teaching Language Models to Self-Improve through Interactive Demonstrations [83.9421355808174]
Self-improving ability of large language models has been shown to be absent and difficult to learn for smaller models.
We introduce TriPosT, a training algorithm that endows smaller models with such self-improvement ability.
We show that our approach can improve the performance of a LLaMA-7b model on math and reasoning tasks by up to 7.13%.
arXiv Detail & Related papers (2023-10-20T14:11:04Z) - Dual Learning for Large Vocabulary On-Device ASR [64.10124092250128]
Dual learning is a paradigm for semi-supervised machine learning that seeks to leverage unsupervised data by solving two opposite tasks at once.
We provide an analysis of an on-device-sized streaming conformer trained on the entirety of Librispeech, showing relative WER improvements of 10.7%/5.2% without an LM and 11.7%/16.4% with an LM.
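A generic illustration of the dual-learning idea described above: a primal task (speech to text) and a dual task (text back to speech features) are optimized jointly, so unlabeled audio contributes through a cycle term. The model interfaces, the one-distribution-per-token simplification, and the equal loss weighting are assumptions; they do not reflect the paper's streaming conformer setup.

```python
import torch
import torch.nn.functional as F

def dual_learning_step(asr_model, tts_model, paired_batch, unpaired_audio, optimizer):
    """One joint step over the primal and dual tasks plus a cycle term on
    unlabeled audio. All interfaces here are hypothetical sketches."""
    audio, text = paired_batch                      # audio: (B, T, F); text: (B, U)
    # Supervised primal and dual losses on the paired data.
    primal_loss = F.cross_entropy(asr_model(audio).transpose(1, 2), text)
    dual_loss = F.mse_loss(tts_model(text), audio)
    # Cycle term: pseudo-transcribe the unlabeled audio, then try to rebuild it.
    with torch.no_grad():
        pseudo_text = asr_model(unpaired_audio).argmax(dim=-1)
    cycle_loss = F.mse_loss(tts_model(pseudo_text), unpaired_audio)
    loss = primal_loss + dual_loss + cycle_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```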
arXiv Detail & Related papers (2023-01-11T06:32:28Z) - Speech separation with large-scale self-supervised learning [41.96634125460265]
Self-supervised learning (SSL) methods such as WavLM have shown promising speech separation (SS) results in small-scale simulation-based experiments.
We extend the exploration of SSL-based SS by massively scaling up both the pre-training data (more than 300K hours) and the fine-tuning data (10K hours).
arXiv Detail & Related papers (2022-11-09T20:00:21Z) - FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech
Self-Supervised Learning [12.561034842067887]
We propose FitHuBERT, which is thinner in dimension across almost all model components and deeper in layers than prior speech SSL distillation works.
Our method reduces the model to 23.8% in size and 35.9% in inference time compared to HuBERT.
Also, we achieve 12.1% word error rate and 13.3% phoneme error rate on the SUPERB benchmark, which is superior to prior work.
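A small sketch of the "thinner and deeper" design principle above, contrasting a shallow-but-wide distilled student with a thin-but-deep one. The HuBERT-Base teacher numbers are the standard ones, but the student dimensions are illustrative guesses rather than FitHuBERT's actual configuration.

```python
from dataclasses import dataclass

import torch.nn as nn

@dataclass
class EncoderConfig:
    num_layers: int
    d_model: int
    n_heads: int
    ffn_dim: int

# Illustrative configurations: prior speech SSL distillation tended to keep the
# teacher's width and cut depth aggressively, whereas a FitHuBERT-style student
# keeps the depth and shrinks the hidden dimension instead.
teacher_cfg = EncoderConfig(num_layers=12, d_model=768, n_heads=12, ffn_dim=3072)
shallow_wide_student = EncoderConfig(num_layers=2, d_model=768, n_heads=12, ffn_dim=3072)
thin_deep_student = EncoderConfig(num_layers=12, d_model=192, n_heads=4, ffn_dim=768)

def build_encoder(cfg: EncoderConfig) -> nn.Module:
    layer = nn.TransformerEncoderLayer(d_model=cfg.d_model, nhead=cfg.n_heads,
                                       dim_feedforward=cfg.ffn_dim, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=cfg.num_layers)

student_encoder = build_encoder(thin_deep_student)   # far fewer parameters, full depth
```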
arXiv Detail & Related papers (2022-07-01T17:11:23Z) - Self-Supervised Learning for speech recognition with Intermediate layer
supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL)
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
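A minimal sketch of the intermediate-layer supervision described above: the same masked-prediction loss is applied to selected intermediate layers in addition to the top layer. The layer indices, the shared intermediate head, and the weight are assumptions, not the paper's choices.

```python
import torch.nn as nn

def ils_ssl_loss(layer_outputs, targets, final_head, intermediate_head,
                 intermediate_layers=(3, 7), alpha=1.0):
    """Masked-prediction loss on the top layer plus the same loss on selected
    intermediate layers.

    layer_outputs: list of (batch, frames, hidden) tensors, one per layer;
    targets: (batch, frames) discrete targets for the masked frames.
    """
    criterion = nn.CrossEntropyLoss()
    loss = criterion(final_head(layer_outputs[-1]).flatten(0, 1), targets.flatten())
    for idx in intermediate_layers:
        logits = intermediate_head(layer_outputs[idx])
        loss = loss + alpha * criterion(logits.flatten(0, 1), targets.flatten())
    return loss
```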
arXiv Detail & Related papers (2021-12-16T10:45:05Z) - BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning
for Automatic Speech Recognition [126.5605160882849]
We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency.
We report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks.
arXiv Detail & Related papers (2021-09-27T17:59:19Z) - Contrastive Semi-supervised Learning for ASR [16.070972355201253]
We propose Contrastive Semi-supervised Learning (CSL) for supervised learning of visual objects.
CSL eschews directly predicting teacher-generated pseudo-labels in favor of utilizing them to select positive and negative examples.
It reduces the WER by 8% compared to the standard Cross-Entropy pseudo-labeling (CE-PL) when 10hr of supervised data is used to annotate 75,000hr of videos.
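A simplified sketch of how pseudo-labels can select positives and negatives for a contrastive loss, as described above. This supervised-contrastive-style formulation, the temperature, and the frame-level batching are assumptions rather than CSL's exact objective.

```python
import torch
import torch.nn.functional as F

def csl_contrastive_loss(frame_embeddings, pseudo_labels, temperature=0.1):
    """Pseudo-labels choose positives (same label) and negatives (different
    label) for a contrastive objective instead of serving as cross-entropy
    targets.

    frame_embeddings: (num_frames, dim); pseudo_labels: (num_frames,)
    """
    z = F.normalize(frame_embeddings, dim=-1)
    sim = z @ z.t() / temperature                          # pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    positives = (pseudo_labels.unsqueeze(0) == pseudo_labels.unsqueeze(1)) & ~self_mask
    # Log-probability of each other frame given the anchor, excluding the anchor itself.
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float("-inf")),
                                     dim=1, keepdim=True)
    pos_log_prob = (log_prob * positives).sum(dim=1) / positives.sum(dim=1).clamp(min=1)
    return -pos_log_prob.mean()
```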
arXiv Detail & Related papers (2021-03-09T00:20:37Z) - SEED: Self-supervised Distillation For Visual Representation [34.63488756535054]
We propose a new learning paradigm, named SElf-SupErvised Distillation (SEED), to transfer its representational knowledge into a smaller architecture (as Student) in a self-supervised fashion.
We show that SEED dramatically boosts the performance of small networks on downstream tasks.
arXiv Detail & Related papers (2021-01-12T20:04:50Z)