LightHuBERT: Lightweight and Configurable Speech Representation Learning
with Once-for-All Hidden-Unit BERT
- URL: http://arxiv.org/abs/2203.15610v1
- Date: Tue, 29 Mar 2022 14:20:55 GMT
- Title: LightHuBERT: Lightweight and Configurable Speech Representation Learning
with Once-for-All Hidden-Unit BERT
- Authors: Rui Wang, Qibing Bai, Junyi Ao, Long Zhou, Zhixiang Xiong, Zhihua Wei,
Yu Zhang, Tom Ko, Haizhou Li
- Abstract summary: We propose LightHuBERT, a once-for-all Transformer compression framework, to find the desired architectures automatically.
Experiments on automatic speech recognition (ASR) and the SUPERB benchmark show the proposed LightHuBERT enables over $10^9$ architectures.
LightHuBERT achieves comparable performance to the teacher model in most tasks with a reduction of 29% parameters.
- Score: 69.77358429702873
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised speech representation learning has shown promising results in
various speech processing tasks. However, the pre-trained models, e.g., HuBERT,
are storage-intensive Transformers, limiting their scope of applications under
low-resource settings. To this end, we propose LightHuBERT, a once-for-all
Transformer compression framework, to find the desired architectures
automatically by pruning structured parameters. More precisely, we create a
Transformer-based supernet that is nested with thousands of weight-sharing
subnets and design a two-stage distillation strategy to leverage the
contextualized latent representations from HuBERT. Experiments on automatic
speech recognition (ASR) and the SUPERB benchmark show the proposed LightHuBERT
enables over $10^9$ architectures concerning the embedding dimension, attention
dimension, head number, feed-forward network ratio, and network depth.
LightHuBERT outperforms the original HuBERT on ASR and five SUPERB tasks with
the HuBERT size, achieves comparable performance to the teacher model in most
tasks with a reduction of 29% parameters, and obtains a $3.5\times$ compression
ratio in three SUPERB tasks, e.g., automatic speaker verification, keyword
spotting, and intent classification, with a slight accuracy loss. The code and
pre-trained models are available at
https://github.com/mechanicalsea/lighthubert.
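As a rough illustration of how a count beyond $10^9$ arises, the sketch below samples a weight-sharing subnet over the dimensions named in the abstract (embedding dimension, attention dimension, head number, feed-forward ratio, and depth). The specific value choices and class names are hypothetical, not the released LightHuBERT search space.

```python
import random
from dataclasses import dataclass
from typing import List

# Hypothetical search space; the released LightHuBERT space differs in its exact values.
SEARCH_SPACE = {
    "embed_dim": [512, 640, 768],      # Transformer embedding dimension
    "attn_dim": [512, 640, 768],       # attention dimension (chosen per layer)
    "num_heads": [8, 10, 12],          # attention heads (chosen per layer)
    "ffn_ratio": [3.0, 3.5, 4.0],      # feed-forward expansion ratio (chosen per layer)
    "depth": [10, 11, 12],             # number of Transformer layers
}

@dataclass
class SubnetConfig:
    embed_dim: int
    depth: int
    attn_dim: List[int]
    num_heads: List[int]
    ffn_ratio: List[float]

def sample_subnet(space=SEARCH_SPACE, rng=random) -> SubnetConfig:
    """Draw one weight-sharing subnet configuration from the supernet search space."""
    depth = rng.choice(space["depth"])
    return SubnetConfig(
        embed_dim=rng.choice(space["embed_dim"]),
        depth=depth,
        attn_dim=[rng.choice(space["attn_dim"]) for _ in range(depth)],
        num_heads=[rng.choice(space["num_heads"]) for _ in range(depth)],
        ffn_ratio=[rng.choice(space["ffn_ratio"]) for _ in range(depth)],
    )

def count_architectures(space=SEARCH_SPACE) -> int:
    """Count distinct configurations; per-layer choices multiply quickly past 10^9."""
    per_layer = len(space["attn_dim"]) * len(space["num_heads"]) * len(space["ffn_ratio"])
    return sum(len(space["embed_dim"]) * per_layer ** depth for depth in space["depth"])

if __name__ == "__main__":
    print(sample_subnet())
    print(f"architectures in this toy space: {count_architectures():.3e}")
```

During supernet training, a subnet sampled this way would be instantiated by slicing the shared weights, which is what makes the search once-for-all rather than per-architecture retraining.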
Related papers
- Recycle-and-Distill: Universal Compression Strategy for
Transformer-based Speech SSL Models with Attention Map Reusing and Masking
Distillation [32.97898981684483]
Transformer-based speech self-supervised learning (SSL) models, such as HuBERT, show surprising performance in various speech processing tasks.
The huge number of parameters in speech SSL models necessitates compression to a more compact model for wider use in academia or at small companies.
arXiv Detail & Related papers (2023-05-19T14:07:43Z) - oBERTa: Improving Sparse Transfer Learning via improved initialization,
distillation, and pruning regimes [82.99830498937729]
oBERTa is an easy-to-use set of language models for Natural Language Processing.
It allows NLP practitioners to obtain models that are between 3.8 and 24.3 times faster without expertise in model compression.
We explore the use of oBERTa on seven representative NLP tasks.
arXiv Detail & Related papers (2023-03-30T01:37:19Z) - Structured Pruning of Self-Supervised Pre-trained Models for Speech
Recognition and Understanding [43.68557263195205]
Self-supervised speech representation learning (SSL) has been shown to be effective in various downstream tasks, but SSL models are usually large and slow.
We propose three task-specific structured pruning methods to deal with such heterogeneous networks.
Experiments on LibriSpeech and SLURP show that the proposed method is more accurate than the original wav2vec2-base with 10% to 30% less computation, and is able to reduce the computation by 40% to 50% without any degradation.
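The three task-specific pruning methods are defined in that paper; purely as a generic illustration of structured pruning (not the authors' method), the sketch below ranks attention heads of a single layer by a crude weight-norm score and physically removes the weakest ones.

```python
# Generic structured-pruning sketch: rank the attention heads of one layer by an
# L2-norm importance proxy on the query projection and keep only the top-k heads.
import torch

def head_importance(q_proj: torch.Tensor, num_heads: int) -> torch.Tensor:
    """L2 norm of each head's slice of the query projection as a crude importance score."""
    head_dim = q_proj.shape[0] // num_heads
    return q_proj.view(num_heads, head_dim, -1).norm(dim=(1, 2))

def prune_heads(q_proj, k_proj, v_proj, out_proj, num_heads, keep: int):
    """Physically remove the lowest-scoring heads from the four projection matrices."""
    head_dim = q_proj.shape[0] // num_heads
    scores = head_importance(q_proj, num_heads)
    kept = scores.topk(keep).indices.sort().values          # indices of surviving heads
    rows = torch.cat([torch.arange(h * head_dim, (h + 1) * head_dim) for h in kept])
    return q_proj[rows], k_proj[rows], v_proj[rows], out_proj[:, rows], kept

if __name__ == "__main__":
    embed_dim, num_heads = 768, 12
    q, k, v, o = (torch.randn(embed_dim, embed_dim) for _ in range(4))
    q2, k2, v2, o2, kept = prune_heads(q, k, v, o, num_heads, keep=8)
    print(kept.tolist(), q2.shape, o2.shape)   # 8 surviving heads: (512, 768) and (768, 512)
```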
arXiv Detail & Related papers (2023-02-27T20:39:54Z) - Application of Knowledge Distillation to Multi-task Speech
Representation Learning [2.0908300719428228]
Speech representation learning models use a large number of parameters; even the smallest version has 95 million parameters.
In this paper, we investigate the application of knowledge distillation to speech representation learning models followed by fine-tuning.
Our approach results in nearly 75% reduction in model size while suffering only 0.1% accuracy and 0.9% equal error rate degradation.
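A minimal sketch of the distill-then-fine-tune recipe described here, with tiny stand-in encoders and illustrative hyperparameters rather than the paper's models:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stage 1 distills a frozen teacher into a compact student; stage 2 fine-tunes the
# student with a task head. Encoders and losses are toy stand-ins, not the paper's setup.
class TinyEncoder(nn.Module):
    def __init__(self, in_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)

    def forward(self, feats):                       # feats: (B, T, in_dim)
        out, _ = self.rnn(feats)
        return out                                  # (B, T, hidden)

def distill(student, teacher, proj, feats, steps=3, lr=1e-3):
    """Stage 1: the student (plus a projection) regresses the frozen teacher's outputs."""
    opt = torch.optim.Adam(list(student.parameters()) + list(proj.parameters()), lr=lr)
    for _ in range(steps):
        with torch.no_grad():
            target = teacher(feats)
        loss = F.mse_loss(proj(student(feats)), target)
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def finetune(student, head, feats, labels, steps=3, lr=1e-4):
    """Stage 2: fine-tune the distilled student with a task head (utterance classification)."""
    opt = torch.optim.Adam(list(student.parameters()) + list(head.parameters()), lr=lr)
    for _ in range(steps):
        logits = head(student(feats).mean(dim=1))   # mean-pool over time
        loss = F.cross_entropy(logits, labels)
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

if __name__ == "__main__":
    teacher = TinyEncoder(hidden=512).eval()        # frozen large-model stand-in
    student = TinyEncoder(hidden=128)               # compact student
    proj = nn.Linear(128, 512)                      # bridge student dim to teacher dim
    feats = torch.randn(4, 200, 80)                 # fake log-mel features
    print("distill loss:", distill(student, teacher, proj, feats))
    head = nn.Linear(128, 10)                       # e.g., keyword classes
    labels = torch.randint(0, 10, (4,))
    print("fine-tune loss:", finetune(student, head, feats, labels))
```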
arXiv Detail & Related papers (2022-10-29T14:22:43Z) - WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech
Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z) - DistilHuBERT: Speech Representation Learning by Layer-wise Distillation
of Hidden-unit BERT [69.26447267827454]
Self-supervised speech representation learning methods like wav2vec 2.0 and Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training.
This paper introduces DistilHuBERT, a novel multi-task learning framework to distill hidden representations from a HuBERT model directly.
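A hedged sketch of such layer-wise, multi-task distillation, assuming the common setup in which several prediction heads on a shared student each regress one teacher hidden layer; the teacher layer choice and the exact loss form shown are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Several small prediction heads on top of a shared student representation each regress
# one hidden layer of the frozen teacher, combining L1 and cosine-similarity terms.
class MultiHeadDistiller(nn.Module):
    def __init__(self, student_dim=768, teacher_dim=768, num_heads=3):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(student_dim, teacher_dim) for _ in range(num_heads)])

    def forward(self, student_hidden, teacher_layers, lam=1.0):
        """student_hidden: (B, T, D); teacher_layers: list of (B, T, D) targets."""
        loss = 0.0
        for head, target in zip(self.heads, teacher_layers):
            pred = head(student_hidden)
            cos = F.cosine_similarity(pred, target, dim=-1)
            loss = loss + F.l1_loss(pred, target) - lam * F.logsigmoid(cos).mean()
        return loss / len(self.heads)

if __name__ == "__main__":
    distiller = MultiHeadDistiller()
    student_hidden = torch.randn(2, 150, 768, requires_grad=True)
    teacher_layers = [torch.randn(2, 150, 768) for _ in range(3)]  # e.g., selected HuBERT layers
    loss = distiller(student_hidden, teacher_layers)
    loss.backward()
    print(float(loss))
```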
arXiv Detail & Related papers (2021-10-05T09:34:44Z) - Relaxed Attention: A Simple Method to Boost Performance of End-to-End
Automatic Speech Recognition [27.530537066239116]
We introduce the concept of relaxed attention, which is a gradual injection of a uniform distribution to the encoder-decoder attention weights during training.
We find that transformers trained with relaxed attention outperform the standard baseline models consistently during decoding with external language models.
On WSJ, we set a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, outperforming the state of the art (4.20%) by 13.1% relative.
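A minimal sketch of that idea, blending the encoder-decoder attention weights with a uniform distribution during training; the blending weight gamma and its schedule are placeholders, not the paper's settings:

```python
import torch
import torch.nn.functional as F

# Relaxed attention sketch: at training time, mix the cross-attention weights with a
# uniform distribution over source positions so the decoder cannot over-focus.
def relaxed_attention(query, key, value, gamma: float = 0.1, training: bool = True):
    """query: (B, Tq, D); key/value: (B, Tk, D). Returns the attended values."""
    d = query.shape[-1]
    scores = query @ key.transpose(-2, -1) / d ** 0.5       # (B, Tq, Tk)
    attn = F.softmax(scores, dim=-1)
    if training and gamma > 0.0:
        uniform = torch.full_like(attn, 1.0 / attn.shape[-1])
        attn = (1.0 - gamma) * attn + gamma * uniform        # rows still sum to 1
    return attn @ value

if __name__ == "__main__":
    q = torch.randn(2, 5, 64)        # decoder states
    k = v = torch.randn(2, 20, 64)   # encoder states
    print(relaxed_attention(q, k, v, gamma=0.1, training=True).shape)  # (2, 5, 64)
```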
arXiv Detail & Related papers (2021-07-02T21:01:17Z) - Students Need More Attention: BERT-based Attention Model for Small Data
with Application to Automatic Patient Message Triage [65.7062363323781]
We propose a novel framework based on BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining).
We (i) introduce Label Embeddings for Self-Attention in each layer of BERT, which we call LESA-BERT, and (ii) distill LESA-BERT into smaller variants, aiming to reduce overfitting and model size when working on small datasets.
As an application, our framework is utilized to build a model for patient portal message triage that classifies the urgency of a message into three categories: non-urgent, medium and urgent.
arXiv Detail & Related papers (2020-06-22T03:39:00Z) - Simplified Self-Attention for Transformer-based End-to-End Speech
Recognition [56.818507476125895]
We propose a simplified self-attention (SSAN) layer which employs an FSMN memory block instead of projection layers to form the query and key vectors.
We evaluate the SSAN-based and the conventional SAN-based transformers on the public AISHELL-1, internal 1000-hour and 20,000-hour large-scale Mandarin tasks.
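A loose sketch of replacing the query/key projections with an FSMN-style memory block (a learnable per-channel weighted sum over neighbouring frames); the context width, residual wiring, and whether query and key share one block are assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FSMNMemoryBlock(nn.Module):
    """Per-channel tap weights over a context window, implemented as a depthwise 1-D conv."""
    def __init__(self, dim: int, context: int = 5):
        super().__init__()
        self.taps = nn.Conv1d(dim, dim, kernel_size=2 * context + 1,
                              padding=context, groups=dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) -> residual plus memory over neighbouring frames
        return x + self.taps(x.transpose(1, 2)).transpose(1, 2)

class SimplifiedSelfAttention(nn.Module):
    def __init__(self, dim: int = 256, context: int = 5):
        super().__init__()
        self.memory = FSMNMemoryBlock(dim, context)   # shared here for query and key
        self.value = nn.Linear(dim, dim)              # value still uses a projection

    def forward(self, x):
        q = k = self.memory(x)                        # no separate W_q / W_k matrices
        attn = F.softmax(q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)
        return attn @ self.value(x)

if __name__ == "__main__":
    layer = SimplifiedSelfAttention()
    print(layer(torch.randn(2, 50, 256)).shape)       # torch.Size([2, 50, 256])
```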
arXiv Detail & Related papers (2020-05-21T04:55:59Z) - schuBERT: Optimizing Elements of BERT [22.463154358632472]
We revisit the architecture choices of BERT in efforts to obtain a lighter model.
We show that much more efficient, lighter BERT models can be obtained by reducing the correct, algorithmically chosen architecture design dimensions.
In particular, our schuBERT gives $6.6\%$ higher average accuracy on the GLUE and SQuAD datasets compared to BERT with three encoder layers.
arXiv Detail & Related papers (2020-05-09T21:56:04Z)