Recycle-and-Distill: Universal Compression Strategy for
Transformer-based Speech SSL Models with Attention Map Reusing and Masking
Distillation
- URL: http://arxiv.org/abs/2305.11685v2
- Date: Thu, 26 Oct 2023 10:43:07 GMT
- Title: Recycle-and-Distill: Universal Compression Strategy for
Transformer-based Speech SSL Models with Attention Map Reusing and Masking
Distillation
- Authors: Kangwook Jang, Sungnyun Kim, Se-Young Yun, Hoirin Kim
- Abstract summary: Transformer-based speech self-supervised learning (SSL) models, such as HuBERT, show surprising performance in various speech processing tasks.
The huge number of parameters in speech SSL models necessitates compression into a more compact model for wider use in academia and small companies.
- Score: 32.97898981684483
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based speech self-supervised learning (SSL) models, such as
HuBERT, show surprising performance in various speech processing tasks.
However, the huge number of parameters in speech SSL models necessitates
compression into a more compact model for wider use in academia and small
companies. In this study, we suggest reusing attention maps across the
Transformer layers, so as to remove key and query parameters while retaining
the number of layers. Furthermore, we propose a novel masking distillation
strategy to improve the student model's speech representation quality. We
extend the distillation loss to utilize both masked and unmasked speech frames
to fully leverage the teacher model's high-quality representation. Our
universal compression strategy yields a student model that achieves a phoneme
error rate (PER) of 7.72% and a word error rate (WER) of 9.96% on the SUPERB
benchmark.
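The attention-map reuse idea removes the query and key projections from layers that recycle the attention probabilities computed by an earlier layer. Below is a minimal PyTorch sketch of such a layer under that assumption; the class name, shapes, and the choice of which layer supplies the map are illustrative and not taken from the authors' code.

```python
import torch
import torch.nn as nn


class ReusedAttentionLayer(nn.Module):
    """Self-attention block that consumes a precomputed attention map.

    Because the attention probabilities are recycled from an earlier layer,
    this layer keeps only the value and output projections; the query and
    key projections (and their parameters) are removed.
    """

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.v_proj = nn.Linear(dim, dim)    # kept
        self.out_proj = nn.Linear(dim, dim)  # kept
        # no q_proj / k_proj here

    def forward(self, x: torch.Tensor, attn_map: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); attn_map: (batch, heads, time, time), already softmax-normalized
        b, t, _ = x.shape
        v = self.v_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        ctx = torch.matmul(attn_map, v)              # apply the recycled attention
        ctx = ctx.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(ctx)
```

In a full student model, only some layers would compute attention from scratch and cache their maps, so that the following layers can be instances of this block.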
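The masking distillation described above trains the student on masked speech while distilling from the teacher's representation of the clean input, with the loss covering both masked and unmasked frames. The sketch below assumes an L1 distance and a simple weighted sum of the two terms; the exact distance and weighting are not specified in the abstract.

```python
import torch
import torch.nn.functional as F


def masking_distillation_loss(student_out: torch.Tensor,
                              teacher_out: torch.Tensor,
                              frame_mask: torch.Tensor,
                              w_masked: float = 1.0,
                              w_unmasked: float = 1.0) -> torch.Tensor:
    """Distillation loss over both masked and unmasked frames.

    student_out, teacher_out: (batch, time, dim) layer outputs.
    frame_mask: (batch, time) boolean, True where the student's input was masked.
    """
    # Per-frame L1 distance between student and teacher representations.
    dist = F.l1_loss(student_out, teacher_out, reduction="none").mean(dim=-1)
    masked_term = dist[frame_mask].mean() if frame_mask.any() else dist.new_zeros(())
    unmasked_term = dist[~frame_mask].mean() if (~frame_mask).any() else dist.new_zeros(())
    return w_masked * masked_term + w_unmasked * unmasked_term
```

Here the teacher would receive the unmasked frames while the student receives the masked ones, so both terms compare the student against the teacher's high-quality targets.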
Related papers
- Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration with Improved Intelligibility [15.463932957443973]
Speech restoration aims at restoring full-band speech with high quality and intelligibility, considering a diverse set of distortions.
MaskSR is a recently proposed generative model for this task.
We show that, with the same MaskSR model capacity and inference time, the proposed model, MaskSR2, significantly reduces the word error rate, a typical metric for intelligibility.
arXiv Detail & Related papers (2024-09-14T08:09:55Z)
- MaskSR: Masked Language Model for Full-band Speech Restoration [7.015213589171985]
Speech restoration aims at restoring high quality speech in the presence of a diverse set of distortions.
We propose MaskSR, a masked language model capable of restoring full-band 44.1 kHz speech jointly considering noise, reverb, clipping, and low bandwidth.
arXiv Detail & Related papers (2024-06-04T08:23:57Z)
- STaR: Distilling Speech Temporal Relation for Lightweight Speech Self-Supervised Learning Models [10.07318014676215]
We propose to compress the speech SSL models by distilling speech temporal relation (STaR).
Our model distilled from HuBERT BASE achieves an overall score of 79.8 on the SUPERB benchmark, the best performance among models with up to 27 million parameters.
arXiv Detail & Related papers (2023-12-14T15:37:37Z)
- Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations [51.89856133895233]
Speech restoration (SR) is a task of converting degraded speech signals into high-quality ones.
In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application.
To make our SR model robust against various degradations, we use (i) a speech representation extracted from w2v-BERT for the input feature, and (ii) a text representation extracted from transcripts via PnG-BERT as a linguistic conditioning feature.
arXiv Detail & Related papers (2023-03-03T01:57:16Z)
- Structured Pruning of Self-Supervised Pre-trained Models for Speech Recognition and Understanding [43.68557263195205]
Self-supervised speech representation learning (SSL) has been shown to be effective in various downstream tasks, but SSL models are usually large and slow.
We propose three task-specific structured pruning methods to deal with such heterogeneous networks.
Experiments on LibriSpeech and SLURP show that the proposed method is more accurate than the original wav2vec2-base with 10% to 30% less computation, and is able to reduce the computation by 40% to 50% without any degradation.
arXiv Detail & Related papers (2023-02-27T20:39:54Z)
- Ultra Fast Speech Separation Model with Teacher Student Learning [44.71171732510265]
An ultra fast Transformer model is proposed to achieve better performance and efficiency with teacher-student learning (T-S learning).
Compared with the small Transformer model trained from scratch, the proposed T-S learning method reduces the word error rate (WER) by more than 5% for both multi-channel and single-channel speech separation.
arXiv Detail & Related papers (2022-04-27T09:02:45Z)
- LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT [69.77358429702873]
We propose LightHuBERT, a once-for-all Transformer compression framework, to find the desired architectures automatically.
Experiments on automatic speech recognition (ASR) and the SUPERB benchmark show that the proposed LightHuBERT enables over $10^9$ architectures.
LightHuBERT achieves performance comparable to the teacher model in most tasks with a 29% reduction in parameters.
arXiv Detail & Related papers (2022-03-29T14:20:55Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on the LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- Sparse Distillation: Speeding Up Text Classification by Using Bigger Models [49.8019791766848]
Distilling state-of-the-art transformer models into lightweight student models is an effective way to reduce computation cost at inference time.
In this paper, we aim to further push the limit of inference speed by exploring a new area in the design space of the student model.
Our experiments show that the student models retain 97% of the RoBERTa-Large teacher performance on a collection of six text classification tasks.
arXiv Detail & Related papers (2021-10-16T10:04:14Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT is capable of achieving competitive performance with those huge models in the downstream tasks.
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than those of the last layer.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)