PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition
- URL: http://arxiv.org/abs/2106.05933v1
- Date: Thu, 10 Jun 2021 17:32:25 GMT
- Title: PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition
- Authors: Cheng-I Jeff Lai, Yang Zhang, Alexander H. Liu, Shiyu Chang, Yi-Lun
Liao, Yung-Sung Chuang, Kaizhi Qian, Sameer Khurana, David Cox, James Glass
- Abstract summary: Prune-Adjust-Re-Prune (PARP) discovers and finetunes subnetworks for much better ASR performance.
Experiments on low-resource English and multi-lingual ASR show sparse subnetworks exist in pre-trained speech SSL.
- Score: 78.67749936030219
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work on speech self-supervised learning (speech SSL) demonstrated the
benefits of scale in learning rich and transferable representations for
Automatic Speech Recognition (ASR) with limited parallel data. It is then
natural to investigate the existence of sparse and transferable subnetworks in
pre-trained speech SSL models that can achieve even better low-resource ASR
performance. However, directly applying widely adopted pruning methods such as
the Lottery Ticket Hypothesis (LTH) is suboptimal in terms of the computational
cost required. Moreover, contrary to what LTH predicts, the discovered subnetworks
yield minimal performance gain compared to the original dense network. In this
work, we propose Prune-Adjust-Re-Prune (PARP), which discovers and finetunes
subnetworks for much better ASR performance, while only requiring a single
downstream finetuning run. PARP is inspired by our surprising observation that
subnetworks pruned for pre-training tasks only needed to be slightly adjusted
to achieve a sizeable performance boost in downstream ASR tasks. Extensive
experiments on low-resource English and multi-lingual ASR show (1) sparse
subnetworks exist in pre-trained speech SSL, and (2) the computational
advantage and performance gain of PARP over baseline pruning methods. On the
10min Librispeech split without LM decoding, PARP discovers subnetworks from
wav2vec 2.0 with an absolute 10.9%/12.6% WER decrease compared to the full
model. We demonstrate PARP mitigates performance degradation in cross-lingual
mask transfer, and investigate the possibility of discovering a single
subnetwork for 10 spoken languages in one run.
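As a rough, hedged illustration of the prune-adjust-re-prune loop the abstract describes, the sketch below applies layer-wise unstructured magnitude pruning to a small stand-in encoder, finetunes it without enforcing the mask between updates (so pruned weights are free to regrow), and periodically re-prunes back to the target sparsity. The encoder, dummy data, loss, sparsity level, and re-pruning interval are illustrative placeholders, not the paper's wav2vec 2.0 setup.

```python
# Minimal PARP-style sketch. Assumptions: a stand-in encoder instead of wav2vec 2.0,
# dummy data, and a cross-entropy loss standing in for the real CTC finetuning objective.
import torch
import torch.nn as nn

def magnitude_masks(model, sparsity):
    """Layer-wise unstructured magnitude pruning: keep the largest-|w| entries."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:                      # skip biases / norm parameters
            continue
        k = int(p.numel() * sparsity)        # number of weights to prune
        if k == 0:
            masks[name] = torch.ones_like(p)
            continue
        threshold = p.detach().abs().flatten().kthvalue(k).values
        masks[name] = (p.detach().abs() > threshold).float()
    return masks

def apply_masks(model, masks):
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])

# Stand-in "pre-trained" encoder; in practice this would be wav2vec 2.0 / XLSR.
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 32))
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()              # placeholder for the real CTC loss
sparsity, reprune_every = 0.5, 50            # illustrative choices

# 1) Prune: initial mask computed from the pre-trained (upstream) weights.
masks = magnitude_masks(encoder, sparsity)
apply_masks(encoder, masks)

for step in range(200):
    feats = torch.randn(8, 80)               # dummy acoustic features
    labels = torch.randint(0, 32, (8,))      # dummy targets
    loss = loss_fn(encoder(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    # 2) Adjust: ordinary finetuning updates; zeroed weights may regrow because
    #    the mask is NOT enforced between re-pruning steps.
    optimizer.step()
    # 3) Re-prune: periodically recompute the mask at the target sparsity.
    if (step + 1) % reprune_every == 0:
        masks = magnitude_masks(encoder, sparsity)
        apply_masks(encoder, masks)
```

In this sketch, the mask is a starting point that downstream finetuning is allowed to revise, rather than a fixed constraint as in a typical LTH-style pipeline.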
Related papers
- Open Implementation and Study of BEST-RQ for Speech Processing [25.678292575349648]
BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) has shown great performance on Automatic Speech Recognition (ASR).
We show that a random projection quantizer can achieve similar downstream performance as wav2vec 2.0 while decreasing training time by over a factor of two.
arXiv Detail & Related papers (2024-05-07T13:11:37Z)
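For intuition about the entry above, here is a hedged sketch of a random-projection quantizer in the spirit of BEST-RQ: a frozen random projection followed by nearest-neighbour lookup in a frozen random codebook, yielding discrete targets for BERT-style masked prediction. The dimensions, codebook size, and cosine-similarity lookup are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

class RandomProjectionQuantizer:
    """Frozen random projection + frozen random codebook -> discrete frame labels."""
    def __init__(self, feat_dim=80, proj_dim=16, codebook_size=8192, seed=0):
        g = torch.Generator().manual_seed(seed)
        # Neither the projection nor the codebook is ever trained.
        self.projection = torch.randn(feat_dim, proj_dim, generator=g)
        self.codebook = F.normalize(
            torch.randn(codebook_size, proj_dim, generator=g), dim=-1)

    def __call__(self, features):
        """features: (batch, time, feat_dim) -> target indices: (batch, time)."""
        projected = F.normalize(features @ self.projection, dim=-1)
        # Nearest codebook entry; on unit vectors, cosine similarity = dot product.
        return (projected @ self.codebook.T).argmax(dim=-1)

# Usage: the indices serve as labels for masked prediction during pre-training.
quantizer = RandomProjectionQuantizer()
targets = quantizer(torch.randn(4, 100, 80))   # -> torch.Size([4, 100])
```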
- Parameter-Efficient Learning for Text-to-Speech Accent Adaptation [58.356667204518985]
This paper presents a parameter-efficient learning (PEL) method to develop low-resource accent adaptation for text-to-speech (TTS).
A resource-efficient adaptation from a frozen pre-trained TTS model is developed using only 1.2% to 0.8% of the original trainable parameters.
Experiment results show that the proposed methods can achieve competitive naturalness with parameter-efficient decoder fine-tuning.
arXiv Detail & Related papers (2023-05-18T22:02:59Z)
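As a sketch of the parameter-efficient idea in the entry above, the snippet below freezes a stand-in pre-trained module and trains only small residual adapters, leaving a few percent of the parameters trainable. The adapter design, bottleneck width, and stand-in decoder are assumptions for illustration; the paper's PEL setup for TTS may differ.

```python
# Hedged sketch of parameter-efficient adaptation: freeze a pre-trained module and
# train only small residual adapters. The stand-in decoder and bottleneck size are
# illustrative assumptions, not the TTS architecture used in the paper.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, bottleneck=8):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual bottleneck

class AdaptedBlock(nn.Module):
    def __init__(self, frozen_layer, dim):
        super().__init__()
        self.frozen_layer = frozen_layer
        self.adapter = Adapter(dim)

    def forward(self, x):
        return self.adapter(self.frozen_layer(x))

# Stand-in "pre-trained decoder": a stack of large linear layers.
decoder = nn.Sequential(*[AdaptedBlock(nn.Linear(512, 512), 512) for _ in range(6)])
for name, p in decoder.named_parameters():
    p.requires_grad = "adapter" in name        # freeze everything except the adapters

trainable = sum(p.numel() for p in decoder.parameters() if p.requires_grad)
total = sum(p.numel() for p in decoder.parameters())
print(f"trainable fraction: {100 * trainable / total:.1f}%")   # a few percent
```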
- Alleviating Over-smoothing for Unsupervised Sentence Representation [96.19497378628594]
We present a Simple method named Self-Contrastive Learning (SSCL) to alleviate the over-smoothing issue.
Our proposed method is quite simple and can be easily extended to various state-of-the-art models for performance boosting.
arXiv Detail & Related papers (2023-05-09T11:00:02Z)
- A Win-win Deal: Towards Sparse and Robust Pre-trained Language Models [53.87983344862402]
Large-scale pre-trained language models (PLMs) are inefficient in terms of memory footprint and computation.
PLMs tend to rely on the dataset bias and struggle to generalize to out-of-distribution (OOD) data.
Recent studies show that PLMs can be replaced with sparse subnetworks without hurting the performance.
arXiv Detail & Related papers (2022-10-11T07:26:34Z)
- Lottery Pools: Winning More by Interpolating Tickets without Increasing Training or Inference Cost [28.70692607078139]
Lottery tickets (LTs) are able to discover accurate and sparse subnetworks that could be trained in isolation to match the performance of dense networks.
We show that our method achieves significant performance gains in both in-distribution and out-of-distribution scenarios.
arXiv Detail & Related papers (2022-08-23T09:50:55Z)
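To make the "interpolating tickets" idea above concrete, here is a hedged sketch that weight-averages several sparse subnetworks (tickets) and optionally re-applies a sparsity mask so the pooled network stays sparse at inference. The checkpoint format, coefficients, and mask re-application step are illustrative assumptions rather than the paper's exact procedure.

```python
import torch

def interpolate_tickets(ticket_state_dicts, coefficients, mask=None):
    """Weighted average of ticket checkpoints; optionally re-apply a sparsity mask."""
    assert len(ticket_state_dicts) == len(coefficients)
    averaged = {}
    for name in ticket_state_dicts[0]:
        averaged[name] = sum(c * sd[name].float()
                             for c, sd in zip(coefficients, ticket_state_dicts))
        if mask is not None and name in mask:
            averaged[name] = averaged[name] * mask[name]   # keep the pooled network sparse
    return averaged

# Usage sketch: tickets would normally come from iterative-pruning checkpoints.
tickets = [{"layer.weight": torch.randn(4, 4)} for _ in range(3)]
mask = {"layer.weight": (torch.rand(4, 4) > 0.5).float()}
pooled = interpolate_tickets(tickets, [1/3, 1/3, 1/3], mask=mask)
```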
- PM-MMUT: Boosted Phone-mask Data Augmentation using Multi-modeling Unit Training for Robust Uyghur E2E Speech Recognition [5.412341237841356]
Consonant and vowel reduction might cause performance degradation in Uyghur automatic speech recognition.
We propose a multi-modeling unit training (MMUT) architecture fused with PMT (PM-MMUT) to boost the performance of PMT.
Experimental results on Uyghur ASR show that the proposed approaches improve performance significantly, outperforming pure PMT.
arXiv Detail & Related papers (2021-12-13T15:04:33Z)
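The summary above does not spell out the phone-mask augmentation itself, so the following is only a rough guess at its general shape: zeroing out the acoustic frames of randomly selected phone segments during training, given phone-level alignments. The alignment format, masking probability, and function name are hypothetical.

```python
import torch

def phone_mask(features, phone_segments, mask_prob=0.15):
    """Zero out the frames of randomly chosen phone segments.

    features: (time, feat_dim); phone_segments: list of (start_frame, end_frame)."""
    masked = features.clone()
    for start, end in phone_segments:
        if torch.rand(1).item() < mask_prob:
            masked[start:end] = 0.0
    return masked

# Usage sketch with dummy features and a toy phone alignment.
feats = torch.randn(120, 80)
segments = [(0, 20), (20, 55), (55, 90), (90, 120)]   # hypothetical phone boundaries
augmented = phone_mask(feats, segments)
```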
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Towards Demystifying Representation Learning with Non-contrastive Self-supervision [82.80118139087676]
Non-contrastive methods of self-supervised learning learn representations by minimizing the distance between two views of the same image.
Tian et al. (2021) made an initial attempt on the first question and proposed DirectPred that sets the predictor directly.
We show that in a simple linear network, DirectSet($\alpha$) provably learns a desirable projection matrix and also reduces the sample complexity on downstream tasks.
arXiv Detail & Related papers (2021-10-11T00:48:05Z)