Ultra Fast Speech Separation Model with Teacher Student Learning
- URL: http://arxiv.org/abs/2204.12777v1
- Date: Wed, 27 Apr 2022 09:02:45 GMT
- Title: Ultra Fast Speech Separation Model with Teacher Student Learning
- Authors: Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Takuya Yoshioka, Shujie Liu,
Jinyu Li, Xiangzhan Yu
- Abstract summary: An ultra fast Transformer model is proposed to achieve both better performance and efficiency with teacher-student learning (T-S learning).
Compared with the small Transformer model trained from scratch, the proposed T-S learning method reduces the word error rate (WER) by more than 5% for both multi-channel and single-channel speech separation.
- Score: 44.71171732510265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer has been successfully applied to speech separation recently with
its strong long-range dependency modeling capacity using a self-attention mechanism.
However, Transformer tends to have heavy run-time costs due to the deep encoder
layers, which hinders its deployment on edge devices. A small Transformer model
with fewer encoder layers is preferred for computational efficiency, but it is
prone to performance degradation. In this paper, an ultra fast speech
separation Transformer model is proposed to achieve both better performance and
efficiency with teacher-student learning (T-S learning). We introduce
layer-wise T-S learning and objective shifting mechanisms to guide the small
student model to learn intermediate representations from the large teacher
model. Compared with the small Transformer model trained from scratch, the
proposed T-S learning method reduces the word error rate (WER) by more than 5%
for both multi-channel and single-channel speech separation on the LibriCSS
dataset. Utilizing more unlabeled speech data, our ultra fast speech separation
models achieve more than 10% relative WER reduction.
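The two mechanisms named in the abstract, layer-wise T-S learning and objective shifting, can be pictured with the short PyTorch sketch below. This is only an illustrative reading of the abstract, not the paper's implementation: the student-to-teacher layer mapping, the linear shifting schedule, and the plain MSE stand-in for the permutation-invariant separation loss are assumptions of this sketch, which also assumes the student and teacher share the same hidden size.

```python
# Illustrative sketch of layer-wise teacher-student (T-S) learning with
# objective shifting, assuming PyTorch. The layer mapping, the linear
# schedule, and the MSE task loss are assumptions, not the paper's recipe.
import torch
import torch.nn.functional as F


def layerwise_ts_loss(student_hiddens, teacher_hiddens, layer_map):
    """MSE between student layers and the teacher layers they mimic.

    student_hiddens / teacher_hiddens: lists of [B, T, D] hidden states
    (same hidden size D assumed; otherwise a projection would be needed).
    layer_map: {student_layer_idx: teacher_layer_idx}, e.g. an evenly
    spaced mapping from a shallow student to a deep teacher.
    """
    loss = 0.0
    for s_idx, t_idx in layer_map.items():
        loss = loss + F.mse_loss(student_hiddens[s_idx],
                                 teacher_hiddens[t_idx].detach())
    return loss / len(layer_map)


def objective_shift(step, total_steps):
    """Weight moving linearly from pure distillation (1.0) to pure task loss (0.0)."""
    return max(0.0, 1.0 - step / total_steps)


def training_loss(step, total_steps, student_out, teacher_out, targets):
    # The separation objective is abstracted as plain MSE here; the paper
    # would use its own (permutation-invariant) separation criterion.
    task_loss = F.mse_loss(student_out["separated"], targets)
    kd_loss = layerwise_ts_loss(student_out["hiddens"],
                                teacher_out["hiddens"],
                                layer_map={0: 3, 1: 7})  # assumed mapping
    alpha = objective_shift(step, total_steps)
    return alpha * kd_loss + (1.0 - alpha) * task_loss
```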
Related papers
- Recycle-and-Distill: Universal Compression Strategy for
Transformer-based Speech SSL Models with Attention Map Reusing and Masking
Distillation [32.97898981684483]
Transformer-based speech self-supervised learning (SSL) models, such as HuBERT, show surprising performance in various speech processing tasks.
The huge number of parameters in speech SSL models necessitates compression into more compact models for wider use in academia and at small companies.
arXiv Detail & Related papers (2023-05-19T14:07:43Z) - Structured Pruning of Self-Supervised Pre-trained Models for Speech
Recognition and Understanding [43.68557263195205]
Self-supervised speech representation learning (SSL) has been shown to be effective in various downstream tasks, but SSL models are usually large and slow.
We propose three task-specific structured pruning methods to deal with such heterogeneous networks.
Experiments on LibriSpeech and SLURP show that the proposed method is more accurate than the original wav2vec2-base with 10% to 30% less computation, and can reduce the computation by 40% to 50% without any degradation.
arXiv Detail & Related papers (2023-02-27T20:39:54Z) - CHAPTER: Exploiting Convolutional Neural Network Adapters for
Self-supervised Speech Models [62.60723685118747]
Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data.
We propose an efficient tuning method specifically designed for SSL speech models, applying CNN adapters at the feature extractor.
We empirically find that adding CNN adapters to the feature extractor helps adaptation on emotion and speaker tasks (see the adapter sketch after this list).
arXiv Detail & Related papers (2022-12-01T08:50:12Z) - LegoNet: A Fast and Exact Unlearning Architecture [59.49058450583149]
Machine unlearning aims to erase the impact of specific training samples from a trained model upon deletion requests.
We present a novel network, namely LegoNet, which adopts the framework of "fixed encoder + multiple adapters".
We show that LegoNet accomplishes fast and exact unlearning while maintaining acceptable performance, synthetically outperforming unlearning baselines.
arXiv Detail & Related papers (2022-10-28T09:53:05Z) - Multi-stage Progressive Compression of Conformer Transducer for
On-device Speech Recognition [7.450574974954803]
Limited memory bandwidth on smart devices prompts the development of smaller Automatic Speech Recognition (ASR) models.
Knowledge distillation (KD) is a popular model compression approach that has been shown to achieve smaller model sizes.
We propose a multi-stage progressive approach to compress the conformer transducer model using KD.
arXiv Detail & Related papers (2022-10-01T02:23:00Z) - Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning
with Self-Knowledge Distillation [11.52842516726486]
We propose a Transformer-based ASR model with a time-reduction layer incorporated inside the Transformer encoder layers (a sketch of such a layer appears after this list).
We also introduce a fine-tuning approach for pre-trained ASR models using self-knowledge distillation (S-KD), which further improves the performance of our ASR model.
With language model (LM) fusion, we achieve new state-of-the-art word error rate (WER) results for Transformer-based ASR models.
arXiv Detail & Related papers (2021-03-17T21:02:36Z) - Efficient End-to-End Speech Recognition Using Performers in Conformers [74.71219757585841]
We propose to reduce the complexity of model architectures in addition to model sizes.
The proposed model yields competitive performance on the LibriSpeech corpus with 10 million parameters and linear complexity.
arXiv Detail & Related papers (2020-11-09T05:22:57Z) - TERA: Self-Supervised Learning of Transformer Encoder Representation for
Speech [63.03318307254081]
TERA stands for Transformer Encoder Representations from Alteration.
We use alteration along three axes to pre-train Transformers on a large amount of unlabeled speech.
TERA can be used for speech representation extraction or fine-tuning with downstream models.
arXiv Detail & Related papers (2020-07-12T16:19:00Z) - MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression
of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student (a sketch of this last-layer attention distillation appears after this list).
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different parameter sizes of student models.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
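For the CHAPTER entry above, here is a minimal sketch of what a CNN adapter attached to a frozen SSL feature extractor could look like, assuming PyTorch; the kernel size, bottleneck width, and residual placement are illustrative assumptions rather than that paper's exact design.

```python
# Minimal sketch of a CNN adapter applied to the output of a (frozen) SSL
# feature extractor, assuming PyTorch. Bottleneck width, kernel size, and
# the residual connection are assumptions, not the CHAPTER paper's design.
import torch
import torch.nn as nn


class CNNAdapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int = 64, kernel_size: int = 3):
        super().__init__()
        self.down = nn.Conv1d(d_model, bottleneck, kernel_size, padding=kernel_size // 2)
        self.act = nn.ReLU()
        self.up = nn.Conv1d(bottleneck, d_model, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, T, D] features from the frozen feature extractor.
        y = x.transpose(1, 2)            # Conv1d expects [B, D, T]
        y = self.up(self.act(self.down(y)))
        return x + y.transpose(1, 2)     # residual: only adapter weights train
```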
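For the time-reduction layer entry, here is a minimal sketch of one common realization, frame concatenation followed by a linear projection, assuming PyTorch; the reduction factor and the layer's placement inside the encoder are assumptions of this sketch.

```python
# Minimal sketch of a time-reduction layer: every `factor` consecutive frames
# are stacked along the feature dimension and projected back, shortening the
# sequence seen by subsequent encoder layers. Assumes PyTorch; the factor and
# projection size are illustrative assumptions.
import torch
import torch.nn as nn


class TimeReduction(nn.Module):
    def __init__(self, d_model: int, factor: int = 2):
        super().__init__()
        self.factor = factor
        self.proj = nn.Linear(d_model * factor, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, T, D] -> [B, T // factor, D]
        b, t, d = x.shape
        t_trim = (t // self.factor) * self.factor   # drop trailing frames
        x = x[:, :t_trim].reshape(b, t_trim // self.factor, d * self.factor)
        return self.proj(x)
```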
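For the MiniLM entry, here is a minimal sketch of last-layer self-attention distillation, assuming PyTorch; MiniLM's value-relation term is omitted for brevity, and the tensor shapes are illustrative.

```python
# Minimal sketch of MiniLM-style deep self-attention distillation restricted
# to the last Transformer layer, assuming PyTorch. The value-relation term
# is omitted; shapes are illustrative assumptions.
import torch


def last_layer_attention_distillation(student_attn: torch.Tensor,
                                      teacher_attn: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over attention distributions.

    student_attn, teacher_attn: [batch, heads, query, key] attention
    probabilities from each model's last Transformer layer (softmax outputs).
    """
    eps = 1e-8
    t = teacher_attn + eps
    s = student_attn + eps
    kl = (t * (t.log() - s.log())).sum(dim=-1)  # per (batch, head, query)
    return kl.mean()
```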