Keep Decoding Parallel with Effective Knowledge Distillation from
Language Models to End-to-end Speech Recognisers
- URL: http://arxiv.org/abs/2401.11700v1
- Date: Mon, 22 Jan 2024 05:46:11 GMT
- Title: Keep Decoding Parallel with Effective Knowledge Distillation from
Language Models to End-to-end Speech Recognisers
- Authors: Michael Hentschel, Yuta Nishikawa, Tatsuya Komatsu, Yusuke Fujita
- Abstract summary: This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers.
Our method shows that language model (LM) information can be more effectively distilled into an ASR model using both the intermediate layers and the final layer.
Using our method, we achieve better recognition accuracy than with shallow fusion of an external LM, allowing us to maintain fast parallel decoding.
- Score: 19.812986973537143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study presents a novel approach for knowledge distillation (KD) from a
BERT teacher model to an automatic speech recognition (ASR) model using
intermediate layers. To distil the teacher's knowledge, we use an attention
decoder that learns from BERT's token probabilities. Our method shows that
language model (LM) information can be more effectively distilled into an ASR
model using both the intermediate layers and the final layer. By using the
intermediate layers as distillation targets, we can more effectively distil LM
knowledge into the lower network layers. Using our method, we achieve better
recognition accuracy than with shallow fusion of an external LM, allowing us to
maintain fast parallel decoding. Experiments on the LibriSpeech dataset
demonstrate the effectiveness of our approach in enhancing greedy decoding with
connectionist temporal classification (CTC).
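The abstract gives no implementation details, but the core idea it describes — a CTC-trained ASR model with attention-decoder branches at intermediate and final encoder layers, trained towards BERT's token probabilities — can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the module names (`DistilledCTCModel`, `attention_decoder`), the choice of distillation layers, and the loss weight are all placeholders invented for the example.

```python
# Minimal sketch (assumptions, not the authors' code): CTC training combined
# with knowledge distillation from BERT token probabilities, applied at
# intermediate and final encoder layers through an attention decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistilledCTCModel(nn.Module):
    def __init__(self, encoder, attention_decoder, d_model, vocab_size,
                 kd_layers=(6, 12), kd_weight=0.3):  # illustrative choices only
        super().__init__()
        self.encoder = encoder                      # assumed to return every layer's output
        self.attention_decoder = attention_decoder  # used only for distillation
        self.ctc_head = nn.Linear(d_model, vocab_size)
        self.kd_layers = kd_layers
        self.kd_weight = kd_weight

    def forward(self, speech, speech_lens, tokens, token_lens, bert_probs):
        # layer_outputs: list of (B, T, D) hidden states, one per encoder layer;
        # out_lens: encoder output lengths after any subsampling.
        layer_outputs, out_lens = self.encoder(speech, speech_lens)

        # Standard CTC branch on the final encoder layer; at inference time
        # greedy decoding uses only this branch, keeping decoding parallel.
        log_probs = F.log_softmax(self.ctc_head(layer_outputs[-1]), dim=-1)
        ctc_loss = F.ctc_loss(log_probs.transpose(0, 1), tokens,
                              out_lens, token_lens, blank=0, zero_infinity=True)

        # KD branches: the attention decoder predicts token distributions from
        # selected intermediate layers and the final layer, and is pulled
        # towards the BERT teacher's token probabilities with a KL divergence.
        kd_loss = 0.0
        for idx in self.kd_layers:
            dec_logits = self.attention_decoder(layer_outputs[idx - 1], tokens)
            kd_loss = kd_loss + F.kl_div(F.log_softmax(dec_logits, dim=-1),
                                         bert_probs, reduction="batchmean")

        return ctc_loss + self.kd_weight * kd_loss
```

Because the distillation branches are dropped at inference, decoding remains a single greedy pass over the frame-wise CTC posteriors; shallow fusion, the baseline mentioned in the abstract, instead adds an external LM score at every step of a sequential beam search.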
Related papers
- Make BERT-based Chinese Spelling Check Model Enhanced by Layerwise Attention and Gaussian Mixture Model [33.446533426654995]
We design a heterogeneous knowledge-infused framework to strengthen BERT-based CSC models.
We propose a novel form of n-gram-based layerwise self-attention to generate a multilayer representation.
Experimental results show that our proposed framework yields a stable performance boost over four strong baseline models.
arXiv Detail & Related papers (2023-12-27T16:11:07Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- LEAD: Liberal Feature-based Distillation for Dense Retrieval [67.48820723639601]
Knowledge distillation is often used to transfer knowledge from a strong teacher model to a relatively weak student model.
Traditional methods include response-based methods and feature-based methods.
In this paper, we propose a liberal feature-based distillation method (LEAD).
arXiv Detail & Related papers (2022-12-10T06:30:54Z)
- Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- Knowledge distillation from language model to acoustic model: a hierarchical multi-task learning approach [12.74181185088531]
Cross-modal knowledge distillation is a major topic of speech recognition research.
We propose an acoustic model structure with multiple auxiliary output layers for cross-modal distillation.
We extend the proposed method to a hierarchical distillation method using LMs trained in different units.
arXiv Detail & Related papers (2021-10-20T08:42:10Z)
- MATE-KD: Masked Adversarial TExt, a Companion to Knowledge Distillation [9.91548921801095]
We present MATE-KD, a novel text-based adversarial training algorithm that improves the performance of knowledge distillation.
We evaluate our algorithm, using BERT-based models, on the GLUE benchmark and demonstrate that MATE-KD outperforms competitive adversarial learning and data augmentation baselines.
arXiv Detail & Related papers (2021-05-12T19:11:34Z)
- Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers [54.47911829539919]
We develop a novel top-down training method which can be viewed as an algorithm for searching for high-quality classifiers.
We tested this method on automatic speech recognition (ASR) tasks and language modelling tasks.
The proposed method consistently improves recurrent neural network ASR models on Wall Street Journal, self-attention ASR models on Switchboard, and AWD-LSTM language models on WikiText-2.
arXiv Detail & Related papers (2021-02-09T08:19:49Z)
- BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover's Distance [25.229624487344186]
High storage and computational costs prevent pre-trained language models from being effectively deployed on resource-constrained devices.
We propose a novel BERT distillation method based on many-to-many layer mapping.
Our model can learn from different teacher layers adaptively for various NLP tasks.
arXiv Detail & Related papers (2020-10-13T02:53:52Z)
- MetaDistiller: Network Self-Boosting via Meta-Learned Top-Down Distillation [153.56211546576978]
In this work, we propose that better soft targets with higher compatibility can be generated by using a label generator.
We can employ the meta-learning technique to optimize this label generator.
The experiments are conducted on two standard classification benchmarks, namely CIFAR-100 and ILSVRC2012.
arXiv Detail & Related papers (2020-08-27T13:04:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.