Keep Decoding Parallel with Effective Knowledge Distillation from
Language Models to End-to-end Speech Recognisers
- URL: http://arxiv.org/abs/2401.11700v1
- Date: Mon, 22 Jan 2024 05:46:11 GMT
- Title: Keep Decoding Parallel with Effective Knowledge Distillation from
Language Models to End-to-end Speech Recognisers
- Authors: Michael Hentschel, Yuta Nishikawa, Tatsuya Komatsu, Yusuke Fujita
- Abstract summary: This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers.
Our method shows that language model (LM) information can be more effectively distilled into an ASR model using both the intermediate layers and the final layer.
Using our method, we achieve better recognition accuracy than with shallow fusion of an external LM, allowing us to maintain fast parallel decoding.
- Score: 19.812986973537143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study presents a novel approach for knowledge distillation (KD) from a
BERT teacher model to an automatic speech recognition (ASR) model using
intermediate layers. To distil the teacher's knowledge, we use an attention
decoder that learns from BERT's token probabilities. Our method shows that
language model (LM) information can be more effectively distilled into an ASR
model using both the intermediate layers and the final layer. By using the
intermediate layers as distillation targets, we can more effectively distil LM
knowledge into the lower network layers. Using our method, we achieve better
recognition accuracy than with shallow fusion of an external LM, allowing us to
maintain fast parallel decoding. Experiments on the LibriSpeech dataset
demonstrate the effectiveness of our approach in enhancing greedy decoding with
connectionist temporal classification (CTC).
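The abstract gives no implementation details, but the core idea it describes — a CTC-trained ASR model with attention-decoder branches at intermediate and final encoder layers, trained towards BERT's token probabilities — can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the module names (`DistilledCTCModel`, `attention_decoder`), the choice of distillation layers, and the loss weight are all placeholders invented for the example.

```python
# Minimal sketch (assumptions, not the authors' code): CTC training combined
# with knowledge distillation from BERT token probabilities, applied at
# intermediate and final encoder layers through an attention decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistilledCTCModel(nn.Module):
    def __init__(self, encoder, attention_decoder, d_model, vocab_size,
                 kd_layers=(6, 12), kd_weight=0.3):  # illustrative choices only
        super().__init__()
        self.encoder = encoder                      # assumed to return every layer's output
        self.attention_decoder = attention_decoder  # used only for distillation
        self.ctc_head = nn.Linear(d_model, vocab_size)
        self.kd_layers = kd_layers
        self.kd_weight = kd_weight

    def forward(self, speech, speech_lens, tokens, token_lens, bert_probs):
        # layer_outputs: list of (B, T, D) hidden states, one per encoder layer;
        # out_lens: encoder output lengths after any subsampling.
        layer_outputs, out_lens = self.encoder(speech, speech_lens)

        # Standard CTC branch on the final encoder layer; at inference time
        # greedy decoding uses only this branch, keeping decoding parallel.
        log_probs = F.log_softmax(self.ctc_head(layer_outputs[-1]), dim=-1)
        ctc_loss = F.ctc_loss(log_probs.transpose(0, 1), tokens,
                              out_lens, token_lens, blank=0, zero_infinity=True)

        # KD branches: the attention decoder predicts token distributions from
        # selected intermediate layers and the final layer, and is pulled
        # towards the BERT teacher's token probabilities with a KL divergence.
        kd_loss = 0.0
        for idx in self.kd_layers:
            dec_logits = self.attention_decoder(layer_outputs[idx - 1], tokens)
            kd_loss = kd_loss + F.kl_div(F.log_softmax(dec_logits, dim=-1),
                                         bert_probs, reduction="batchmean")

        return ctc_loss + self.kd_weight * kd_loss
```

Because the distillation branches are dropped at inference, decoding remains a single greedy pass over the frame-wise CTC posteriors; shallow fusion, the baseline mentioned in the abstract, instead adds an external LM score at every step of a sequential beam search.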
Related papers
- Make BERT-based Chinese Spelling Check Model Enhanced by Layerwise Attention and Gaussian Mixture Model [33.446533426654995]
We design a heterogeneous knowledge-infused framework to strengthen BERT-based CSC models.
We propose a novel form of n-gram-based layerwise self-attention to generate a multilayer representation.
Experimental results show that our proposed framework yields a stable performance boost over four strong baseline models.
arXiv Detail & Related papers (2023-12-27T16:11:07Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- LEAD: Liberal Feature-based Distillation for Dense Retrieval [67.48820723639601]
Knowledge distillation is often used to transfer knowledge from a strong teacher model to a relatively weak student model.
Traditional methods include response-based methods and feature-based methods.
In this paper, we propose a liberal feature-based distillation method (LEAD).
arXiv Detail & Related papers (2022-12-10T06:30:54Z)
- Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- Knowledge distillation from language model to acoustic model: a hierarchical multi-task learning approach [12.74181185088531]
Cross-modal knowledge distillation is a major topic of speech recognition research.
We propose an acoustic model structure with multiple auxiliary output layers for cross-modal distillation.
We extend the proposed method to a hierarchical distillation method using LMs trained in different units.
arXiv Detail & Related papers (2021-10-20T08:42:10Z)
- MATE-KD: Masked Adversarial TExt, a Companion to Knowledge Distillation [9.91548921801095]
We present MATE-KD, a novel text-based adversarial training algorithm that improves the performance of knowledge distillation.
We evaluate our algorithm, using BERT-based models, on the GLUE benchmark and demonstrate that MATE-KD outperforms competitive adversarial learning and data augmentation baselines.
arXiv Detail & Related papers (2021-05-12T19:11:34Z)
- Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers [54.47911829539919]
We develop a novel top-down training method which can be viewed as an algorithm for searching for high-quality classifiers.
We tested this method on automatic speech recognition (ASR) tasks and language modelling tasks.
The proposed method consistently improves recurrent neural network ASR models on Wall Street Journal, self-attention ASR models on Switchboard, and AWD-LSTM language models on WikiText-2.
arXiv Detail & Related papers (2021-02-09T08:19:49Z)
- BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover's Distance [25.229624487344186]
High storage and computational costs prevent pre-trained language models from being effectively deployed on resource-constrained devices.
We propose a novel BERT distillation method based on many-to-many layer mapping.
Our model can learn from different teacher layers adaptively for various NLP tasks.
arXiv Detail & Related papers (2020-10-13T02:53:52Z)
- MetaDistiller: Network Self-Boosting via Meta-Learned Top-Down Distillation [153.56211546576978]
In this work, we propose that better soft targets with higher compatibility can be generated by using a label generator.
We can employ the meta-learning technique to optimize this label generator.
The experiments are conducted on two standard classification benchmarks, namely CIFAR-100 and ILSVRC2012.
arXiv Detail & Related papers (2020-08-27T13:04:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.