Knowledge distillation from language model to acoustic model: a
hierarchical multi-task learning approach
- URL: http://arxiv.org/abs/2110.10429v1
- Date: Wed, 20 Oct 2021 08:42:10 GMT
- Title: Knowledge distillation from language model to acoustic model: a
hierarchical multi-task learning approach
- Authors: Mun-Hak Lee, Joon-Hyuk Chang
- Abstract summary: Cross-modal knowledge distillation is a major topic of speech recognition research.
We propose an acoustic model structure with multiple auxiliary output layers for cross-modal distillation.
We extend the proposed method to a hierarchical distillation method using LMs trained in different units.
- Score: 12.74181185088531
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The remarkable performance of the pre-trained language model (LM) using
self-supervised learning has led to a major paradigm shift in the study of
natural language processing. In line with these changes, leveraging the
performance of speech recognition systems with massive deep learning-based LMs
is a major topic of speech recognition research. Among the various methods of
applying LMs to speech recognition systems, in this paper, we focus on a
cross-modal knowledge distillation method that transfers knowledge between two
types of deep neural networks with different modalities. We propose an acoustic
model structure with multiple auxiliary output layers for cross-modal
distillation and demonstrate that the proposed method effectively compensates
for the shortcomings of the existing label-interpolation-based distillation
method. In addition, we extend the proposed method to a hierarchical
distillation method using LMs trained in different units (senones, monophones,
and subwords) and reveal the effectiveness of the hierarchical distillation
method through an ablation study.
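The label-interpolation-based distillation that the proposed method builds on can be illustrated with a minimal sketch. This is not the authors' implementation; it is a plain NumPy toy with hypothetical values showing the basic idea: mix the one-hot target with the LM's soft labels and train the acoustic model with cross-entropy against the interpolated target.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def interpolated_target(hard_label, lm_probs, lam=0.3):
    """Label interpolation: mix a one-hot target with LM soft labels.

    lam is the interpolation weight on the LM distribution (a
    hyperparameter chosen here for illustration only).
    """
    num_classes = lm_probs.shape[0]
    one_hot = np.eye(num_classes)[hard_label]
    return (1.0 - lam) * one_hot + lam * lm_probs

def cross_entropy(am_logits, target):
    """Cross-entropy of the AM's output distribution against the target."""
    probs = softmax(am_logits)
    return -np.sum(target * np.log(probs + 1e-12))

# Toy example with 4 output units (e.g. subwords).
am_logits = np.array([2.0, 0.5, 0.1, -1.0])       # acoustic model logits
lm_probs = np.array([0.6, 0.2, 0.15, 0.05])       # soft labels from the LM
target = interpolated_target(hard_label=0, lm_probs=lm_probs, lam=0.3)
loss = cross_entropy(am_logits, target)
```

In the hierarchical variant described above, one would compute such a distillation loss at several auxiliary output layers, each against an LM trained on a different unit (senones, monophones, subwords).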
Related papers
- Keep Decoding Parallel with Effective Knowledge Distillation from
Language Models to End-to-end Speech Recognisers [19.812986973537143]
This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers.
Our method shows that language model (LM) information can be more effectively distilled into an ASR model using both the intermediate layers and the final layer.
Using our method, we achieve better recognition accuracy than with shallow fusion of an external LM, allowing us to maintain fast parallel decoding.
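The shallow fusion baseline mentioned above combines acoustic-model and external-LM scores at decoding time. A minimal sketch (toy logits, hypothetical fusion weight `beta`; not the paper's code):

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def shallow_fusion_step(am_logits, lm_logits, beta=0.3):
    """Pick the next token by adding the LM log-probabilities,
    scaled by beta, to the AM log-probabilities."""
    fused = log_softmax(am_logits) + beta * log_softmax(lm_logits)
    return int(np.argmax(fused))

# Toy vocabulary of 4 tokens: the AM slightly prefers token 0,
# but the external LM strongly prefers token 1.
am_logits = np.array([1.2, 1.0, 0.2, -0.5])
lm_logits = np.array([0.1, 2.0, -1.0, -1.0])
```

Because fusion adds an extra LM forward pass per decoding step, distilling the LM into the ASR model instead (as the paper does) preserves fast parallel decoding.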
arXiv Detail & Related papers (2024-01-22T05:46:11Z)
- Unsupervised Representations Improve Supervised Learning in Speech
Emotion Recognition [1.3812010983144798]
This study proposes an innovative approach that integrates self-supervised feature extraction with supervised classification for emotion recognition from small audio segments.
In the preprocessing step, we employed a self-supervised feature extractor, based on the Wav2Vec model, to capture acoustic features from audio data.
Then, the output feature maps of the preprocessing step are fed to a custom-designed Convolutional Neural Network (CNN)-based model to perform emotion classification.
arXiv Detail & Related papers (2023-09-22T08:54:06Z)
- Adaptive Knowledge Distillation between Text and Speech Pre-trained
Models [30.125690848883455]
This paper studies metric-based distillation to align the embedding spaces of text and speech with only a small amount of data, proposing Prior-informed Adaptive knowledge Distillation (PAD).
We evaluate on three spoken language understanding benchmarks and show that PAD is more effective in transferring linguistic knowledge than other metric-based distillation approaches.
arXiv Detail & Related papers (2023-03-07T02:31:57Z) - Ensemble knowledge distillation of self-supervised speech models [84.69577440755457]
Distilled self-supervised models have shown competitive performance and efficiency in recent years.
We performed Ensemble Knowledge Distillation (EKD) on various self-supervised speech models such as HuBERT, RobustHuBERT, and WavLM.
Our method improves the performance of the distilled models on four downstream speech processing tasks.
arXiv Detail & Related papers (2023-02-24T17:15:39Z)
- Knowledge Transfer from Pre-trained Language Models to Cif-based Speech
Recognizers via Hierarchical Distillation [22.733285434532068]
Large-scale pre-trained language models (PLMs) have shown great potential in natural language processing tasks.
We propose the hierarchical knowledge distillation (HKD) on the continuous integrate-and-fire (CIF) based ASR models.
Compared with the original CIF-based model, our method achieves 15% and 9% relative error rate reduction on the AISHELL-1 and LibriSpeech datasets.
arXiv Detail & Related papers (2023-01-30T15:44:55Z)
- Evaluation of Self-taught Learning-based Representations for Facial
Emotion Recognition [62.30451764345482]
This work describes different strategies to generate unsupervised representations obtained through the concept of self-taught learning for facial emotion recognition.
The idea is to create complementary representations promoting diversity by varying the autoencoders' initialization, architecture, and training data.
Experimental results on Jaffe and Cohn-Kanade datasets using a leave-one-subject-out protocol show that FER methods based on the proposed diverse representations compare favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2022-04-26T22:48:15Z)
- Audio-visual multi-channel speech separation, dereverberation and
recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning
for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- A Review of Sound Source Localization with Deep Learning Methods [71.18444724397486]
This article is a review on deep learning methods for single and multiple sound source localization.
We provide an exhaustive topography of the neural-based localization literature in this context.
Tables summarizing the literature review are provided at the end of the review for a quick search of methods with a given set of target characteristics.
arXiv Detail & Related papers (2021-09-08T07:25:39Z)
- Knowledge Distillation from BERT Transformer to Speech Transformer for
Intent Classification [66.62686601948455]
We exploit the scope of the transformer distillation method that is specifically designed for knowledge distillation from a transformer based language model to a transformer based speech model.
We achieve an intent classification accuracy of 99.10% and 88.79% for Fluent speech corpus and ATIS database, respectively.
arXiv Detail & Related papers (2021-08-05T13:08:13Z)
- Distilling Knowledge from Ensembles of Acoustic Models for Joint
CTC-Attention End-to-End Speech Recognition [14.3760318387958]
We propose an extension of multi-teacher distillation methods to joint CTC-attention end-to-end ASR systems.
The core intuition behind these methods is to integrate the error rate metric into the teacher selection rather than solely focusing on the observed losses.
We evaluate these strategies under a selection of training procedures on different datasets.
arXiv Detail & Related papers (2020-05-19T09:24:54Z)
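The idea of folding an error-rate metric into teacher selection, as in the multi-teacher distillation entry above, can be sketched in a few lines. This is an illustrative NumPy toy, not the paper's method: each teacher's soft labels are weighted by its (1 - error rate), normalized, so more accurate teachers contribute more to the distillation target.

```python
import numpy as np

def teacher_weights(error_rates):
    """Weight teachers by (1 - error rate), normalized to sum to 1,
    so lower-error teachers dominate the ensemble target."""
    scores = 1.0 - np.asarray(error_rates)
    return scores / scores.sum()

def ensemble_soft_target(teacher_probs, error_rates):
    """Error-rate-weighted average of the teachers' output distributions."""
    w = teacher_weights(error_rates)
    return np.einsum('t,tc->c', w, np.asarray(teacher_probs))

# Three hypothetical teachers over a 3-class toy output.
teacher_probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
]
error_rates = [0.10, 0.15, 0.30]  # e.g. WERs measured on a held-out set
target = ensemble_soft_target(teacher_probs, error_rates)
```

The student is then distilled against `target` instead of any single teacher's output, which is the loss-agnostic part of the intuition: teacher quality is judged by the recognition metric, not by the training loss alone.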
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.