Knowledge distillation from language model to acoustic model: a
hierarchical multi-task learning approach
- URL: http://arxiv.org/abs/2110.10429v1
- Date: Wed, 20 Oct 2021 08:42:10 GMT
- Title: Knowledge distillation from language model to acoustic model: a
hierarchical multi-task learning approach
- Authors: Mun-Hak Lee, Joon-Hyuk Chang
- Abstract summary: Cross-modal knowledge distillation is a major topic of speech recognition research.
We propose an acoustic model structure with multiple auxiliary output layers for cross-modal distillation.
We extend the proposed method to a hierarchical distillation method using LMs trained in different units.
- Score: 12.74181185088531
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The remarkable performance of the pre-trained language model (LM) using
self-supervised learning has led to a major paradigm shift in the study of
natural language processing. In line with these changes, leveraging the
performance of speech recognition systems with massive deep learning-based LMs
is a major topic of speech recognition research. Among the various methods of
applying LMs to speech recognition systems, in this paper, we focus on a
cross-modal knowledge distillation method that transfers knowledge between two
types of deep neural networks with different modalities. We propose an acoustic
model structure with multiple auxiliary output layers for cross-modal
distillation and demonstrate that the proposed method effectively compensates
for the shortcomings of the existing label-interpolation-based distillation
method. In addition, we extend the proposed method to a hierarchical
distillation method using LMs trained in different units (senones, monophones,
and subwords) and reveal the effectiveness of the hierarchical distillation
method through an ablation study.
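The label-interpolation-based distillation that the proposed method builds on can be illustrated with a minimal sketch. This is not the authors' implementation; it is a plain NumPy toy with hypothetical values showing the basic idea: mix the one-hot target with the LM's soft labels and train the acoustic model with cross-entropy against the interpolated target.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def interpolated_target(hard_label, lm_probs, lam=0.3):
    """Label interpolation: mix a one-hot target with LM soft labels.

    lam is the interpolation weight on the LM distribution (a
    hyperparameter chosen here for illustration only).
    """
    num_classes = lm_probs.shape[0]
    one_hot = np.eye(num_classes)[hard_label]
    return (1.0 - lam) * one_hot + lam * lm_probs

def cross_entropy(am_logits, target):
    """Cross-entropy of the AM's output distribution against the target."""
    probs = softmax(am_logits)
    return -np.sum(target * np.log(probs + 1e-12))

# Toy example with 4 output units (e.g. subwords).
am_logits = np.array([2.0, 0.5, 0.1, -1.0])       # acoustic model logits
lm_probs = np.array([0.6, 0.2, 0.15, 0.05])       # soft labels from the LM
target = interpolated_target(hard_label=0, lm_probs=lm_probs, lam=0.3)
loss = cross_entropy(am_logits, target)
```

In the hierarchical variant described above, one would compute such a distillation loss at several auxiliary output layers, each against an LM trained on a different unit (senones, monophones, subwords).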
Related papers
- Keep Decoding Parallel with Effective Knowledge Distillation from
Language Models to End-to-end Speech Recognisers [19.812986973537143]
This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers.
Our method shows that language model (LM) information can be more effectively distilled into an ASR model using both the intermediate layers and the final layer.
Using our method, we achieve better recognition accuracy than with shallow fusion of an external LM, allowing us to maintain fast parallel decoding.
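The shallow fusion baseline mentioned above combines acoustic-model and external-LM scores at decoding time. A minimal sketch (toy logits, hypothetical fusion weight `beta`; not the paper's code):

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def shallow_fusion_step(am_logits, lm_logits, beta=0.3):
    """Pick the next token by adding the LM log-probabilities,
    scaled by beta, to the AM log-probabilities."""
    fused = log_softmax(am_logits) + beta * log_softmax(lm_logits)
    return int(np.argmax(fused))

# Toy vocabulary of 4 tokens: the AM slightly prefers token 0,
# but the external LM strongly prefers token 1.
am_logits = np.array([1.2, 1.0, 0.2, -0.5])
lm_logits = np.array([0.1, 2.0, -1.0, -1.0])
```

Because fusion adds an extra LM forward pass per decoding step, distilling the LM into the ASR model instead (as the paper does) preserves fast parallel decoding.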
arXiv Detail & Related papers (2024-01-22T05:46:11Z)
- Unsupervised Representations Improve Supervised Learning in Speech
Emotion Recognition [1.3812010983144798]
This study proposes an innovative approach that integrates self-supervised feature extraction with supervised classification for emotion recognition from small audio segments.
In the preprocessing step, we employed a self-supervised feature extractor, based on the Wav2Vec model, to capture acoustic features from audio data.
Then, the output feature maps of the preprocessing step are fed to a custom-designed Convolutional Neural Network (CNN)-based model to perform emotion classification.
arXiv Detail & Related papers (2023-09-22T08:54:06Z)
- Adaptive Knowledge Distillation between Text and Speech Pre-trained
Models [30.125690848883455]
This paper studies metric-based distillation to align the embedding spaces of text and speech with only a small amount of data, proposing Prior-informed Adaptive knowledge Distillation (PAD).
We evaluate on three spoken language understanding benchmarks and show that PAD is more effective in transferring linguistic knowledge than other metric-based distillation approaches.
arXiv Detail & Related papers (2023-03-07T02:31:57Z) - Ensemble knowledge distillation of self-supervised speech models [84.69577440755457]
Distilled self-supervised models have shown competitive performance and efficiency in recent years.
We performed Ensemble Knowledge Distillation (EKD) on various self-supervised speech models such as HuBERT, RobustHuBERT, and WavLM.
Our method improves the performance of the distilled models on four downstream speech processing tasks.
arXiv Detail & Related papers (2023-02-24T17:15:39Z)
- Knowledge Transfer from Pre-trained Language Models to Cif-based Speech
Recognizers via Hierarchical Distillation [22.733285434532068]
Large-scale pre-trained language models (PLMs) have shown great potential in natural language processing tasks.
We propose the hierarchical knowledge distillation (HKD) on the continuous integrate-and-fire (CIF) based ASR models.
Compared with the original CIF-based model, our method achieves 15% and 9% relative error rate reduction on the AISHELL-1 and LibriSpeech datasets.
arXiv Detail & Related papers (2023-01-30T15:44:55Z)
- Evaluation of Self-taught Learning-based Representations for Facial
Emotion Recognition [62.30451764345482]
This work describes different strategies to generate unsupervised representations obtained through the concept of self-taught learning for facial emotion recognition.
The idea is to create complementary representations promoting diversity by varying the autoencoders' initialization, architecture, and training data.
Experimental results on Jaffe and Cohn-Kanade datasets using a leave-one-subject-out protocol show that FER methods based on the proposed diverse representations compare favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2022-04-26T22:48:15Z)
- Audio-visual multi-channel speech separation, dereverberation and
recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning
for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- A Review of Sound Source Localization with Deep Learning Methods [71.18444724397486]
This article is a review on deep learning methods for single and multiple sound source localization.
We provide an exhaustive topography of the neural-based localization literature in this context.
Tables summarizing the literature review are provided at the end of the review for a quick search of methods with a given set of target characteristics.
arXiv Detail & Related papers (2021-09-08T07:25:39Z)
- Knowledge Distillation from BERT Transformer to Speech Transformer for
Intent Classification [66.62686601948455]
We exploit the scope of the transformer distillation method that is specifically designed for knowledge distillation from a transformer based language model to a transformer based speech model.
We achieve an intent classification accuracy of 99.10% and 88.79% for Fluent speech corpus and ATIS database, respectively.
arXiv Detail & Related papers (2021-08-05T13:08:13Z)
- Distilling Knowledge from Ensembles of Acoustic Models for Joint
CTC-Attention End-to-End Speech Recognition [14.3760318387958]
We propose an extension of multi-teacher distillation methods to joint CTC-attention end-to-end ASR systems.
The core intuition behind these methods is to integrate the error rate metric into the teacher selection rather than solely focusing on the observed losses.
We evaluate these strategies under a selection of training procedures on different datasets.
arXiv Detail & Related papers (2020-05-19T09:24:54Z)
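The idea of folding an error-rate metric into teacher selection, as in the multi-teacher distillation entry above, can be sketched in a few lines. This is an illustrative NumPy toy, not the paper's method: each teacher's soft labels are weighted by its (1 - error rate), normalized, so more accurate teachers contribute more to the distillation target.

```python
import numpy as np

def teacher_weights(error_rates):
    """Weight teachers by (1 - error rate), normalized to sum to 1,
    so lower-error teachers dominate the ensemble target."""
    scores = 1.0 - np.asarray(error_rates)
    return scores / scores.sum()

def ensemble_soft_target(teacher_probs, error_rates):
    """Error-rate-weighted average of the teachers' output distributions."""
    w = teacher_weights(error_rates)
    return np.einsum('t,tc->c', w, np.asarray(teacher_probs))

# Three hypothetical teachers over a 3-class toy output.
teacher_probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
]
error_rates = [0.10, 0.15, 0.30]  # e.g. WERs measured on a held-out set
target = ensemble_soft_target(teacher_probs, error_rates)
```

The student is then distilled against `target` instead of any single teacher's output, which is the loss-agnostic part of the intuition: teacher quality is judged by the recognition metric, not by the training loss alone.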
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.