Distilling Knowledge from Ensembles of Acoustic Models for Joint
CTC-Attention End-to-End Speech Recognition
- URL: http://arxiv.org/abs/2005.09310v3
- Date: Sun, 4 Jul 2021 02:15:21 GMT
- Title: Distilling Knowledge from Ensembles of Acoustic Models for Joint
CTC-Attention End-to-End Speech Recognition
- Authors: Yan Gao, Titouan Parcollet, Nicholas Lane
- Abstract summary: We propose an extension of multi-teacher distillation methods to joint CTC-attention end-to-end ASR systems.
The core intuition behind them is to integrate the error rate metric into the teacher selection rather than solely focusing on the observed losses.
We evaluate these strategies under a selection of training procedures on different datasets.
- Score: 14.3760318387958
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation has been widely used to compress existing deep
learning models while preserving the performance on a wide range of
applications. In the specific context of Automatic Speech Recognition (ASR),
distillation from ensembles of acoustic models has recently shown promising
results in increasing recognition performance. In this paper, we propose an
extension of multi-teacher distillation methods to joint CTC-attention
end-to-end ASR systems. We also introduce three novel distillation strategies.
The core intuition behind them is to integrate the error rate metric into the
teacher selection rather than solely focusing on the observed losses. In this
way, we directly distill and optimize the student toward the relevant metric
for speech recognition. We evaluate these strategies under a selection of
training procedures on different datasets (TIMIT, Librispeech, Common Voice)
and various languages (English, French, Italian). In particular,
state-of-the-art error rates are reported on the Common Voice French, Italian
and TIMIT datasets.
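The abstract does not give implementation details, but its core intuition (per-utterance teacher selection driven by error rate rather than by the observed loss) can be illustrated with a short PyTorch-style sketch. Everything below is an assumption made for illustration: the function names, tensor shapes, hard argmin selection, CTC weight and temperature are not taken from the paper.

```python
# Hypothetical sketch of error-rate-aware multi-teacher distillation for a
# joint CTC-attention student. Shapes, names and hyperparameters are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F

def select_teachers_by_wer(teacher_wers: torch.Tensor) -> torch.Tensor:
    """Per utterance, pick the teacher with the lowest word error rate
    instead of the one with the lowest training loss."""
    # teacher_wers: (num_teachers, batch) per-utterance WERs
    return torch.argmin(teacher_wers, dim=0)  # (batch,)

def joint_distillation_loss(student_ctc_logits, student_att_logits,
                            teacher_ctc_logits, teacher_att_logits,
                            teacher_wers, ctc_weight=0.3, temperature=2.0):
    """KL distillation on both the CTC and attention branches, with soft
    targets taken from the per-utterance best-WER teacher."""
    best = select_teachers_by_wer(teacher_wers)    # (batch,)
    idx = torch.arange(best.size(0))
    t_ctc = teacher_ctc_logits[best, idx]          # (batch, T, vocab)
    t_att = teacher_att_logits[best, idx]          # (batch, U, vocab)

    def kd(student_logits, teacher_logits):
        s = F.log_softmax(student_logits / temperature, dim=-1)
        t = F.softmax(teacher_logits / temperature, dim=-1)
        return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

    return ctc_weight * kd(student_ctc_logits, t_ctc) \
        + (1.0 - ctc_weight) * kd(student_att_logits, t_att)
```

The hard argmin could equally be replaced by soft per-teacher weights (for example a softmax over negative WERs), which is closer in spirit to multi-teacher weighting; the paper's three specific strategies are not reproduced here.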
Related papers
- Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced to filter and weight teacher representations so that only task-relevant representations are distilled.
Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z)
- Efficient Verified Machine Unlearning For Distillation [6.363158395541767]
PURGE (Partitioned Unlearning with Retraining Guarantee for Ensembles) is a novel framework integrating verified unlearning with distillation.
We provide both theoretical analysis, quantifying significant speed-ups in the unlearning process, and empirical validation on multiple datasets.
arXiv Detail & Related papers (2025-03-28T15:38:07Z)
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
arXiv Detail & Related papers (2024-07-14T03:51:49Z)
- Improving Self-supervised Pre-training using Accent-Specific Codebooks [48.409296549372414]
An accent-aware adaptation technique for self-supervised learning is proposed.
On the Mozilla Common Voice dataset, our proposed approach outperforms all other accent-adaptation approaches.
arXiv Detail & Related papers (2024-07-04T08:33:52Z)
- Speech Emotion Recognition with Distilled Prosodic and Linguistic Affect Representations [23.4909421082857]
EmoDistill is a novel framework to learn strong linguistic and prosodic representations of emotion from speech.
Our method distills information at both embedding and logit levels from a pair of pre-trained Prosodic and Linguistic teachers (a generic sketch of this two-teacher, two-level pattern appears after this list).
Experiments on the IEMOCAP benchmark demonstrate that our method outperforms other unimodal and multimodal techniques by a considerable margin.
arXiv Detail & Related papers (2023-09-09T17:30:35Z)
- Adaptive Knowledge Distillation between Text and Speech Pre-trained Models [30.125690848883455]
This paper studies metric-based distillation to align the embedding spaces of text and speech with only a small amount of data.
Prior-informed Adaptive knowledge Distillation (PAD) is proposed and evaluated on three spoken language understanding benchmarks, where it proves more effective in transferring linguistic knowledge than other metric-based distillation approaches.
arXiv Detail & Related papers (2023-03-07T02:31:57Z)
- Ensemble knowledge distillation of self-supervised speech models [84.69577440755457]
Distilled self-supervised models have shown competitive performance and efficiency in recent years.
We performed Ensemble Knowledge Distillation (EKD) on various self-supervised speech models such as HuBERT, RobustHuBERT, and WavLM.
Our method improves the performance of the distilled models on four downstream speech processing tasks.
arXiv Detail & Related papers (2023-02-24T17:15:39Z)
- Knowledge Transfer from Pre-trained Language Models to Cif-based Speech Recognizers via Hierarchical Distillation [22.733285434532068]
Large-scale pre-trained language models (PLMs) have shown great potential in natural language processing tasks.
We propose hierarchical knowledge distillation (HKD) for continuous integrate-and-fire (CIF) based ASR models.
Compared with the original CIF-based model, our method achieves 15% and 9% relative error rate reduction on the AISHELL-1 and LibriSpeech datasets.
arXiv Detail & Related papers (2023-01-30T15:44:55Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Knowledge distillation from language model to acoustic model: a hierarchical multi-task learning approach [12.74181185088531]
Cross-modal knowledge distillation is a major topic of speech recognition research.
We propose an acoustic model structure with multiple auxiliary output layers for cross-modal distillation.
We extend the proposed method to a hierarchical distillation method using LMs trained in different units.
arXiv Detail & Related papers (2021-10-20T08:42:10Z)
- Knowledge Distillation from BERT Transformer to Speech Transformer for Intent Classification [66.62686601948455]
We exploit the scope of the transformer distillation method, which is specifically designed for knowledge distillation from a transformer-based language model to a transformer-based speech model.
We achieve an intent classification accuracy of 99.10% and 88.79% for Fluent speech corpus and ATIS database, respectively.
arXiv Detail & Related papers (2021-08-05T13:08:13Z)
- Leveraging Acoustic and Linguistic Embeddings from Pretrained speech and language Models for Intent Classification [81.80311855996584]
We propose a novel intent classification framework that employs acoustic features extracted from a pretrained speech recognition system and linguistic features learned from a pretrained language model.
We achieve 90.86% and 99.07% accuracy on ATIS and Fluent speech corpus, respectively.
arXiv Detail & Related papers (2021-02-15T07:20:06Z)
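The EmoDistill entry above mentions distillation at both the embedding and logit levels from a pair of teachers. The sketch below shows only that generic two-teacher, two-level pattern; the projection layers, loss weights, temperature and the averaging of the two teachers' logits are assumptions for illustration and are not taken from the paper.

```python
# Hypothetical sketch of two-teacher distillation at the embedding and logit
# levels. Projection sizes, weights and names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTeacherDistillLoss(nn.Module):
    def __init__(self, student_dim, prosodic_dim, linguistic_dim,
                 emb_weight=1.0, logit_weight=1.0, temperature=2.0):
        super().__init__()
        # Project the student embedding into each teacher's embedding space.
        self.to_prosodic = nn.Linear(student_dim, prosodic_dim)
        self.to_linguistic = nn.Linear(student_dim, linguistic_dim)
        self.emb_weight = emb_weight
        self.logit_weight = logit_weight
        self.temperature = temperature

    def forward(self, student_emb, student_logits,
                prosodic_emb, linguistic_emb,
                prosodic_logits, linguistic_logits):
        # Embedding-level distillation: match each teacher's representation.
        emb_loss = (F.mse_loss(self.to_prosodic(student_emb), prosodic_emb)
                    + F.mse_loss(self.to_linguistic(student_emb), linguistic_emb))
        # Logit-level distillation: KL to the averaged teacher distribution.
        t = self.temperature
        teacher_prob = 0.5 * (F.softmax(prosodic_logits / t, dim=-1)
                              + F.softmax(linguistic_logits / t, dim=-1))
        logit_loss = F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                              teacher_prob, reduction="batchmean") * t * t
        return self.emb_weight * emb_loss + self.logit_weight * logit_loss
```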