Oracle Teacher: Leveraging Target Information for Better Knowledge
Distillation of CTC Models
- URL: http://arxiv.org/abs/2111.03664v4
- Date: Fri, 11 Aug 2023 16:15:45 GMT
- Title: Oracle Teacher: Leveraging Target Information for Better Knowledge
Distillation of CTC Models
- Authors: Ji Won Yoon, Hyung Yong Kim, Hyeonseung Lee, Sunghwan Ahn, and Nam Soo
Kim
- Abstract summary: We introduce a new type of teacher model for connectionist temporal classification (CTC)-based sequence models, namely the Oracle Teacher.
Since the Oracle Teacher learns a more accurate CTC alignment by referring to the target information, it can provide the student with better guidance.
Based on the many-to-one mapping property of the CTC algorithm, we present a training strategy that effectively prevents the trivial solution in which the model simply copies the target.
- Score: 10.941519846908697
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Knowledge distillation (KD), best known as an effective method for model
compression, aims at transferring the knowledge of a bigger network (teacher)
to a much smaller network (student). Conventional KD methods usually employ a
teacher model trained in a supervised manner, where output labels are treated
only as targets. Extending this supervised scheme further, we introduce a new
type of teacher model for connectionist temporal classification (CTC)-based
sequence models, namely Oracle Teacher, that leverages both the source inputs
and the output labels as the teacher model's input. Since the Oracle Teacher
learns a more accurate CTC alignment by referring to the target information, it
can provide the student with more effective guidance. One potential risk of the
proposed approach is a trivial solution in which the model's output directly
copies the target input. Based on the many-to-one mapping property of the CTC algorithm,
we present a training strategy that can effectively prevent the trivial
solution and thus enables the use of both source and target inputs for model
training. Extensive experiments are conducted on two sequence learning tasks:
speech recognition and scene text recognition. From the experimental results,
we empirically show that the proposed model improves the student models across
these tasks while achieving a considerable speed-up in the teacher model's training
time.
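As a brief refresher on the many-to-one property the abstract refers to (standard CTC notation, not taken from the paper): the CTC collapse map first merges repeated labels and then removes blanks, so many frame-level paths correspond to the same target sequence.

```latex
% Standard CTC objective; \mathcal{B} merges repeated labels, then removes blanks "-".
% Many paths collapse to the same target, e.g. for y = (a, a, b):
%   \mathcal{B}(a, -, a, a, b, -) = (a, a, b)  and  \mathcal{B}(a, a, -, a, b, b) = (a, a, b).
\[
  P(\mathbf{y} \mid \mathbf{x})
    = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{y})} \; \prod_{t=1}^{T} p(\pi_t \mid \mathbf{x})
\]
```

The sketch below illustrates the overall teacher-student setup under stated assumptions: it is written in PyTorch style, the module and function names (OracleTeacher, Student, distill_step) are hypothetical, and the way the teacher consumes the labels (a pooled label embedding added to every encoder frame) is a simplification rather than the paper's architecture; the paper's specific strategy for avoiding the trivial copy solution is likewise not reproduced here.

```python
# Minimal sketch (not the authors' code) of distilling a CTC student from an
# "Oracle Teacher" that sees both the source inputs and the target labels.
# Module, function, and hyperparameter names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OracleTeacher(nn.Module):
    """Teacher that conditions on the input features AND the ground-truth labels."""
    def __init__(self, feat_dim, vocab_size, hidden=256):
        super().__init__()
        self.label_embed = nn.Embedding(vocab_size + 1, hidden)  # index 0 = CTC blank/pad
        self.feat_proj = nn.Linear(feat_dim, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size + 1)             # CTC posterior incl. blank

    def forward(self, feats, labels):
        # Pool the label embeddings into one utterance-level summary and add it to
        # every frame, so the teacher can "peek" at the transcript (simplification).
        target_summary = self.label_embed(labels).mean(dim=1, keepdim=True)
        h, _ = self.encoder(self.feat_proj(feats) + target_summary)
        return self.out(h)                                       # (B, T, V + 1) logits

class Student(nn.Module):
    """Smaller CTC model that only sees the source inputs."""
    def __init__(self, feat_dim, vocab_size, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size + 1)

    def forward(self, feats):
        h, _ = self.encoder(feats)
        return self.out(h)

def distill_step(teacher, student, feats, labels, feat_lens, label_lens,
                 tau=2.0, alpha=0.5):
    """One KD step: frame-level KL toward the teacher's posterior + CTC on hard labels."""
    with torch.no_grad():                        # teacher is pre-trained and frozen
        t_logits = teacher(feats, labels)
    s_logits = student(feats)

    # Soft targets: KL divergence between temperature-softened frame posteriors.
    kd = F.kl_div(F.log_softmax(s_logits / tau, dim=-1),
                  F.softmax(t_logits / tau, dim=-1),
                  reduction="batchmean") * tau * tau

    # Hard targets: standard CTC loss (expects log-probs shaped (T, B, V + 1)).
    log_probs = F.log_softmax(s_logits, dim=-1).transpose(0, 1)
    ctc = F.ctc_loss(log_probs, labels, feat_lens, label_lens, blank=0)

    return alpha * kd + (1 - alpha) * ctc
```

In this sketch the student minimizes a weighted sum of a temperature-softened, frame-level KL term against the frozen teacher's label-conditioned posteriors and the usual CTC loss on the hard targets.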
Related papers
- Faithful Label-free Knowledge Distillation [8.572967695281054]
This paper presents a label-free knowledge distillation approach called Teacher in the Middle (TinTeM).
It produces a more faithful student, which better replicates the behavior of the teacher network across a range of benchmarks testing model robustness, generalisability and out-of-distribution detection.
arXiv Detail & Related papers (2024-11-22T01:48:44Z)
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Lightweight Self-Knowledge Distillation with Multi-source Information Fusion [3.107478665474057]
Knowledge Distillation (KD) is a powerful technique for transferring knowledge between neural network models.
We propose a lightweight self-knowledge distillation (SKD) framework that utilizes multi-source information to construct a more informative teacher.
We validate the performance of the proposed DRG, DSR, and their combination through comprehensive experiments on various datasets and models.
arXiv Detail & Related papers (2023-05-16T05:46:31Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
- Knowledge Distillation with Representative Teacher Keys Based on Attention Mechanism for Image Classification Model Compression [1.503974529275767]
Knowledge distillation (KD) has been recognized as an effective method of model compression for reducing the number of model parameters.
Inspired by the attention mechanism, we propose a novel KD method called representative teacher key (RTK).
Our proposed RTK can effectively improve the classification accuracy of the state-of-the-art attention-based KD method.
arXiv Detail & Related papers (2022-06-26T05:08:50Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
- ALM-KD: Knowledge Distillation with noisy labels via adaptive loss mixing [25.49637460661711]
Knowledge distillation is a technique where the outputs of a pretrained model are used for training a student model in a supervised setting.
We tackle this problem via the use of an adaptive loss mixing scheme during KD.
We demonstrate performance gains obtained using our approach in the standard KD setting as well as in multi-teacher and self-distillation settings.
arXiv Detail & Related papers (2022-02-07T14:53:22Z)
- Distilling EEG Representations via Capsules for Affective Computing [14.67085109524245]
We propose a novel knowledge distillation pipeline to distill EEG representations via capsule-based architectures.
Our framework consistently enables student networks with different compression ratios to effectively learn from the teacher.
Our method achieves state-of-the-art results on one of the two datasets.
arXiv Detail & Related papers (2021-04-30T22:04:35Z)
- Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z)
- Distilling Knowledge from Graph Convolutional Networks [146.71503336770886]
Existing knowledge distillation methods focus on convolutional neural networks (CNNs).
We propose the first dedicated approach to distilling knowledge from a pre-trained graph convolutional network (GCN) model.
We show that our method achieves state-of-the-art knowledge distillation performance for GCN models.
arXiv Detail & Related papers (2020-03-23T18:23:11Z)