Sequence-Level Knowledge Distillation for Class-Incremental End-to-End
Spoken Language Understanding
- URL: http://arxiv.org/abs/2305.13899v2
- Date: Mon, 31 Jul 2023 19:02:23 GMT
- Title: Sequence-Level Knowledge Distillation for Class-Incremental End-to-End
Spoken Language Understanding
- Authors: Umberto Cappellazzo, Muqiao Yang, Daniele Falavigna, Alessio Brutti
- Abstract summary: We tackle the problem of Spoken Language Understanding applied to a continual learning setting.
We propose three knowledge distillation approaches to mitigate forgetting for a sequence-to-sequence transformer model.
- Score: 10.187334662184314
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to learn new concepts sequentially is a major weakness for modern
neural networks, which hinders their use in non-stationary environments. Their
propensity to fit the current data distribution to the detriment of the past
acquired knowledge leads to the catastrophic forgetting issue. In this work we
tackle the problem of Spoken Language Understanding applied to a continual
learning setting. We first define a class-incremental scenario for the SLURP
dataset. Then, we propose three knowledge distillation (KD) approaches to
mitigate forgetting for a sequence-to-sequence transformer model: the first KD
method is applied to the encoder output (audio-KD), and the other two work on
the decoder output, either directly on the token-level (tok-KD) or on the
sequence-level (seq-KD) distributions. We show that the seq-KD substantially
improves all the performance metrics, and its combination with the audio-KD
further decreases the average WER and enhances the entity prediction metric.
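The three distillation losses can be sketched in numpy as follows. This is a minimal illustration, not the paper's implementation: the tensor shapes, the MSE/KL loss choices, and greedy decoding in place of beam search are all assumptions.

```python
import numpy as np

def audio_kd(student_enc, teacher_enc):
    """audio-KD: feature-level MSE between encoder output frames."""
    return float(np.mean((student_enc - teacher_enc) ** 2))

def softmax(x, tau=1.0):
    e = np.exp((x - x.max(axis=-1, keepdims=True)) / tau)
    return e / e.sum(axis=-1, keepdims=True)

def tok_kd(student_logits, teacher_logits, tau=1.0):
    """tok-KD: KL(teacher || student) averaged over decoder positions."""
    p, q = softmax(teacher_logits, tau), softmax(student_logits, tau)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

def seq_kd_targets(teacher_logits):
    """seq-KD: the student is trained with cross-entropy on the teacher's
    decoded hypothesis; greedy decoding stands in for beam search here."""
    return np.argmax(teacher_logits, axis=-1)

rng = np.random.default_rng(0)
enc_s, enc_t = rng.normal(size=(50, 8)), rng.normal(size=(50, 8))
log_s, log_t = rng.normal(size=(7, 30)), rng.normal(size=(7, 30))
assert audio_kd(enc_s, enc_t) >= 0.0
assert tok_kd(log_t, log_t) < 1e-9      # zero when distributions match
assert seq_kd_targets(log_t).shape == (7,)
```

The key practical difference is that seq-KD produces hard targets (a single teacher hypothesis per utterance), while audio-KD and tok-KD match continuous features and soft distributions respectively.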
Related papers
- Rethinking Decoupled Knowledge Distillation: A Predictive Distribution Perspective [9.10299144143817]
Decoupled Knowledge Distillation (DKD) re-emphasizes the importance of logit knowledge through advanced decoupling strategies.
We introduce an enhanced version, the Generalized Decoupled Knowledge Distillation (GDKD) loss.
We demonstrate GDKD's superior performance over both the original DKD and other leading knowledge distillation methods.
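The decoupling that DKD starts from can be sketched as splitting the logit KD loss into a target-class term and a non-target-class term. The summary does not specify GDKD's generalization, so the sketch below shows the original DKD decomposition, with alpha and beta as illustrative weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kl(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def dkd_loss(student_logits, teacher_logits, target, alpha=1.0, beta=8.0):
    ps, pt = softmax(student_logits), softmax(teacher_logits)
    # Target-class KD: binary (target vs. all other classes) distributions.
    bs = np.array([ps[target], 1.0 - ps[target]])
    bt = np.array([pt[target], 1.0 - pt[target]])
    tckd = kl(bt, bs)
    # Non-target KD: distribution over the remaining classes only.
    mask = np.arange(len(ps)) != target
    ns = ps[mask] / ps[mask].sum()
    nt = pt[mask] / pt[mask].sum()
    nckd = kl(nt, ns)
    return alpha * tckd + beta * nckd

rng = np.random.default_rng(1)
s, t = rng.normal(size=10), rng.normal(size=10)
assert dkd_loss(s, t, target=3) >= 0.0
assert dkd_loss(t, t, target=3) < 1e-9  # identical logits -> zero loss
```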
arXiv Detail & Related papers (2025-12-04T09:56:25Z)
- SEDEG: Sequential Enhancement of Decoder and Encoder's Generality for Class Incremental Learning with Small Memory [11.197556113382186]
In incremental learning, enhancing the generality of knowledge is crucial for adapting to dynamic data inputs.
SEDEG trains an ensembled encoder through feature boosting to learn generalized representations.
The next stage involves using knowledge distillation strategies to compress the ensembled encoder and develop a new, more generalized encoder.
arXiv Detail & Related papers (2025-08-18T13:55:59Z)
- EKPC: Elastic Knowledge Preservation and Compensation for Class-Incremental Learning [53.88000987041739]
Class-Incremental Learning (CIL) aims to enable AI models to continuously learn from sequentially arriving data of different classes over time.
We propose the Elastic Knowledge Preservation and Compensation (EKPC) method, integrating Importance-aware Parameter Regularization (IPR) and Trainable Semantic Drift Compensation (TSDC) for CIL.
arXiv Detail & Related papers (2025-06-14T05:19:58Z)
- DeepKD: A Deeply Decoupled and Denoised Knowledge Distillation Trainer [3.917354933232572]
DeepKD is a novel training framework that integrates dual-level decoupling with adaptive denoising.
We introduce a dynamic top-k mask (DTM) mechanism that gradually increases K from a small initial value to incorporate more non-target classes as training progresses.
Extensive experiments on CIFAR-100, ImageNet, and MS-COCO demonstrate DeepKD's effectiveness.
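A dynamic top-k mask of this kind can be sketched as below: the target class is always kept, and the number of retained non-target classes grows with training progress. The linear growth schedule is an assumption; the summary only states that K increases from a small initial value.

```python
import numpy as np

def dtm_mask(teacher_logits, target, step, total_steps, k_min=2):
    """Boolean mask over classes: target plus the top-k non-target classes
    by teacher logit, where k grows linearly with training progress."""
    n = len(teacher_logits)
    frac = min(step / total_steps, 1.0)
    k = int(round(k_min + (n - 1 - k_min) * frac))
    order = np.argsort(teacher_logits)[::-1]          # descending by logit
    non_target = [i for i in order if i != target][:k]
    mask = np.zeros(n, dtype=bool)
    mask[target] = True
    mask[non_target] = True
    return mask

logits = np.array([2.0, 0.5, 1.5, -1.0, 0.0, 3.0])
early = dtm_mask(logits, target=0, step=0, total_steps=100)
late = dtm_mask(logits, target=0, step=100, total_steps=100)
print(early.sum())  # 3: target + 2 highest-scoring non-target classes
print(late.sum())   # 6: all classes included by the end of training
```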
arXiv Detail & Related papers (2025-05-21T05:38:57Z)
- A Good Start Matters: Enhancing Continual Learning with Data-Driven Weight Initialization [15.8696301825572]
Continuously-trained deep neural networks (DNNs) must rapidly learn new concepts while preserving and utilizing prior knowledge.
Weights for newly encountered categories are typically initialized randomly, leading to high initial training loss (spikes) and instability.
Inspired by Neural Collapse (NC), we propose a weight initialization strategy to improve learning efficiency in CL.
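One common data-driven way to realize such an initialization is to set each new class's classifier row from its normalized mean feature, which is in the spirit of the Neural Collapse geometry. Using the class-mean direction is an assumption here; the summary does not detail the exact rule.

```python
import numpy as np

def init_new_class_weights(features_per_class):
    """features_per_class: list of (n_i, d) arrays, one per new class.
    Returns (num_new_classes, d) classifier rows, one unit-norm row per
    class, pointing along that class's mean feature direction."""
    rows = []
    for feats in features_per_class:
        mu = feats.mean(axis=0)
        rows.append(mu / (np.linalg.norm(mu) + 1e-12))
    return np.stack(rows)

rng = np.random.default_rng(2)
feats = [rng.normal(loc=c, size=(20, 16)) for c in range(3)]
W = init_new_class_weights(feats)
assert W.shape == (3, 16)
assert np.allclose(np.linalg.norm(W, axis=1), 1.0)
```

Compared with random initialization, a new class's initial logits are immediately highest on its own features, which is what avoids the loss spikes mentioned above.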
arXiv Detail & Related papers (2025-03-09T01:44:22Z)
- SLCA++: Unleash the Power of Sequential Fine-tuning for Continual Learning with Pre-training [68.7896349660824]
We present an in-depth analysis of the progressive overfitting problem from the lens of Seq FT.
Considering that overly fast representation learning and a biased classification layer constitute this particular problem, we introduce the advanced Slow Learner with Classifier Alignment (SLCA++) framework.
Our approach involves a Slow Learner that selectively reduces the learning rate of backbone parameters, and a Classifier Alignment that aligns the disjoint classification layers in a post-hoc fashion.
arXiv Detail & Related papers (2024-08-15T17:50:07Z)
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
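The token-level objective above can be illustrated as a KL loss computed on a clipped teacher distribution: probabilities below an adaptive threshold are zeroed and the remainder renormalized before the KL is taken. The thresholding rule (a fraction of the teacher's peak probability) is purely an assumption; the summary does not specify how the clipping adapts.

```python
import numpy as np

def clipped_kl(teacher_probs, student_probs, frac=0.1, eps=1e-12):
    """KL(clipped teacher || student), averaged over token positions."""
    # Adaptive threshold per position: a fraction of the teacher's peak.
    thresh = frac * teacher_probs.max(axis=-1, keepdims=True)
    clipped = np.where(teacher_probs >= thresh, teacher_probs, 0.0)
    clipped = clipped / clipped.sum(axis=-1, keepdims=True)
    return float(np.mean(np.sum(
        clipped * (np.log(clipped + eps) - np.log(student_probs + eps)),
        axis=-1)))

uniform = np.full((3, 8), 1.0 / 8)   # nothing falls below the threshold
print(clipped_kl(uniform, uniform) < 1e-9)  # True: identical distributions
```

The intended effect is to stop the student from chasing the teacher's noisy low-probability tail while still matching the informative head of the distribution.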
arXiv Detail & Related papers (2024-07-14T03:51:49Z)
- An Effective Mixture-Of-Experts Approach For Code-Switching Speech Recognition Leveraging Encoder Disentanglement [9.28943772676672]
The code-switching phenomenon remains a major obstacle that hinders automatic speech recognition.
We introduce a novel disentanglement loss to enable the lower-layer of the encoder to capture inter-lingual acoustic information.
We verify that our proposed method outperforms the prior-art methods using pretrained dual-encoders.
arXiv Detail & Related papers (2024-02-27T04:08:59Z)
- Fixed Random Classifier Rearrangement for Continual Learning [0.5439020425819]
In visual classification scenarios, neural networks inevitably forget the knowledge of old tasks after learning new ones.
We propose a continual learning algorithm named Fixed Random Classifier Rearrangement (FRCR).
arXiv Detail & Related papers (2024-02-23T09:43:58Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- DeCoR: Defy Knowledge Forgetting by Predicting Earlier Audio Codes [16.96483269023065]
Lifelong audio feature extraction involves learning new sound classes incrementally.
Optimizing the model only on new data, however, can lead to catastrophic forgetting of previously learned tasks.
This paper introduces a new approach to continual audio representation learning called DeCoR.
arXiv Detail & Related papers (2023-05-29T02:25:03Z)
- TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose a novel statistics-based approach, the Two-WIng NormliSation (TWINS) fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z)
- UNFUSED: UNsupervised Finetuning Using SElf supervised Distillation [53.06337011259031]
We introduce UnFuSeD, a novel approach to leverage self-supervised learning for audio classification.
We use the encoder to generate pseudo-labels for unsupervised fine-tuning before the actual fine-tuning step.
UnFuSeD achieves state-of-the-art results on the LAPE Benchmark, significantly outperforming all our baselines.
arXiv Detail & Related papers (2023-03-10T02:43:36Z)
- An Investigation of the Combination of Rehearsal and Knowledge Distillation in Continual Learning for Spoken Language Understanding [9.447108578893639]
We consider the joint use of rehearsal and knowledge distillation approaches for spoken language understanding under a class-incremental learning scenario.
We report on multiple KD combinations at different levels in the network, showing that combining feature-level and predictions-level KDs leads to the best results.
arXiv Detail & Related papers (2022-11-15T14:15:22Z)
- EvDistill: Asynchronous Events to End-task Learning via Bidirectional Reconstruction-guided Cross-modal Knowledge Distillation [61.33010904301476]
Event cameras sense per-pixel intensity changes and produce asynchronous event streams with high dynamic range and less motion blur.
We propose a novel approach, called EvDistill, to learn a student network on the unlabeled and unpaired event data.
We show that EvDistill achieves significantly better results than the prior works and KD with only events and APS frames.
arXiv Detail & Related papers (2021-11-24T08:48:16Z)
- Continual Learning with Node-Importance based Adaptive Group Sparse Regularization [30.23319528662881]
We propose a novel regularization-based continual learning method, dubbed Adaptive Group Sparsity based Continual Learning (AGS-CL).
Our method selectively employs the two penalties when learning each node based on its importance, which is adaptively updated after learning each new task.
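A node-wise group-sparsity penalty of this kind can be sketched as follows: each node's incoming weights form one group, unimportant nodes are pushed toward zero via a group-lasso term, and important nodes are penalized for drifting from their previous values. The exact combination of the two penalties below is a simplified assumption.

```python
import numpy as np

def ags_cl_penalty(W, W_prev, importance, mu=0.1, lam=0.1):
    """W, W_prev: (nodes, fan_in) weight matrices before/after a task;
    importance: (nodes,) per-node importance scores."""
    group_norm = np.linalg.norm(W, axis=1)           # sparsify unused nodes
    drift_norm = np.linalg.norm(W - W_prev, axis=1)  # freeze important ones
    unimportant = importance <= 0.0
    return float(mu * group_norm[unimportant].sum()
                 + lam * (importance * drift_norm).sum())

rng = np.random.default_rng(4)
W_prev = rng.normal(size=(4, 8))
imp = np.array([0.0, 0.0, 1.0, 2.0])
assert ags_cl_penalty(W_prev, W_prev, imp) > 0.0   # sparsity on unused nodes
assert ags_cl_penalty(np.zeros((4, 8)), np.zeros((4, 8)), imp) == 0.0
```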
arXiv Detail & Related papers (2020-03-30T18:21:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.