uDistil-Whisper: Label-Free Data Filtering for Knowledge Distillation via Large-Scale Pseudo Labelling
- URL: http://arxiv.org/abs/2407.01257v2
- Date: Wed, 3 Jul 2024 09:54:08 GMT
- Title: uDistil-Whisper: Label-Free Data Filtering for Knowledge Distillation via Large-Scale Pseudo Labelling
- Authors: Abdul Waheed, Karima Kadaoui, Muhammad Abdul-Mageed
- Abstract summary: We show that it is possible to distill Whisper models into relatively small models without using any labeled data.
Our models are 25-50% more compute and memory efficient while maintaining performance equal to or better than the teacher model.
- Score: 16.655022975392992
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work on distilling Whisper's knowledge into small models using pseudo-labels shows promising performance while reducing the size by up to 50%. This results in small, efficient, and dedicated models. However, a critical step of distillation from pseudo-labels involves filtering high-quality predictions and using only those during training. This step requires ground truth to compare against and filter out bad examples, making the whole process supervised. In addition, the distillation process requires a large amount of data, thereby limiting the ability to distill models in low-resource settings. To address this challenge, we propose an unsupervised or label-free framework for distillation, thus eliminating the requirement for labeled data altogether. Through experimentation, we show that our best distilled models outperform the teacher model by 5-7 points in terms of WER. Additionally, our models are on par with or better than similar supervised data filtering setups. When we scale the data, our models significantly outperform all zero-shot and supervised models. We demonstrate that it is possible to distill large Whisper models into relatively small models without using any labeled data. Our distilled models are 25-50% more compute and memory efficient while maintaining performance equal to or better than the teacher model.
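The abstract does not spell out how pseudo-labels are filtered without ground truth. One plausible label-free heuristic, shown here purely as an illustrative sketch (the `agreement` metric, the two-decoding setup, and the 0.9 threshold are all assumptions, not the paper's method), is to keep only utterances where two independent teacher decodings agree:

```python
import difflib

def agreement(hyp_a: str, hyp_b: str) -> float:
    """Word-level similarity between two transcriptions of the same audio (0..1)."""
    return difflib.SequenceMatcher(None, hyp_a.split(), hyp_b.split()).ratio()

def filter_pseudo_labels(samples, threshold=0.9):
    """Keep samples whose two teacher decodings agree, using no ground truth.

    `samples` is a list of (audio_id, greedy_hypothesis, sampled_hypothesis)
    tuples; returns (audio_id, greedy_hypothesis) pairs for student training.
    """
    kept = []
    for audio_id, greedy, sampled in samples:
        if agreement(greedy, sampled) >= threshold:
            kept.append((audio_id, greedy))
    return kept

# Toy example: the second sample's decodings disagree, so it is dropped.
samples = [
    ("utt1", "the cat sat on the mat", "the cat sat on the mat"),
    ("utt2", "speech recognition is fun", "peach wreck ignition infant"),
]
print(filter_pseudo_labels(samples))  # keeps only utt1
```

The intuition is that decoding disagreement is a label-free proxy for transcription quality: where the teacher is unstable, its pseudo-label is likely noisy.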
Related papers
- Tiny models from tiny data: Textual and null-text inversion for few-shot distillation [11.80626524879555]
Few-shot image classification involves classifying images using very few training examples.
Recent vision foundation models show excellent few-shot transfer abilities, but are large and slow at inference.
We present a novel diffusion model inversion technique (TINT) combining the diversity of textual inversion with the specificity of null-text inversion.
arXiv Detail & Related papers (2024-06-05T11:01:42Z) - Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling [75.74809713084282]
Distil-Whisper is designed to be paired with Whisper for speculative decoding, yielding a 2 times speed-up.
Distil-Whisper is 5.8 times faster with 51% fewer parameters, while performing to within 1% WER on out-of-distribution test data.
To facilitate further research in this domain, we make our training code, inference code and models publicly accessible.
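The speculative-decoding pairing mentioned above works by letting the small distilled model draft tokens that the large teacher verifies in parallel. The following is a minimal greedy-verification sketch of that general idea, not Distil-Whisper's actual implementation; the toy integer-token models and the `k=4` draft length are illustrative assumptions:

```python
def speculative_decode(target_next, draft_next, prefix, n_tokens, k=4):
    """Greedy speculative decoding sketch: a small draft model proposes k
    tokens, the large target model verifies them and keeps the longest
    agreeing prefix. `target_next` / `draft_next` map a sequence to the
    next token."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        # Draft proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies: accept draft tokens while they match its own
        # greedy choice; on the first mismatch, substitute and restart.
        for t in proposal:
            if target_next(out) == t:
                out.append(t)
            else:
                out.append(target_next(out))  # correct and stop this round
                break
    return out[len(prefix):len(prefix) + n_tokens]

# Toy models over integer tokens: target counts by 1; draft agrees except after 3.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + (2 if seq[-1] == 3 else 1)
print(speculative_decode(target, draft, [0], 6))  # [1, 2, 3, 4, 5, 6]
```

Because the output is always what the target model would have produced greedily, quality is unchanged; the speed-up comes from verifying several draft tokens per target call.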
arXiv Detail & Related papers (2023-11-01T10:45:07Z) - Distill Gold from Massive Ores: Efficient Dataset Distillation via Critical Samples Selection [101.78275454476311]
We model the dataset distillation task within the context of information transport.
We introduce and validate a family of data utility estimators and optimal data selection methods to exploit the most valuable samples.
Our method consistently enhances the distillation algorithms, even on much larger-scale and more heterogeneous datasets.
arXiv Detail & Related papers (2023-05-28T06:53:41Z) - Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models.
We present three findings across four NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z) - Gradient-Free Structured Pruning with Unlabeled Data [57.999191898036706]
We propose a gradient-free structured pruning framework that uses only unlabeled data.
Up to 40% of the original FLOP count can be reduced with less than a 4% accuracy loss across all tasks considered.
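One way such gradient-free, label-free pruning can work is to rank structures by their activation statistics on unlabeled inputs and drop the least active ones. The sketch below is a generic illustration under that assumption (the activation-magnitude saliency and the two-layer MLP are not taken from the paper):

```python
import numpy as np

def prune_hidden_units(W1, b1, W2, X_unlabeled, keep_ratio=0.6):
    """Gradient-free structured pruning sketch for a 2-layer MLP.

    Ranks hidden units by mean activation magnitude on unlabeled inputs
    (no labels, no backward pass) and keeps the top `keep_ratio` fraction.
    Shapes: W1 (d_in, d_hidden), b1 (d_hidden,), W2 (d_hidden, d_out).
    """
    acts = np.maximum(X_unlabeled @ W1 + b1, 0.0)    # ReLU activations
    importance = np.abs(acts).mean(axis=0)           # per-unit saliency
    k = max(1, int(keep_ratio * W1.shape[1]))
    keep = np.sort(np.argsort(importance)[::-1][:k]) # top-k unit indices
    return W1[:, keep], b1[keep], W2[keep, :]

rng = np.random.default_rng(0)
W1, b1, W2 = rng.normal(size=(8, 10)), rng.normal(size=10), rng.normal(size=(10, 3))
X = rng.normal(size=(32, 8))
W1p, b1p, W2p = prune_hidden_units(W1, b1, W2, X)
print(W1p.shape, b1p.shape, W2p.shape)  # (8, 6) (6,) (6, 3)
```

Pruning whole hidden units (rather than individual weights) keeps the remaining matrices dense, which is why structured pruning translates directly into FLOP savings.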
arXiv Detail & Related papers (2023-03-07T19:12:31Z) - Structured Pruning Learns Compact and Accurate Models [28.54826400747667]
We propose a task-specific structured pruning method CoFi (Coarse- and Fine-grained Pruning)
CoFi delivers highly parallelizable subnetworks and matches the distillation methods in both accuracy and latency.
Our experiments on GLUE and SQuAD datasets show that CoFi yields models with over 10x speedups with a small accuracy drop.
arXiv Detail & Related papers (2022-04-01T13:09:56Z) - Beyond Self-Supervision: A Simple Yet Effective Network Distillation Alternative to Improve Backbones [40.33419553042038]
We propose to improve existing baseline networks via knowledge distillation from off-the-shelf pre-trained big powerful models.
Our solution performs distillation by only driving prediction of the student model consistent with that of the teacher model.
We empirically find that such simple distillation settings are extremely effective; for example, the top-1 accuracy of MobileNetV3-large and ResNet50-D on the ImageNet-1k validation set can be significantly improved.
arXiv Detail & Related papers (2021-03-10T09:32:44Z) - Pre-trained Summarization Distillation [121.14806854092672]
Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowledge distillation.
Alternatively, machine translation practitioners distill using pseudo-labeling, where a small model is trained on the translations of a larger model.
A third, simpler approach is to 'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning.
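The shrink step of SFT can be illustrated by initializing a shallower student from an evenly spaced subset of the teacher's layers. This is a hedged sketch of one common layer-selection scheme, not necessarily the exact scheme used in the cited paper:

```python
def shrink_layers(teacher_layers, student_depth):
    """Shrink-and-fine-tune (SFT) sketch: initialize a student by copying an
    evenly spaced subset of the teacher's layers; fine-tuning follows.
    `teacher_layers` is any list of layer objects (plain strings here)."""
    n = len(teacher_layers)
    # Spread `student_depth` indices evenly across the teacher's stack,
    # always including the first and last layer.
    idx = [round(i * (n - 1) / (student_depth - 1)) for i in range(student_depth)]
    return [teacher_layers[i] for i in idx]

teacher = [f"layer{i}" for i in range(12)]  # 12-layer teacher
print(shrink_layers(teacher, 4))  # ['layer0', 'layer4', 'layer7', 'layer11']
```

Because the copied layers start from trained teacher weights rather than a random initialization, the subsequent fine-tuning converges quickly, with no distillation loss needed.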
arXiv Detail & Related papers (2020-10-24T23:15:43Z) - New Properties of the Data Distillation Method When Working With Tabular Data [77.34726150561087]
Data distillation is the problem of reducing the volume of training data while keeping only the necessary information.
We show that the model trained on distilled samples can outperform the model trained on the original dataset.
arXiv Detail & Related papers (2020-10-19T20:27:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.