To Distill or Not to Distill? On the Robustness of Robust Knowledge Distillation
- URL: http://arxiv.org/abs/2406.04512v1
- Date: Thu, 6 Jun 2024 21:11:53 GMT
- Title: To Distill or Not to Distill? On the Robustness of Robust Knowledge Distillation
- Authors: Abdul Waheed, Karima Kadaoui, Muhammad Abdul-Mageed,
- Abstract summary: Current multilingual ASR models are compute-intensive and lack proper comprehensive evaluations.
We distill knowledge from large teacher models into smaller student variants that are more efficient.
Our best-distilled model's overall performance ($45.0$% WER) surpasses that of a SoTA model twice its size.
- Score: 16.655022975392992
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Arabic is known to present unique challenges for Automatic Speech Recognition (ASR). On one hand, its rich linguistic diversity and wide range of dialects complicate the development of robust, inclusive models. On the other, current multilingual ASR models are compute-intensive and lack proper comprehensive evaluations. In light of these challenges, we distill knowledge from large teacher models into smaller student variants that are more efficient. We also introduce a novel human-annotated dataset covering five under-represented Arabic dialects for evaluation. We further evaluate both our models and existing SoTA multilingual models on both standard available benchmarks and our new dialectal data. Our best-distilled model's overall performance ($45.0$\% WER) surpasses that of a SoTA model twice its size (SeamlessM4T-large-v2, WER=$47.0$\%) and its teacher model (Whisper-large-v2, WER=$55.1$\%), and its average performance on our new dialectal data ($56.9$\% WER) outperforms all other models. To gain more insight into the poor performance of these models on dialectal data, we conduct an error analysis and report the main types of errors the different models tend to make. The GitHub repository for the project is available at \url{https://github.com/UBC-NLP/distill-whisper-ar}.
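The abstract reports distilling Whisper-large-v2 teachers into smaller students and compares models by WER, but does not spell out the training objective. Below is a minimal, hedged sketch of one common distillation recipe (a softened KL term from the teacher plus cross-entropy on the reference transcript) together with a WER check via the `jiwer` package; the loss weighting, temperature, and use of `jiwer` are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of soft-target knowledge distillation for a seq2seq ASR student.
# alpha, temperature, and the jiwer-based WER check are illustrative assumptions.
import torch.nn.functional as F
import jiwer  # pip install jiwer

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, temperature=2.0, pad_id=-100):
    """Softened KL from the teacher plus cross-entropy on the reference transcript."""
    vocab = student_logits.size(-1)
    s = student_logits.view(-1, vocab)
    t = teacher_logits.view(-1, vocab)
    # Soften both distributions; the T^2 factor keeps gradients comparable across T.
    kl = F.kl_div(F.log_softmax(s / temperature, dim=-1),
                  F.softmax(t / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    # Hard targets: cross-entropy against the reference transcript tokens.
    ce = F.cross_entropy(s, labels.view(-1), ignore_index=pad_id)
    return alpha * kl + (1.0 - alpha) * ce

# Word error rate, the metric quoted above (lower is better).
print(jiwer.wer("the reference transcript", "the recognised transcript"))
```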
Related papers
- FairPIVARA: Reducing and Assessing Biases in CLIP-Based Multimodal Models [5.748694060126043]
We evaluate four different types of discriminatory practices within visual-language models.
We introduce FairPIVARA, a method to reduce them by removing the most affected dimensions of feature embeddings.
The application of FairPIVARA has led to a significant reduction of up to 98% in observed biases.
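As a rough illustration of the dimension-removal idea summarized above, the sketch below scores each embedding dimension by the gap between two groups' mean features and zeroes out the top-k most affected dimensions; the scoring rule, k, and re-normalization are assumptions for illustration, not FairPIVARA's exact procedure.

```python
# Rough sketch of debiasing by removing the most group-separating embedding
# dimensions; the scoring rule and k are assumptions, not FairPIVARA's exact method.
import numpy as np

def remove_biased_dimensions(group_a, group_b, embeddings, k=16):
    """group_a, group_b: (n, d) feature arrays for two groups; embeddings: (m, d)."""
    gap = np.abs(group_a.mean(axis=0) - group_b.mean(axis=0))   # per-dimension gap
    worst = np.argsort(gap)[-k:]                                # k most affected dims
    debiased = embeddings.copy()
    debiased[:, worst] = 0.0                                    # drop their contribution
    # Re-normalize so cosine similarities remain comparable after removal.
    debiased /= np.linalg.norm(debiased, axis=1, keepdims=True) + 1e-12
    return debiased, worst
```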
arXiv Detail & Related papers (2024-09-28T22:49:22Z) - ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets [106.7760874400261]
This paper presents ML-SUPERB 2.0, a new benchmark for evaluating pre-trained SSL and supervised speech models.
We find performance improvements over the setup of ML-SUPERB, but performance depends on the downstream model design.
Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches.
arXiv Detail & Related papers (2024-06-12T21:01:26Z) - Multilingual E5 Text Embeddings: A Technical Report [63.503320030117145]
Three embedding models of different sizes are provided, offering a balance between inference efficiency and embedding quality.
We introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes.
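For context, the snippet below shows a typical way to encode text with a multilingual E5 checkpoint via Hugging Face `transformers`, using the E5 convention of `query:`/`passage:` prefixes with mean pooling; the specific checkpoint name is an assumption, and the instruction-tuned variant would be used analogously.

```python
# Hedged usage sketch for a multilingual E5 embedding model; the checkpoint name
# and the query/passage prefixes follow the E5 convention but are assumed here.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-base")

texts = ["query: how is hummus made",
         "passage: Hummus is made from cooked, mashed chickpeas."]
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state           # (batch, tokens, dim)
mask = batch["attention_mask"].unsqueeze(-1).float()
emb = (hidden * mask).sum(1) / mask.sum(1)              # mean pooling over real tokens
emb = torch.nn.functional.normalize(emb, dim=-1)        # unit length for cosine similarity
print(float(emb[0] @ emb[1]))                           # query-passage relevance score
```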
arXiv Detail & Related papers (2024-02-08T13:47:50Z) - ChatGPT for Arabic Grammatical Error Correction [5.945320097465418]
Large language models (LLMs) fine-tuned to follow human instruction have exhibited significant capabilities in English NLP tasks.
In this paper, we delve into the abilities of instruction fine-tuned LLMs in Arabic GEC, a task made complex by Arabic's rich morphology.
We find that instruction fine-tuned models, regardless of their size, significantly underperform fully fine-tuned models of much smaller size.
arXiv Detail & Related papers (2023-08-08T18:00:39Z) - mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
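One plausible building block for this kind of pipeline, sketched below under stated assumptions, is scoring a candidate summary's consistency with its source using a multilingual NLI model and taking the entailment probability as the score; the checkpoint and the entailment-as-score choice are assumptions, not necessarily mFACE's metric.

```python
# Hedged sketch: NLI-based factual consistency scoring for a (source, summary) pair.
# The checkpoint and the use of entailment probability are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "joeddav/xlm-roberta-large-xnli"   # assumed multilingual NLI checkpoint
tok = AutoTokenizer.from_pretrained(name)
nli = AutoModelForSequenceClassification.from_pretrained(name)

def consistency_score(source_doc: str, summary: str) -> float:
    """Probability that the source entails the summary, used as a consistency proxy."""
    batch = tok(source_doc, summary, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = nli(**batch).logits.softmax(dim=-1).squeeze(0)
    # Look up the entailment label index from the model config (label names vary).
    entail = next(i for i, lab in nli.config.id2label.items() if "entail" in lab.lower())
    return float(probs[entail])
```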
arXiv Detail & Related papers (2022-12-20T19:52:41Z) - Parameter and Data Efficient Continual Pre-training for Robustness to Dialectal Variance in Arabic [9.004920233490642]
We show that multilingual-BERT (mBERT) incrementally pretrained on Arabic monolingual data takes less training time and yields comparable accuracy when compared to our custom monolingual Arabic model.
We then explore two continual pre-training methods: (1) continual fine-tuning on small amounts of dialectal data and (2) training on parallel Arabic-English data with a Translation Language Modeling (TLM) loss.
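The TLM objective in (2) is masked language modeling over a concatenated bilingual sentence pair, so the model can attend across languages when recovering masked tokens. A minimal sketch with mBERT follows; the masking rate and the single-step usage are illustrative assumptions.

```python
# Minimal TLM step: mask tokens in a concatenated Arabic-English pair and train
# mBERT to recover them; the masking rate is an illustrative assumption.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
collator = DataCollatorForLanguageModeling(tok, mlm_probability=0.15)

arabic, english = "أهلاً بالعالم", "Hello world"
pair = tok(arabic, english, truncation=True)      # [CLS] arabic [SEP] english [SEP]
batch = collator([pair])                          # randomly masks 15% of the tokens
loss = model(**batch).loss                        # cross-entropy on masked positions only
loss.backward()                                   # one TLM step (optimizer omitted)
```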
arXiv Detail & Related papers (2022-11-08T02:51:57Z) - Distilling a Pretrained Language Model to a Multilingual ASR Model [3.4012007729454816]
We distill the rich knowledge embedded inside a well-trained teacher text model to the student speech model.
We show the superiority of our method on 20 low-resource languages of the CommonVoice dataset with less than 100 hours of speech data.
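The summary describes transferring knowledge from a text teacher into a speech student across modalities. A hedged sketch of one generic way to do this, aligning a pooled speech representation with the frozen text model's representation of the transcript through a learned projection, is given below; the pooling, projection, and cosine loss are assumptions, not the paper's exact method.

```python
# Generic cross-modal distillation sketch: pull the student speech encoder's pooled
# utterance representation toward the frozen text teacher's representation of the
# transcript. HF-style encoders exposing `last_hidden_state` are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalDistiller(nn.Module):
    def __init__(self, speech_encoder, text_encoder, speech_dim, text_dim):
        super().__init__()
        self.speech_encoder = speech_encoder            # trainable student
        self.text_encoder = text_encoder.eval()         # frozen teacher
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)
        self.proj = nn.Linear(speech_dim, text_dim)     # student space -> teacher space

    def forward(self, speech_inputs, text_inputs):
        s = self.speech_encoder(**speech_inputs).last_hidden_state.mean(dim=1)
        with torch.no_grad():
            t = self.text_encoder(**text_inputs).last_hidden_state.mean(dim=1)
        # 1 - cosine similarity: minimized when the two representations align.
        return (1.0 - F.cosine_similarity(self.proj(s), t, dim=-1)).mean()
```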
arXiv Detail & Related papers (2022-06-25T12:36:11Z) - PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM.
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z) - Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
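The core self-training loop behind approaches like SFLM can be summarized as: pseudo-label unlabeled texts with the current few-shot model, keep only confident predictions, and retrain on the enlarged set. The sketch below shows that loop in the abstract; the `classify` callback and the 0.9 confidence threshold are hypothetical placeholders, and SFLM's prompt-based objective is not reproduced here.

```python
# Generic self-training round: keep only confident pseudo-labels and grow the
# training pool. `classify` and the 0.9 threshold are hypothetical placeholders.
from typing import Callable, List, Tuple

def self_training_round(
    classify: Callable[[str], Tuple[str, float]],   # text -> (predicted label, confidence)
    labeled: List[Tuple[str, str]],                 # (text, gold label) few-shot seed set
    unlabeled: List[str],
    threshold: float = 0.9,
) -> List[Tuple[str, str]]:
    pseudo: List[Tuple[str, str]] = []
    for text in unlabeled:
        label, confidence = classify(text)
        if confidence >= threshold:                 # trust only confident predictions
            pseudo.append((text, label))
    return labeled + pseudo                         # fine-tune again on the union
```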
arXiv Detail & Related papers (2021-10-04T08:51:36Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero initial training examples, improving the models as more data is collected.
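A simple concrete instance of the character-language-model side of this comparison is shown below: train a character n-gram model on whatever target-language data is available and rank candidate corrections (for example, retrieved from a knowledge base) by their log-probability. The n-gram order and add-one smoothing are illustrative choices, not the paper's models.

```python
# Character n-gram language model for ranking spelling-correction candidates.
# Order n=3 and add-one smoothing are illustrative choices, not the paper's models.
import math
from collections import Counter

def train_char_lm(words, n=3):
    counts, contexts = Counter(), Counter()
    for w in words:
        padded = "^" * (n - 1) + w + "$"
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
            contexts[padded[i:i + n - 1]] += 1
    return counts, contexts, n

def log_prob(word, lm):
    counts, contexts, n = lm
    padded = "^" * (n - 1) + word + "$"
    # Add-one smoothing over an assumed alphabet of ~256 symbols.
    return sum(math.log((counts[padded[i:i + n]] + 1) /
                        (contexts[padded[i:i + n - 1]] + 256))
               for i in range(len(padded) - n + 1))

def best_correction(candidates, lm):
    """Rank candidates (e.g., retrieved from a knowledge base) by LM score."""
    return max(candidates, key=lambda c: log_prob(c, lm))
```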
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Collective Wisdom: Improving Low-resource Neural Machine Translation using Adaptive Knowledge Distillation [42.38435539241788]
Scarcity of parallel sentence-pairs poses a significant hurdle for training high-quality Neural Machine Translation (NMT) models in bilingually low-resource scenarios.
We propose an adaptive knowledge distillation approach to dynamically adjust the contribution of the teacher models during the distillation process.
Experiments on transferring from a collection of six language pairs from IWSLT to five low-resource language-pairs from TED Talks demonstrate the effectiveness of our approach.
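A hedged sketch of the adaptive idea described above: distill from several teacher NMT models, but weight each teacher's soft-target loss by how well that teacher fits the current batch, so stronger teachers contribute more. The weighting rule (a softmax over negative teacher cross-entropies) and temperature are assumptions for illustration, not the paper's exact formulation.

```python
# Adaptive multi-teacher distillation sketch: weight each teacher's soft-target
# loss by how well that teacher predicts the gold target for the current batch.
# The weighting rule and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def adaptive_multi_teacher_kd(student_logits, teacher_logits_list, labels,
                              pad_id=-100, temperature=1.0):
    vocab = student_logits.size(-1)
    s = student_logits.view(-1, vocab)
    y = labels.view(-1)
    kd_terms, teacher_fit = [], []
    for t_logits in teacher_logits_list:
        t = t_logits.view(-1, vocab)
        # Soft-target term: KL between this teacher's and the student's distributions.
        kd_terms.append(F.kl_div(F.log_softmax(s, dim=-1), F.softmax(t, dim=-1),
                                 reduction="batchmean"))
        # Teacher quality on this batch: negative cross-entropy against gold tokens.
        teacher_fit.append(-F.cross_entropy(t, y, ignore_index=pad_id))
    # Better-fitting teachers get larger weights (padding not masked in KL for brevity).
    weights = torch.softmax(torch.stack(teacher_fit) / temperature, dim=0).detach()
    return (weights * torch.stack(kd_terms)).sum()
```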
arXiv Detail & Related papers (2020-10-12T04:26:46Z)