A Comparative Analysis of Task-Agnostic Distillation Methods for
Compressing Transformer Language Models
- URL: http://arxiv.org/abs/2310.08797v1
- Date: Fri, 13 Oct 2023 01:00:15 GMT
- Title: A Comparative Analysis of Task-Agnostic Distillation Methods for
Compressing Transformer Language Models
- Authors: Takuma Udagawa, Aashka Trivedi, Michele Merler, Bishwaranjan
Bhattacharjee
- Abstract summary: We reproduce, compare and analyze several methods for task-agnostic (general-purpose) distillation of Transformer language models.
Our target of study includes Output Distribution (OD) transfer, Hidden State (HS) transfer with various layer mapping strategies, and Multi-Head Attention (MHA) transfer based on MiniLMv2.
- Score: 5.818750175599656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have become a vital component in modern NLP, achieving
state-of-the-art performance in a variety of tasks. However, they are often
inefficient for real-world deployment due to their expensive inference costs.
Knowledge distillation is a promising technique to improve their efficiency
while retaining most of their effectiveness. In this paper, we reproduce,
compare and analyze several representative methods for task-agnostic
(general-purpose) distillation of Transformer language models. Our target of
study includes Output Distribution (OD) transfer, Hidden State (HS) transfer
with various layer mapping strategies, and Multi-Head Attention (MHA) transfer
based on MiniLMv2. Through our extensive experiments, we study the
effectiveness of each method for various student architectures in both
monolingual (English) and multilingual settings. Overall, we show that MHA
transfer based on MiniLMv2 is generally the best option for distillation and
explain the potential reasons behind its success. Moreover, we show that HS
transfer remains a competitive baseline, especially under a sophisticated
layer mapping strategy, while OD transfer consistently lags behind other
approaches. Findings from this study helped us deploy efficient yet effective
student models for latency-critical applications.
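To make the three transfer objectives above concrete, the following is a minimal PyTorch-style sketch rather than the paper's implementation: the function names (od_loss, hs_loss, mha_relation_loss), the tensor shapes, the temperature, and the projection used to bridge differing hidden sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def od_loss(student_logits, teacher_logits, temperature=2.0):
    """Output Distribution (OD) transfer: KL divergence between the
    temperature-softened teacher and student output distributions."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

def hs_loss(student_states, teacher_states, layer_map, proj):
    """Hidden State (HS) transfer: MSE between each student layer and the
    teacher layer it is mapped to. `layer_map[i]` is the teacher layer chosen
    for student layer i; `proj` is a learned linear map (or identity) that
    bridges differing hidden sizes."""
    total = 0.0
    for s_idx, t_idx in enumerate(layer_map):
        total = total + F.mse_loss(proj(student_states[s_idx]), teacher_states[t_idx])
    return total / len(layer_map)

def mha_relation_loss(student_qkv, teacher_qkv, num_relation_heads=12):
    """MHA transfer in the MiniLMv2 style: align self-attention relations
    (Q-Q, K-K and V-V scaled dot-product distributions) of one teacher layer
    with one student layer, after re-splitting into relation heads so that
    teacher and student head counts need not match."""
    def relations(x):
        # x: (batch, seq_len, hidden) -> (batch, relation_heads, seq_len, seq_len)
        b, s, h = x.shape
        d = h // num_relation_heads
        x = x.view(b, s, num_relation_heads, d).transpose(1, 2)
        return torch.matmul(x, x.transpose(-1, -2)) / d ** 0.5
    total = 0.0
    for x_s, x_t in zip(student_qkv, teacher_qkv):  # iterate over (Q, K, V)
        log_r_s = F.log_softmax(relations(x_s), dim=-1)
        r_t = F.softmax(relations(x_t), dim=-1)
        total = total + F.kl_div(log_r_s, r_t, reduction="batchmean")
    return total / 3
```

As an example of the layer mapping strategies mentioned in the abstract, a 6-layer student distilled from a 12-layer teacher could use a uniform map such as layer_map = [1, 3, 5, 7, 9, 11] or a last-layers map such as [6, 7, 8, 9, 10, 11]; these are common choices given for illustration, not necessarily the exact strategies evaluated in the paper.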
Related papers
- LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
With a proper strategy and evaluation across different benchmarks, even a 2.7B small-scale model can perform on par with larger models with 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z)
- MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting [53.77590764277568]
We introduce a novel MoE-CT architecture that separates the base model's learning from the multilingual expansion process.
Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency; a sketch of this freeze-and-extend pattern appears after this entry.
arXiv Detail & Related papers (2024-06-25T11:03:45Z)
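As a rough illustration of the freeze-and-extend design in the MoE-CT entry above, here is a minimal PyTorch sketch. The adapter below (TinyMoEAdapter, freeze_base_and_extend, the gating scheme, expert count, and sizes) is a hypothetical placeholder, not the MoE-CT architecture itself.

```python
import torch
import torch.nn as nn

class TinyMoEAdapter(nn.Module):
    """Placeholder mixture-of-experts block: a softmax gate over a few small
    expert MLPs, added residually on top of frozen base-model states."""
    def __init__(self, hidden_size, num_experts=4, bottleneck=64):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, bottleneck), nn.GELU(),
                          nn.Linear(bottleneck, hidden_size))
            for _ in range(num_experts)
        )

    def forward(self, hidden):                       # hidden: (batch, seq, hidden)
        weights = torch.softmax(self.gate(hidden), dim=-1)                  # (b, s, E)
        expert_out = torch.stack([e(hidden) for e in self.experts], dim=-1)  # (b, s, h, E)
        mixed = (expert_out * weights.unsqueeze(-2)).sum(dim=-1)             # (b, s, h)
        return hidden + mixed                        # residual: base behaviour preserved

def freeze_base_and_extend(base_model, hidden_size):
    """Freeze every original parameter; only the new adapter stays trainable."""
    for p in base_model.parameters():
        p.requires_grad_(False)
    adapter = TinyMoEAdapter(hidden_size)
    return base_model, adapter
```

Because the adapter is residual and the base parameters stay frozen, the original model's behaviour is preserved wherever the adapter's contribution is small, which is the intuition behind protecting high-resource languages while extending to low-resource ones.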
- VideoAdviser: Video Knowledge Distillation for Multimodal Transfer Learning [6.379202839994046]
Multimodal transfer learning aims to transform pretrained representations of diverse modalities into a common domain space for effective multimodal fusion.
We propose VideoAdviser, a video knowledge distillation method to transfer multimodal knowledge of video-enhanced prompts from a multimodal fundamental model to a specific modal fundamental model.
We evaluate our method in two challenging multimodal tasks: video-level sentiment analysis and audio-visual retrieval.
arXiv Detail & Related papers (2023-09-27T08:44:04Z)
- Fixing MoE Over-Fitting on Low-Resource Languages in Multilingual Machine Translation [8.7660229706359]
Sparsely gated Mixture of Experts (MoE) models have been shown to be a compute-efficient method to scale model capacity for multilingual machine translation.
We show effective regularization strategies that prevent over-fitting and improve the performance of MoE models on low-resource tasks.
arXiv Detail & Related papers (2022-12-15T01:06:55Z)
- Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
- Weighted Ensemble Self-Supervised Learning [67.24482854208783]
Ensembling has proven to be a powerful technique for boosting model performance.
We develop a framework that permits data-dependent weighted cross-entropy losses.
Our method outperforms both in multiple evaluation metrics on ImageNet-1K.
arXiv Detail & Related papers (2022-11-18T02:00:17Z)
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
- Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks.
Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients.
We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
arXiv Detail & Related papers (2022-03-09T17:26:53Z)
- Incorporating Linguistic Knowledge for Abstractive Multi-document Summarization [20.572283625521784]
We develop a neural network based abstractive multi-document summarization (MDS) model.
We feed the dependency information into the linguistic-guided attention mechanism.
With the help of linguistic signals, sentence-level relations can be correctly captured.
arXiv Detail & Related papers (2021-09-23T08:13:35Z)
- Selective Knowledge Distillation for Neural Machine Translation [24.493705133103443]
Knowledge distillation is widely applied to enhance the model's performance by transferring the teacher model's knowledge on each training sample.
Previous work rarely discusses the different impacts and connections among these samples, which serve as the medium for transferring teacher knowledge.
We propose two simple yet effective strategies, i.e., batch-level and global-level selections, to pick suitable samples for distillation; a sketch of these selection strategies appears after this entry.
arXiv Detail & Related papers (2021-05-27T06:54:12Z)
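To make the batch-level and global-level selection strategies above concrete, here is a minimal sketch under stated assumptions: samples are scored by some per-sample quantity (for example, student cross-entropy), the batch-level rule keeps a top fraction of the current batch, and the global-level rule compares against a running quantile over a FIFO queue of recent scores. The exact scoring and thresholding in the paper may differ.

```python
from collections import deque
import torch

def batch_level_select(scores, keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of the current batch, ranked by a
    1-D tensor of per-sample scores (e.g. student cross-entropy)."""
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = torch.topk(scores, k).values.min()
    return scores >= threshold               # boolean mask over the batch

class GlobalLevelSelector:
    """Keep a FIFO queue of recent scores and select a sample whenever its
    score exceeds the running (1 - keep_ratio) quantile of that queue."""
    def __init__(self, keep_ratio=0.5, queue_size=30_000):
        self.queue = deque(maxlen=queue_size)
        self.keep_ratio = keep_ratio

    def select(self, scores):
        self.queue.extend(scores.detach().flatten().tolist())
        threshold = torch.quantile(torch.tensor(list(self.queue)),
                                   1.0 - self.keep_ratio)
        return scores >= threshold.item()    # boolean mask over the batch
```

The returned boolean mask would then gate the per-sample distillation term, so only the selected samples contribute to the knowledge-distillation loss.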
- Comparing Transfer and Meta Learning Approaches on a Unified Few-Shot Classification Benchmark [44.530605715850506]
We conduct a cross-family study of the best transfer and meta learners on a large-scale meta-learning benchmark (Meta-Dataset, MD) and a transfer learning benchmark.
We find that, on average, large-scale transfer methods (Big Transfer, BiT) outperform competing approaches on MD, even when trained only on ImageNet.
We reveal a number of discrepancies in evaluation norms and study some of these in light of the performance gap.
arXiv Detail & Related papers (2021-04-06T16:17:51Z)