A Comparative Analysis of Task-Agnostic Distillation Methods for
Compressing Transformer Language Models
- URL: http://arxiv.org/abs/2310.08797v1
- Date: Fri, 13 Oct 2023 01:00:15 GMT
- Title: A Comparative Analysis of Task-Agnostic Distillation Methods for
Compressing Transformer Language Models
- Authors: Takuma Udagawa, Aashka Trivedi, Michele Merler, Bishwaranjan
Bhattacharjee
- Abstract summary: We reproduce, compare and analyze several methods for task-agnostic (general-purpose) distillation of Transformer language models.
Our target of study includes Output Distribution (OD) transfer, Hidden State (HS) transfer with various layer mapping strategies, and Multi-Head Attention (MHA) transfer based on MiniLMv2.
- Score: 5.818750175599656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have become a vital component in modern NLP, achieving
state-of-the-art performance in a variety of tasks. However, they are often
inefficient for real-world deployment due to their expensive inference costs.
Knowledge distillation is a promising technique to improve their efficiency
while retaining most of their effectiveness. In this paper, we reproduce,
compare and analyze several representative methods for task-agnostic
(general-purpose) distillation of Transformer language models. Our target of
study includes Output Distribution (OD) transfer, Hidden State (HS) transfer
with various layer mapping strategies, and Multi-Head Attention (MHA) transfer
based on MiniLMv2. Through our extensive experiments, we study the
effectiveness of each method for various student architectures in both
monolingual (English) and multilingual settings. Overall, we show that MHA
transfer based on MiniLMv2 is generally the best option for distillation and
explain the potential reasons behind its success. Moreover, we show that HS
transfer remains a competitive baseline, especially under a sophisticated
layer mapping strategy, while OD transfer consistently lags behind other
approaches. Findings from this study helped us deploy efficient yet effective
student models for latency-critical applications.
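To make the three transfer objectives above concrete, the following is a minimal PyTorch-style sketch rather than the paper's implementation: the function names (od_loss, hs_loss, mha_relation_loss), the tensor shapes, the temperature, and the projection used to bridge differing hidden sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def od_loss(student_logits, teacher_logits, temperature=2.0):
    """Output Distribution (OD) transfer: KL divergence between the
    temperature-softened teacher and student output distributions."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

def hs_loss(student_states, teacher_states, layer_map, proj):
    """Hidden State (HS) transfer: MSE between each student layer and the
    teacher layer it is mapped to. `layer_map[i]` is the teacher layer chosen
    for student layer i; `proj` is a learned linear map (or identity) that
    bridges differing hidden sizes."""
    total = 0.0
    for s_idx, t_idx in enumerate(layer_map):
        total = total + F.mse_loss(proj(student_states[s_idx]), teacher_states[t_idx])
    return total / len(layer_map)

def mha_relation_loss(student_qkv, teacher_qkv, num_relation_heads=12):
    """MHA transfer in the MiniLMv2 style: align self-attention relations
    (Q-Q, K-K and V-V scaled dot-product distributions) of one teacher layer
    with one student layer, after re-splitting into relation heads so that
    teacher and student head counts need not match."""
    def relations(x):
        # x: (batch, seq_len, hidden) -> (batch, relation_heads, seq_len, seq_len)
        b, s, h = x.shape
        d = h // num_relation_heads
        x = x.view(b, s, num_relation_heads, d).transpose(1, 2)
        return torch.matmul(x, x.transpose(-1, -2)) / d ** 0.5
    total = 0.0
    for x_s, x_t in zip(student_qkv, teacher_qkv):  # iterate over (Q, K, V)
        log_r_s = F.log_softmax(relations(x_s), dim=-1)
        r_t = F.softmax(relations(x_t), dim=-1)
        total = total + F.kl_div(log_r_s, r_t, reduction="batchmean")
    return total / 3
```

As an example of the layer mapping strategies mentioned in the abstract, a 6-layer student distilled from a 12-layer teacher could use a uniform map such as layer_map = [1, 3, 5, 7, 9, 11] or a last-layers map such as [6, 7, 8, 9, 10, 11]; these are common choices given for illustration, not necessarily the exact strategies evaluated in the paper.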
Related papers
- LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
With a proper strategy and evaluation across different benchmarks, even a 2.7B small-scale model can perform on par with larger models with 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z)
- MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting [53.77590764277568]
We introduce a novel MoE-CT architecture that separates the base model's learning from the multilingual expansion process.
Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency; a sketch of this freeze-and-extend pattern appears after this entry.
arXiv Detail & Related papers (2024-06-25T11:03:45Z)
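As a rough illustration of the freeze-and-extend design in the MoE-CT entry above, here is a minimal PyTorch sketch. The adapter below (TinyMoEAdapter, freeze_base_and_extend, the gating scheme, expert count, and sizes) is a hypothetical placeholder, not the MoE-CT architecture itself.

```python
import torch
import torch.nn as nn

class TinyMoEAdapter(nn.Module):
    """Placeholder mixture-of-experts block: a softmax gate over a few small
    expert MLPs, added residually on top of frozen base-model states."""
    def __init__(self, hidden_size, num_experts=4, bottleneck=64):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, bottleneck), nn.GELU(),
                          nn.Linear(bottleneck, hidden_size))
            for _ in range(num_experts)
        )

    def forward(self, hidden):                       # hidden: (batch, seq, hidden)
        weights = torch.softmax(self.gate(hidden), dim=-1)                  # (b, s, E)
        expert_out = torch.stack([e(hidden) for e in self.experts], dim=-1)  # (b, s, h, E)
        mixed = (expert_out * weights.unsqueeze(-2)).sum(dim=-1)             # (b, s, h)
        return hidden + mixed                        # residual: base behaviour preserved

def freeze_base_and_extend(base_model, hidden_size):
    """Freeze every original parameter; only the new adapter stays trainable."""
    for p in base_model.parameters():
        p.requires_grad_(False)
    adapter = TinyMoEAdapter(hidden_size)
    return base_model, adapter
```

Because the adapter is residual and the base parameters stay frozen, the original model's behaviour is preserved wherever the adapter's contribution is small, which is the intuition behind protecting high-resource languages while extending to low-resource ones.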
- VideoAdviser: Video Knowledge Distillation for Multimodal Transfer Learning [6.379202839994046]
Multimodal transfer learning aims to transform pretrained representations of diverse modalities into a common domain space for effective multimodal fusion.
We propose VideoAdviser, a video knowledge distillation method to transfer multimodal knowledge of video-enhanced prompts from a multimodal fundamental model to a specific modal fundamental model.
We evaluate our method in two challenging multimodal tasks: video-level sentiment analysis and audio-visual retrieval.
arXiv Detail & Related papers (2023-09-27T08:44:04Z)
- Fixing MoE Over-Fitting on Low-Resource Languages in Multilingual Machine Translation [8.7660229706359]
Sparsely gated Mixture of Experts (MoE) models have been shown to be a compute-efficient method to scale model capacity for multilingual machine translation.
We show effective regularization strategies that prevent over-fitting and improve the performance of MoE models on low-resource tasks.
arXiv Detail & Related papers (2022-12-15T01:06:55Z)
- Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
- Weighted Ensemble Self-Supervised Learning [67.24482854208783]
Ensembling has proven to be a powerful technique for boosting model performance.
We develop a framework that permits data-dependent weighted cross-entropy losses.
Our method outperforms both in multiple evaluation metrics on ImageNet-1K.
arXiv Detail & Related papers (2022-11-18T02:00:17Z)
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
- Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks.
Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients.
We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
arXiv Detail & Related papers (2022-03-09T17:26:53Z)
- Incorporating Linguistic Knowledge for Abstractive Multi-document Summarization [20.572283625521784]
We develop a neural network based abstractive multi-document summarization (MDS) model.
We feed the dependency information into the linguistic-guided attention mechanism.
With the help of linguistic signals, sentence-level relations can be correctly captured.
arXiv Detail & Related papers (2021-09-23T08:13:35Z)
- Selective Knowledge Distillation for Neural Machine Translation [24.493705133103443]
Knowledge distillation is widely applied to enhance the model's performance by transferring the teacher model's knowledge on each training sample.
Previous work rarely discusses the different impacts and connections among these samples, which serve as the medium for transferring teacher knowledge.
We propose two simple yet effective strategies, i.e., batch-level and global-level selections, to pick suitable samples for distillation; a sketch of these selection strategies appears after this entry.
arXiv Detail & Related papers (2021-05-27T06:54:12Z)
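To make the batch-level and global-level selection strategies above concrete, here is a minimal sketch under stated assumptions: samples are scored by some per-sample quantity (for example, student cross-entropy), the batch-level rule keeps a top fraction of the current batch, and the global-level rule compares against a running quantile over a FIFO queue of recent scores. The exact scoring and thresholding in the paper may differ.

```python
from collections import deque
import torch

def batch_level_select(scores, keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of the current batch, ranked by a
    1-D tensor of per-sample scores (e.g. student cross-entropy)."""
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = torch.topk(scores, k).values.min()
    return scores >= threshold               # boolean mask over the batch

class GlobalLevelSelector:
    """Keep a FIFO queue of recent scores and select a sample whenever its
    score exceeds the running (1 - keep_ratio) quantile of that queue."""
    def __init__(self, keep_ratio=0.5, queue_size=30_000):
        self.queue = deque(maxlen=queue_size)
        self.keep_ratio = keep_ratio

    def select(self, scores):
        self.queue.extend(scores.detach().flatten().tolist())
        threshold = torch.quantile(torch.tensor(list(self.queue)),
                                   1.0 - self.keep_ratio)
        return scores >= threshold.item()    # boolean mask over the batch
```

The returned boolean mask would then gate the per-sample distillation term, so only the selected samples contribute to the knowledge-distillation loss.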
- Comparing Transfer and Meta Learning Approaches on a Unified Few-Shot Classification Benchmark [44.530605715850506]
We conduct a cross-family study of the best transfer and meta learners on a large-scale meta-learning benchmark (Meta-Dataset, MD) and a transfer learning benchmark.
We find that, on average, large-scale transfer methods (Big Transfer, BiT) outperform competing approaches on MD, even when trained only on ImageNet.
We reveal a number of discrepancies in evaluation norms and study some of these in light of the performance gap.
arXiv Detail & Related papers (2021-04-06T16:17:51Z)