ERNIE 3.0 Tiny: Frustratingly Simple Method to Improve Task-Agnostic
Distillation Generalization
- URL: http://arxiv.org/abs/2301.03416v1
- Date: Mon, 9 Jan 2023 15:12:50 GMT
- Title: ERNIE 3.0 Tiny: Frustratingly Simple Method to Improve Task-Agnostic
Distillation Generalization
- Authors: Weixin Liu, Xuyi Chen, Jiaxiang Liu, Shikun Feng, Yu Sun, Hao Tian,
Hua Wu
- Abstract summary: Task-agnostic knowledge distillation attempts to address the problem of deploying large pretrained language models in resource-constrained scenarios.
We show that we can leverage multi-task learning in task-agnostic distillation to improve the generalization of the resulting student.
- Score: 36.338614215561805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Task-agnostic knowledge distillation attempts to address the problem of
deploying large pretrained language models in resource-constrained scenarios by
compressing a large pretrained model, called the teacher, into a smaller one,
called the student, such that the student can be directly finetuned on downstream
tasks and retains comparable performance. However, we empirically find that there
is a generalization gap between the student and the teacher in existing methods.
In this work, we show that we can leverage multi-task learning in task-agnostic
distillation to improve the generalization of the resulting student. In
particular, we propose Multi-task Infused Task-agnostic Knowledge Distillation
(MITKD). We first enhance the teacher by multi-task training it on multiple
downstream tasks and then perform distillation to produce the student.
Experimental results demonstrate that our method yields a student with much
better generalization that significantly outperforms existing baselines and
establishes new state-of-the-art results on in-domain, out-of-domain, and
low-resource datasets in the setting of task-agnostic distillation. Moreover,
our method even exceeds the 8x larger BERT$_{\text{Base}}$ on SQuAD and four
GLUE tasks. In addition, when combined with ERNIE 3.0, our method achieves
state-of-the-art results on 10 Chinese datasets.
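The abstract describes a two-stage recipe: first multi-task finetune the teacher on several downstream tasks, then run task-agnostic distillation into a smaller student. A minimal PyTorch sketch of that pipeline follows; the encoder sizes, task set, mean-pooled sentence representations, and the MSE representation-matching loss are illustrative assumptions, not the paper's actual ERNIE 3.0 Tiny configuration or distillation objective.
```python
# Minimal sketch of the two-stage recipe described in the abstract:
#   stage 1: multi-task finetune the teacher on labeled downstream tasks,
#   stage 2: task-agnostic distillation of the teacher into a smaller student.
# Model sizes, tasks, pooling, and losses are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

TASKS = {"sst2": 2, "mnli": 3, "qqp": 2}  # hypothetical downstream tasks: name -> num labels
teacher = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=12)
student = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True), num_layers=4)
heads = nn.ModuleDict({t: nn.Linear(768, n) for t, n in TASKS.items()})

# Stage 1: multi-task training of the teacher (batches sampled across TASKS).
opt_t = torch.optim.AdamW(list(teacher.parameters()) + list(heads.parameters()), lr=2e-5)
def teacher_step(task: str, token_embeds: torch.Tensor, labels: torch.Tensor) -> float:
    logits = heads[task](teacher(token_embeds).mean(dim=1))  # mean-pool over tokens
    loss = F.cross_entropy(logits, labels)
    opt_t.zero_grad(); loss.backward(); opt_t.step()
    return loss.item()

# Stage 2: task-agnostic distillation on unlabeled text; the student's pooled
# representation is projected and matched to the (frozen) multi-task teacher's.
proj = nn.Linear(384, 768)  # aligns student and teacher hidden sizes
opt_s = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-4)
def distill_step(teacher_embeds: torch.Tensor, student_embeds: torch.Tensor) -> float:
    with torch.no_grad():
        t_repr = teacher(teacher_embeds).mean(dim=1)
    s_repr = proj(student(student_embeds).mean(dim=1))
    loss = F.mse_loss(s_repr, t_repr)
    opt_s.zero_grad(); loss.backward(); opt_s.step()
    return loss.item()

# Example steps with random stand-ins for token embeddings (batch 8, seq len 16).
teacher_step("sst2", torch.randn(8, 16, 768), torch.randint(0, 2, (8,)))
distill_step(torch.randn(8, 16, 768), torch.randn(8, 16, 384))
```
After stage 2, the student is finetuned directly on each downstream task, which is where the generalization gap against the teacher is measured.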
Related papers
- On Giant's Shoulders: Effortless Weak to Strong by Dynamic Logits Fusion [23.63688816017186]
Existing weak-to-strong methods often employ a static knowledge transfer ratio and a single small model for transferring complex knowledge.
We propose a dynamic logit fusion approach that works with a series of task-specific small models, each specialized in a different task (a minimal sketch of the fusion idea appears after this list).
Our method closes the performance gap by 96.4% in single-task scenarios and by 86.3% in multi-task scenarios.
arXiv Detail & Related papers (2024-06-17T03:07:41Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems [4.675744559395732]
Multitask learning assumes that models capable of learning from multiple tasks can achieve better quality and efficiency via knowledge transfer.
State-of-the-art ML models rely on heavy per-task customization and leverage model size and data scale rather than scaling the number of tasks.
We propose an evolutionary method that can generate a large-scale multitask model and supports the dynamic and continuous addition of new tasks.
arXiv Detail & Related papers (2022-05-25T13:10:47Z)
- Task Adaptive Parameter Sharing for Multi-Task Learning [114.80350786535952]
Task Adaptive Parameter Sharing (TAPS) is a method for tuning a base model to a new task by adaptively modifying a small, task-specific subset of layers.
Compared to other methods, TAPS retains high accuracy on downstream tasks while introducing few task-specific parameters.
We evaluate our method on a suite of fine-tuning tasks and architectures (ResNet, DenseNet, ViT) and show that it achieves state-of-the-art performance while being simple to implement.
arXiv Detail & Related papers (2022-03-30T23:16:07Z)
- Representation Consolidation for Training Expert Students [54.90754502493968]
We show that a multi-head, multi-task distillation method is sufficient to consolidate representations from task-specific teacher(s) and improve downstream performance.
Our method can also combine the representational knowledge of multiple teachers trained on one or multiple domains into a single model.
arXiv Detail & Related papers (2021-07-16T17:58:18Z)
- XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation [80.18830380517753]
We develop a new task-agnostic distillation framework XtremeDistilTransformers.
We study the transferability of several source tasks, augmentation resources and model architecture for distillation.
arXiv Detail & Related papers (2021-06-08T17:49:33Z)
- Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation [55.34995029082051]
We propose a method to learn to augment for data-scarce domain BERT knowledge distillation.
We show that the proposed method significantly outperforms state-of-the-art baselines on four different tasks.
arXiv Detail & Related papers (2021-01-20T13:07:39Z)
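The first related entry above ("On Giant's Shoulders") describes blending a large model's logits with those of several task-specific small models using dynamic, per-example transfer weights. A minimal sketch of that fusion idea follows; weighting experts by their prediction confidence (negative entropy) and the fixed blend factor alpha are assumptions for illustration, not the authors' actual rule.
```python
# Minimal sketch of dynamic logit fusion: blend a large model's logits with a
# confidence-weighted mixture of logits from task-specific small models.
# The entropy-based weighting and fixed alpha are illustrative assumptions.
import torch

def fuse_logits(large_logits: torch.Tensor,
                expert_logits: list[torch.Tensor],
                alpha: float = 0.5) -> torch.Tensor:
    """large_logits: (B, V); expert_logits: list of (B, V) tensors, one per small model."""
    experts = torch.stack(expert_logits, dim=0)                # (E, B, V)
    probs = experts.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)   # (E, B)
    weights = (-entropy).softmax(dim=0).unsqueeze(-1)          # (E, B, 1): confident experts weigh more
    fused_experts = (weights * experts).sum(dim=0)             # (B, V)
    return alpha * large_logits + (1.0 - alpha) * fused_experts

# Example: batch of 2, vocabulary of 100, three task-specific experts.
fused = fuse_logits(torch.randn(2, 100), [torch.randn(2, 100) for _ in range(3)])
```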