LightPAFF: A Two-Stage Distillation Framework for Pre-training and Fine-tuning
- URL: http://arxiv.org/abs/2004.12817v1
- Date: Mon, 27 Apr 2020 14:00:09 GMT
- Title: LightPAFF: A Two-Stage Distillation Framework for Pre-training and Fine-tuning
- Authors: Kaitao Song, Hao Sun, Xu Tan, Tao Qin, Jianfeng Lu, Hongzhi Liu and Tie-Yan Liu
- Abstract summary: LightPAFF uses two-stage knowledge distillation to transfer knowledge from a big teacher model to a lightweight student model.
LightPAFF reduces the model size by nearly 5x and improves online inference speed by 5x-7x.
- Score: 146.51221523793342
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While pre-training and fine-tuning, e.g., BERT~\citep{devlin2018bert},
GPT-2~\citep{radford2019language}, have achieved great success in language
understanding and generation tasks, the pre-trained models are usually too big
for online deployment in terms of both memory cost and inference speed, which
hinders them from practical online usage. In this paper, we propose LightPAFF,
a Lightweight Pre-training And Fine-tuning Framework that leverages two-stage
knowledge distillation to transfer knowledge from a big teacher model to a
lightweight student model in both the pre-training and fine-tuning stages. In this
way, the lightweight model can achieve accuracy similar to that of the big teacher
model, but with far fewer parameters and thus much faster online inference.
LightPAFF can support different pre-training methods (such as BERT, GPT-2 and
MASS~\citep{song2019mass}) and be applied to many downstream tasks. Experiments
on three language understanding tasks, three language modeling tasks and three
sequence-to-sequence generation tasks demonstrate that, while achieving accuracy
similar to the big BERT, GPT-2 and MASS models, LightPAFF reduces the model
size by nearly 5x and improves online inference speed by 5x-7x.
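As a rough illustration of the distillation objective applied in each of LightPAFF's two stages, the sketch below mixes a standard cross-entropy loss on ground-truth labels with a temperature-scaled KL term against the teacher's soft predictions. The weighting scheme, temperature, and variable names are illustrative assumptions, not details taken from the paper.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, targets,
                          temperature=1.0, alpha=0.5):
        # Hard-label term: ordinary cross-entropy against the ground truth.
        ce = F.cross_entropy(student_logits, targets)
        # Soft-label term: KL divergence between temperature-scaled
        # teacher and student distributions (scaled by T^2 as is standard).
        t = temperature
        kd = F.kl_div(
            F.log_softmax(student_logits / t, dim=-1),
            F.softmax(teacher_logits / t, dim=-1),
            reduction="batchmean",
        ) * (t * t)
        # alpha trades off imitation of the teacher against the hard labels;
        # its value here is an assumption, not the paper's setting.
        return (1.0 - alpha) * ce + alpha * kd

In LightPAFF's two-stage scheme, an objective of this form would be applied twice: once to distill the pre-trained teacher (e.g., BERT, GPT-2 or MASS) into the student during pre-training, and again to distill the fine-tuned teacher on each downstream task.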
Related papers
- LIONs: An Empirically Optimized Approach to Align Language Models [31.225180404295536]
We conduct a rigorous analysis over a three-stage training pipeline consisting of supervised fine-tuning, offline preference learning, and online preference learning.
We have found that using techniques like sequence packing, loss masking in SFT, increasing the preference dataset size in DPO, and online DPO training can significantly improve the performance of language models.
arXiv Detail & Related papers (2024-07-09T04:34:39Z)
- Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models [46.42092771753465]
We introduce Point-PEFT, a novel framework for adapting point cloud pre-trained models with minimal learnable parameters.
Specifically, for a pre-trained 3D model, we freeze most of its parameters and only tune the newly added PEFT modules on downstream tasks (a generic sketch of this freeze-and-tune pattern appears after this list).
arXiv Detail & Related papers (2023-10-04T16:49:36Z)
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models [125.91897197446379]
We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
arXiv Detail & Related papers (2023-05-24T04:22:26Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models [4.114555639014612]
We show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training.
We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs.
arXiv Detail & Related papers (2023-03-18T17:56:01Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
arXiv Detail & Related papers (2021-10-30T03:29:47Z)
- LiST: Lite Self-training Makes Efficient Few-shot Learners [91.28065455714018]
LiST improves by 35% over classic fine-tuning methods and by 6% over prompt-tuning, with a 96% reduction in the number of trainable parameters, when fine-tuned with no more than 30 labeled examples from each target domain.
arXiv Detail & Related papers (2021-10-12T18:47:18Z)
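Several of the papers above (Point-PEFT, eP-ALM, LiST) share the same freeze-and-tune recipe: keep the pre-trained backbone frozen and train only a small number of newly added parameters. The sketch below illustrates that generic pattern; the Adapter bottleneck, its size, and the optimizer settings are assumptions for illustration, not modules taken from any of the listed papers.

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        # A small residual bottleneck added on top of a frozen backbone
        # (illustrative; not the exact module used by Point-PEFT or eP-ALM).
        def __init__(self, dim, bottleneck=64):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.up = nn.Linear(bottleneck, dim)

        def forward(self, x):
            # Project down, apply a nonlinearity, project back up, add residual.
            return x + self.up(torch.relu(self.down(x)))

    def make_parameter_efficient(backbone: nn.Module, hidden_dim: int):
        # Freeze every pre-trained weight so gradients flow only to the adapter.
        for p in backbone.parameters():
            p.requires_grad = False
        adapter = Adapter(hidden_dim)
        # Only the adapter's (few) parameters are handed to the optimizer.
        optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)
        return adapter, optimizer

In this pattern the frozen backbone only produces features, and the handful of new parameters (an adapter here, or in eP-ALM's case a single linear projection and one trainable token) are the only weights updated on the downstream task.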
This list is automatically generated from the titles and abstracts of the papers in this site.