DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and
Quantization
- URL: http://arxiv.org/abs/2203.11239v1
- Date: Mon, 21 Mar 2022 18:04:25 GMT
- Title: DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and
Quantization
- Authors: Zheng Li, Zijian Wang, Ming Tan, Ramesh Nallapati, Parminder Bhatia,
Andrew Arnold, Bing Xiang, Dan Roth
- Abstract summary: Large-scale pre-trained sequence-to-sequence models like BART and T5 achieve state-of-the-art performance on many generative NLP tasks.
These models pose a great challenge in resource-constrained scenarios owing to their large memory requirements and high latency.
We propose to jointly distill and quantize the model, where knowledge is transferred from the full-precision teacher model to the quantized and distilled low-precision student model.
- Score: 75.72231742114951
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large-scale pre-trained sequence-to-sequence models like BART and T5 achieve
state-of-the-art performance on many generative NLP tasks. However, such models
pose a great challenge in resource-constrained scenarios owing to their large
memory requirements and high latency. To alleviate this issue, we propose to
jointly distill and quantize the model, where knowledge is transferred from the
full-precision teacher model to the quantized and distilled low-precision
student model. Empirical analyses show that, despite the challenging nature of
generative tasks, we were able to achieve a 16.5x model footprint compression
ratio with little performance drop relative to the full-precision counterparts
on multiple summarization and QA datasets. We further pushed the limit of
compression ratio to 27.7x and presented the performance-efficiency trade-off
for generative tasks using pre-trained models. To the best of our knowledge,
this is the first work aiming to effectively distill and quantize
sequence-to-sequence pre-trained models for language generation tasks.
Related papers
- Over-parameterized Student Model via Tensor Decomposition Boosted Knowledge Distillation [10.48108719012248]
We focus on Knowledge Distillation (KD), where a compact student model is trained to mimic a larger teacher model.
In contrast to much of the previous work, we scale up the parameters of the student model during training.
arXiv Detail & Related papers (2024-11-10T12:40:59Z) - Efficient Point Cloud Classification via Offline Distillation Framework and Negative-Weight Self-Distillation Technique [46.266960248570086]
We introduce an innovative offline recording strategy that avoids the simultaneous loading of both teacher and student models.
This approach feeds a multitude of augmented samples into the teacher model, recording both the data augmentation parameters and the corresponding logit outputs.
Experimental results demonstrate that the proposed distillation strategy enables the student model to achieve performance comparable to state-of-the-art models.
arXiv Detail & Related papers (2024-09-03T16:12:12Z) - Progressive Distillation Based on Masked Generation Feature Method for Knowledge Graph Completion [29.297959023968165]
This paper proposes a progressive distillation method based on masked generation features for KGC task.
Specifically, we perform pre-distillation on PLM to obtain high-quality teacher models, and compress the PLM network to obtain multi-grade student models.
The experimental results demonstrate that the model in the pre-distillation stage surpasses the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-01-19T07:34:36Z) - Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual
Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z) - Reusing Pretrained Models by Multi-linear Operators for Efficient
Training [65.64075958382034]
Training large models from scratch usually costs a substantial amount of resources.
Recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize a large model.
We propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model.
arXiv Detail & Related papers (2023-10-16T06:16:47Z) - Model soups: averaging weights of multiple fine-tuned models improves
accuracy without increasing inference time [69.7693300927423]
We show that averaging the weights of multiple models fine-tuned with different hyper parameter configurations improves accuracy and robustness.
We show that the model soup approach extends to multiple image classification and natural language processing tasks.
arXiv Detail & Related papers (2022-03-10T17:03:49Z) - Improving Non-autoregressive Generation with Mixup Training [51.61038444990301]
We present a non-autoregressive generation model based on pre-trained transformer models.
We propose a simple and effective iterative training method called MIx Source and pseudo Target.
Our experiments on three generation benchmarks including question generation, summarization and paraphrase generation, show that the proposed framework achieves the new state-of-the-art results.
arXiv Detail & Related papers (2021-10-21T13:04:21Z) - Zero-shot Adversarial Quantization [11.722728148523366]
We propose a zero-shot adversarial quantization (ZAQ) framework, facilitating effective discrepancy estimation and knowledge transfer.
This is achieved by a novel two-level discrepancy modeling to drive a generator to synthesize informative and diverse data examples.
We conduct extensive experiments on three fundamental vision tasks, demonstrating the superiority of ZAQ over the strong zero-shot baselines.
arXiv Detail & Related papers (2021-03-29T01:33:34Z) - Dynamic Model Pruning with Feedback [64.019079257231]
We propose a novel model compression method that generates a sparse trained model without additional overhead.
We evaluate our method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models.
arXiv Detail & Related papers (2020-06-12T15:07:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.