Pre-trained Summarization Distillation
- URL: http://arxiv.org/abs/2010.13002v2
- Date: Wed, 28 Oct 2020 04:47:59 GMT
- Title: Pre-trained Summarization Distillation
- Authors: Sam Shleifer and Alexander M. Rush
- Abstract summary: Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowledge distillation.
Alternatively, machine translation practitioners distill using pseudo-labeling, where a small model is trained on the translations of a larger model.
A third, simpler approach is to 'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning.
- Score: 121.14806854092672
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent state-of-the-art approaches to summarization utilize large pre-trained
Transformer models. Distilling these models to smaller student models has
become critically important for practical use; however, there are many different
distillation methods proposed by the NLP literature. Recent work on distilling
BERT for classification and regression tasks shows strong performance using
direct knowledge distillation. Alternatively, machine translation practitioners
distill using pseudo-labeling, where a small model is trained on the
translations of a larger model. A third, simpler approach is to 'shrink and
fine-tune' (SFT), which avoids any explicit distillation by copying parameters
to a smaller student model and then fine-tuning. We compare these three
approaches for distillation of Pegasus and BART, the current and former
state-of-the-art pre-trained summarization models, and find that SFT outperforms
knowledge distillation and pseudo-labeling on the CNN/DailyMail dataset, but
under-performs pseudo-labeling on the more abstractive XSUM dataset. PyTorch
Code and checkpoints of different sizes are available through Hugging Face
transformers here http://tiny.cc/4iy0tz.
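The shrink-and-fine-tune recipe can be illustrated compactly. Below is a minimal sketch assuming the Hugging Face transformers library with PyTorch; the checkpoint name, the six-layer student depth, and the alternating-layer selection are illustrative assumptions rather than the paper's exact recipe.
```python
# A minimal sketch of "shrink and fine-tune" (SFT): copy teacher parameters into a
# shallower student, then fine-tune. Checkpoint, student depth, and layer choice
# are illustrative assumptions.
from transformers import BartConfig, BartForConditionalGeneration

teacher = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

# Student reuses the teacher's configuration but with half as many decoder layers.
student_config = BartConfig.from_pretrained("facebook/bart-large-cnn", decoder_layers=6)
student = BartForConditionalGeneration(student_config)

# Copy every parameter whose name and shape match (embeddings, encoder, lm_head,
# decoder layers 0-5); strict=False ignores the teacher-only decoder layers 6-11.
student.load_state_dict(teacher.state_dict(), strict=False)

# Overwrite the student's decoder layers with alternating teacher layers so the
# copied layers span the teacher's full depth.
kept_teacher_layers = [0, 2, 4, 6, 8, 10]
for student_idx, teacher_idx in enumerate(kept_teacher_layers):
    student.model.decoder.layers[student_idx].load_state_dict(
        teacher.model.decoder.layers[teacher_idx].state_dict()
    )

student.save_pretrained("bart-student-12-6-init")  # then fine-tune on the target data
```
Fine-tuning this student on the original summarization data completes SFT; pseudo-labeling would instead fine-tune the same student on summaries generated by the teacher.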
Related papers
- Tiny models from tiny data: Textual and null-text inversion for few-shot distillation [11.80626524879555]
Few-shot image classification involves classifying images using very few training examples.
Recent vision foundation models show excellent few-shot transfer abilities, but are large and slow at inference.
We present a novel diffusion model inversion technique (TINT) combining the diversity of textual inversion with the specificity of null-text inversion.
arXiv Detail & Related papers (2024-06-05T11:01:42Z)
- Exploring the potential of prototype-based soft-labels data distillation for imbalanced data classification [0.0]
The main goal is to further improve the classification accuracy of prototype-based soft-labels distillation.
Experimental studies trace the method's capability to distill the data, as well as its potential to act as an augmentation method.
arXiv Detail & Related papers (2024-03-25T19:15:19Z)
- Generic-to-Specific Distillation of Masked Autoencoders [119.21281960831651]
We propose generic-to-specific distillation (G2SD) to tap the potential of small ViT models under the supervision of large models pre-trained by masked autoencoders.
With G2SD, the vanilla ViT-Small model achieves 98.7%, 98.1%, and 99.3% of its teacher's performance on image classification, object detection, and semantic segmentation, respectively.
arXiv Detail & Related papers (2023-02-28T17:13:14Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- Referee: Reference-Free Sentence Summarization with Sharper Controllability through Symbolic Knowledge Distillation [72.70058049274664]
We present Referee, a novel framework for sentence summarization that can be trained reference-free (i.e., requiring no gold summaries for supervision).
Our work is the first to demonstrate that reference-free, controlled sentence summarization is feasible via the conceptual framework of Symbolic Knowledge Distillation.
arXiv Detail & Related papers (2022-10-25T07:07:54Z)
- Structured Pruning Learns Compact and Accurate Models [28.54826400747667]
We propose a task-specific structured pruning method, CoFi (Coarse- and Fine-grained Pruning).
CoFi delivers highly parallelizable subnetworks and matches distillation methods in both accuracy and latency.
Our experiments on GLUE and SQuAD datasets show that CoFi yields models with over 10x speedups with a small accuracy drop.
arXiv Detail & Related papers (2022-04-01T13:09:56Z)
- Learning to Generate Synthetic Training Data using Gradient Matching and Implicit Differentiation [77.34726150561087]
This article explores various data distillation techniques that can reduce the amount of data required to successfully train deep networks.
Inspired by recent ideas, we suggest new data distillation techniques based on generative teaching networks, gradient matching, and the Implicit Function Theorem.
arXiv Detail & Related papers (2022-03-16T11:45:32Z)
- Attention Temperature Matters in Abstractive Summarization Distillation [43.12920043942568]
This paper aims to distill large sequence-to-sequence Transformer models into smaller ones for faster inference with minimal performance loss.
We find that simply manipulating attention temperatures in Transformers can make pseudo labels easier for student models to learn (a minimal sketch of this idea follows the list below).
arXiv Detail & Related papers (2021-06-07T09:18:21Z)
- Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion (CMI), where data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z)
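The attention-temperature idea referenced above can be sketched in a few lines. The function name, tensor shapes, and temperature value below are assumptions for illustration; only the general mechanism of rescaling attention logits before the softmax is taken from the entry.
```python
# Illustrative sketch of temperature-scaled attention (assumed names and shapes);
# mathematically, tau > 1 flattens the attention distribution, tau < 1 sharpens it.
import math
import torch
import torch.nn.functional as F

def attention_with_temperature(q, k, v, tau=2.0):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = F.softmax(scores / tau, dim=-1)  # temperature rescales the attention logits
    return weights @ v

# Toy usage with random tensors.
q = k = v = torch.randn(1, 8, 16, 64)
out = attention_with_temperature(q, k, v, tau=2.0)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```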