Attention Temperature Matters in Abstractive Summarization Distillation
- URL: http://arxiv.org/abs/2106.03441v2
- Date: Tue, 8 Jun 2021 03:09:45 GMT
- Title: Attention Temperature Matters in Abstractive Summarization Distillation
- Authors: Shengqiang Zhang, Xingxing Zhang, Hangbo Bao, Furu Wei
- Abstract summary: This paper aims to distill large sequence-to-sequence Transformer models into smaller ones for faster inference and minimal performance loss.
We find that simply manipulating attention temperatures in Transformers can make pseudo labels easier for student models to learn.
- Score: 43.12920043942568
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress of abstractive text summarization largely relies on large
pre-trained sequence-to-sequence Transformer models, which are computationally
expensive. This paper aims to distill these large models into smaller ones for
faster inference and minimal performance loss. Pseudo-labeling-based methods
are popular in sequence-to-sequence model distillation. In this paper, we find
that simply manipulating attention temperatures in Transformers can make pseudo
labels easier to learn for student models. Our experiments on three
summarization datasets show that our proposed method consistently improves over
vanilla pseudo-labeling-based methods. We also find that both the pseudo labels
and summaries produced by our students are shorter and more abstractive. We
will make our code and models publicly available.
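To illustrate the core idea, the sketch below shows scaled dot-product attention with an explicit temperature parameter: dividing the attention logits by a temperature before the softmax flattens (tau > 1) or sharpens (tau < 1) the attention distribution. This is a minimal NumPy sketch of the general mechanism, not the paper's released code; the function names, tensor shapes, and the example value tau = 2.0 are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_temperature(Q, K, V, tau=1.0):
    """Scaled dot-product attention with an extra temperature tau.

    tau = 1.0 recovers standard attention; tau > 1.0 flattens the attention
    distribution and tau < 1.0 sharpens it. The value used when the teacher
    generates pseudo labels is an experimental choice, not fixed here.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # standard 1/sqrt(d_k) scaling
    weights = softmax(scores / tau, axis=-1)          # temperature rescaling
    return weights @ V

# Toy usage: batch of 1, sequence length 4, head dimension 8 (shapes are arbitrary).
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(1, 4, 8)) for _ in range(3))
out_standard = attention_with_temperature(Q, K, V, tau=1.0)
out_rescaled = attention_with_temperature(Q, K, V, tau=2.0)
```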
Related papers
- Heat Death of Generative Models in Closed-Loop Learning [63.83608300361159]
We study the learning dynamics of generative models that are fed back their own produced content in addition to their original training dataset.
We show that, unless a sufficient amount of external data is introduced at each iteration, any non-trivial temperature leads the model to degenerate.
arXiv Detail & Related papers (2024-04-02T21:51:39Z)
- Enhancing Abstractiveness of Summarization Models through Calibrated Distillation [30.199051061633803]
DisCal is a novel approach to enhance the level of abstractiveness without sacrificing informativeness.
Our experiments show that DisCal outperforms prior methods in abstractive summarization distillation.
arXiv Detail & Related papers (2023-10-20T18:43:49Z)
- Label-Retrieval-Augmented Diffusion Models for Learning from Noisy Labels [61.97359362447732]
Learning from noisy labels is an important and long-standing problem in machine learning for real applications.
In this paper, we reformulate the label-noise problem from a generative-model perspective.
Our model achieves new state-of-the-art (SOTA) results on all the standard real-world benchmark datasets.
arXiv Detail & Related papers (2023-05-31T03:01:36Z)
- Self-Evolution Learning for Mixup: Enhance Data Augmentation on Few-Shot Text Classification Tasks [75.42002070547267]
We propose a self-evolution learning (SE) based mixup approach for data augmentation in text classification.
We introduce a novel instance-specific label smoothing approach, which linearly interpolates the model's output and the one-hot labels of the original samples to generate new soft labels for mixing up.
arXiv Detail & Related papers (2023-05-22T23:43:23Z)
- Referee: Reference-Free Sentence Summarization with Sharper Controllability through Symbolic Knowledge Distillation [72.70058049274664]
We present Referee, a novel framework for sentence summarization that can be trained reference-free (i.e., requiring no gold summaries for supervision).
Our work is the first to demonstrate that reference-free, controlled sentence summarization is feasible via the conceptual framework of Symbolic Knowledge Distillation.
arXiv Detail & Related papers (2022-10-25T07:07:54Z)
- LOPS: Learning Order Inspired Pseudo-Label Selection for Weakly Supervised Text Classification [28.37907856670151]
Pseudo-labels are inherently noisy, so selecting the correct ones offers a large potential performance boost.
We propose LOPS, a novel pseudo-label selection method that takes the learning order of samples into consideration.
LOPS can be viewed as a strong performance-boosting plug-in for most existing weakly supervised text classification methods.
arXiv Detail & Related papers (2022-05-25T06:46:48Z)
- Learning from Noisy Labels for Entity-Centric Information Extraction [17.50856935207308]
We propose a simple co-regularization framework for entity-centric information extraction.
These models are jointly optimized with a task-specific loss and regularized to generate similar predictions.
In the end, we can take any of the trained models for inference.
arXiv Detail & Related papers (2021-04-17T22:49:12Z)
- Pre-trained Summarization Distillation [121.14806854092672]
Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowledge distillation.
Alternatively, machine translation practitioners distill using pseudo-labeling, where a small model is trained on the translations of a larger model.
A third, simpler approach is to 'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning.
arXiv Detail & Related papers (2020-10-24T23:15:43Z)
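The 'shrink and fine-tune' recipe described in the last entry above can be sketched in a framework-agnostic way: initialize a smaller student by copying a subset of the teacher's layers, then fine-tune the student as usual. The sketch below is an illustration under assumed names; keeping every other layer is an assumption for the example, not a prescription from the paper.

```python
# Minimal sketch of "shrink and fine-tune" (SFT): copy a subset of the
# teacher's layers into a smaller student, then fine-tune the student.
# The layer names below are illustrative stand-ins, not a real library API.

def shrink(teacher_layers, keep_every=2):
    """Select the teacher layers used to initialize the student.

    Keeping every other layer is one common heuristic; the actual
    selection strategy is a design choice.
    """
    return [layer for i, layer in enumerate(teacher_layers) if i % keep_every == 0]


teacher_decoder = [f"decoder.layers.{i}" for i in range(12)]  # stand-ins for real modules
student_decoder = shrink(teacher_decoder)                     # 6 layers kept from the teacher
print(student_decoder)
# ['decoder.layers.0', 'decoder.layers.2', ..., 'decoder.layers.10']
# After copying the corresponding weights, the student is fine-tuned on the
# original summarization data, with no explicit distillation loss required.
```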