Recent progress in abstractive text summarization largely relies on large pre-trained sequence-to-sequence Transformer models, which are computationally expensive. This paper aims to distill these large models into smaller ones for faster inference with minimal performance loss. Pseudo-labeling based methods are popular in sequence-to-sequence model distillation. In this paper, we find that simply manipulating attention temperatures in Transformers can make pseudo labels easier for student models to learn. Our experiments on three summarization datasets show that our proposed method consistently improves over vanilla pseudo-labeling based methods. We also find that both the pseudo labels and the summaries produced by our students are shorter and more abstractive. We will make our code and models publicly available.
In the literature, there are mainly two kinds of methods for summarization: extractive summarization and abstractive summarization.
In this work, we focus on abstractive summarization, which is viewed as a sequence-to-sequence (Seq2Seq) learning problem, since recent abstractive models outperform their extractive counterparts and can produce more concise summaries [26, 17, 42, 19].
With these extremely large models, we can obtain state-of-the-art summarization results, but they are slow at online inference, which makes them difficult to use in production environments even with cutting-edge hardware.
An effective distillation method for Seq2Seq models is pseudo-labeling, where the teacher model generates pseudo summaries for all documents in the training set and the resulting document-pseudo-summary pairs are used to train the student model.
In this paper, we argue that attention distributions of a teacher model might be too sharp.
As a result, pseudo labels generated from it are sub-optimal for student models.
In the summarization task, we observed that 1) pseudo summaries generated by our teacher model copy more continuous text spans from the original document than reference summaries do (on the CNN/DailyMail dataset, 56% of 4-grams in pseudo summaries versus 15% of 4-grams in reference summaries are copied from the original documents);
Text spans in bold are copied spans (with more than four words) from the original document.
[Reference]: Mentally ill inmates in Miami are housed on the “forgotten floor” </s> Judge Steven Leifman says most are there as a result of “avoidable felonies” </s> While CNN tours facility, patient shouts: “I am the son of the president” </s> Leifman says the system is unjust and he’s fighting for change.
[Regular]: </s> Judge Steven Leifman says about one-third of all people in Miami-Dade county jails are mentally ill. </s> He says they face drug charges or charges of assaulting an officer, which are “avoidable felonies” </s> He says the arrests often result from confrontations with police, which exacerbate their illness.
[Smoothed]: Mentally ill inmates in Miami are housed on the “forgotten floor” </s> Judge Steven Leifman says they are there because of “avoidable felonies” </s> He says many of them are in jail for drug or assault charges. </s> He says the system is unjust and he’s trying to change it.
2) pseudo summaries tend to summarize the leading part of a document (on CNN/DailyMail, 74% of sentences in pseudo summaries and 64% of sentences in reference summaries come from the leading 40% of sentences in the original documents).
In either case, the attention distribution is too sharp (i.e., the attention weights of the next word position or of the leading part are much larger than those of other positions), which means our teacher model is over-confident.
Indeed, by using a higher attention temperature (raised from 64 to 96), the copy bias becomes less severe (the ratio of copied 4-grams is reduced from 56% to 50%), as does the leading bias (the portion of sentences in pseudo summaries describing the leading 40% of sentences in the document is reduced from 74% to 70%).
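This copy bias can be quantified with a copied n-gram ratio. Below is a minimal sketch of such a measurement (whitespace tokenization and the function name are illustrative assumptions, not the paper's exact measurement script):

```python
def copied_ngram_ratio(summary: str, document: str, n: int = 4) -> float:
    """Fraction of summary n-grams that appear verbatim in the document."""
    summ_toks, doc_toks = summary.split(), document.split()
    doc_ngrams = {tuple(doc_toks[i:i + n]) for i in range(len(doc_toks) - n + 1)}
    summ_ngrams = [tuple(summ_toks[i:i + n]) for i in range(len(summ_toks) - n + 1)]
    if not summ_ngrams:
        return 0.0
    return sum(ng in doc_ngrams for ng in summ_ngrams) / len(summ_ngrams)
```

Averaged over a dataset, a ratio of this kind underlies the 56% versus 15% comparison above.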
Figure 1 also shows the effect of using higher attention temperature.
Lines of high attention weights become shorter, and the positions of high attention weights extend to the first 450 words of the document.
Less copy bias in pseudo summaries encourages student models to be more abstractive, while less leading bias in pseudo summaries encourages our student models to take advantage of longer context in documents.
Experiments on CNN/DailyMail, XSum, and New York Times datasets with student models of different sizes show our simple distillation method consistently outperforms vanilla pseudo-labeling based methods.
With our method, we empirically find that both pseudo summaries generated by our teacher models and summaries generated by our student models are shorter and more abstractive, which matches the goal of abstractive summarization.
These models achieve strong results in summarization but are slow during inference.
Our method can make them faster.
In knowledge distillation, a teacher model can be used to help the training of a student model.
In addition to learning from gold labels in the training set, student models can learn from the soft targets [1, 12], intermediate hidden states, attentions [41, 38], or target output derivatives of teacher models.
Recent work on the distillation of pre-trained Transformers (e.g., DistilBERT, TinyBERT, MobileBERT, BERT-of-Theseus, MiniLM) focuses on natural language understanding tasks such as the GLUE and SQuAD benchmarks.
However, the sequence-level knowledge of teacher models is not well utilized.
Therefore, Kim and Rush  introduce a sequence-level knowledge distillation method (i.e., pseudo-labeling), where a student model is trained with pseudo labels generated by the teacher model using beam search decoding.
Kim and Rush  and later work [14, 9, 5] show pseudo-labeling achieves competitive performance for Seq2Seq tasks such as machine translation.
Shleifer and Rush propose the shrink-and-fine-tune (SFT) approach for pre-trained summarization distillation, which re-finetunes a teacher model with some layers removed; they show SFT outperforms pseudo-labeling and a modification of direct knowledge distillation on one of their datasets, but not the others.
[Figure 1: visualization of cross-attention weights at attention temperatures 64 and 96; axes are token index in summary (0–70) and token index in document (0–500).]
the final student is distilled with an ensemble of all available models.
Xie et al. propose noisy student training, which injects input and model noise during student model training and improves image classification performance on ImageNet.
Liu et al. and He et al. observe that adding noise to teacher and/or student models during self-distillation can improve Seq2Seq tasks such as machine translation and summarization.
Our method can also be applied in self-distillation and can potentially be combined with the self-distillation methods above.
3 Summarization distillation

In this section, we introduce our distillation method PLATE.
3.1 Transformer based abstractive summarization
Abstractive summarization aims to rewrite a document into a shorter form (i.e., a summary), which is a typical Seq2Seq learning problem (note that both the input and output are sequences of tokens).
Kim and Rush also show that the sequence-level pseudo-labeling based method obtains better performance than its token-level counterpart.
Specifically, suppose we have a document $X$, and $\hat{Y} = (\hat{y}_1, \hat{y}_2, \dots, \hat{y}_{|\hat{Y}|})$ is a pseudo summary generated by a teacher model using beam search. The student can be trained by minimizing the negative log-likelihood of document-to-pseudo-summary pairs:

$$\mathcal{L}_{\text{PL}}(\theta) = -\frac{1}{|\hat{Y}|} \sum_{t=1}^{|\hat{Y}|} \log p(\hat{y}_t \mid \hat{y}_{<t}, X; \theta) \quad (3)$$

Strictly, all possible pseudo summaries from $X$ should be taken into account.
Unfortunately, the computational cost is prohibitive.
We therefore use a single sample $\hat{Y}$ (which takes a large portion of the probability mass from the teacher) instead, as in Kim and Rush.
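In code, minimizing Eq. (3) on a single sampled pseudo summary reduces to ordinary teacher-forced cross-entropy on the document-pseudo-summary pair. A minimal PyTorch sketch, assuming a Hugging Face-style Seq2Seq student model (all names are illustrative):

```python
import torch.nn.functional as F

def pseudo_label_loss(student, doc_ids, pseudo_ids, pad_id):
    """NLL of a teacher-generated pseudo summary (Eq. 3), averaged over tokens.

    doc_ids:    (batch, src_len) token ids of documents X
    pseudo_ids: (batch, tgt_len) token ids of pseudo summaries Y_hat
    """
    decoder_input = pseudo_ids[:, :-1]   # y_<t fed to the decoder
    labels = pseudo_ids[:, 1:]           # y_t to be predicted
    logits = student(input_ids=doc_ids,
                     decoder_input_ids=decoder_input).logits
    # ignore_index drops padding, so the mean runs over the |Y_hat| real tokens
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1), ignore_index=pad_id)
```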
3.3 Re-scaling attention temperatures
Both our teacher and student models are Seq2Seq Transformer models.
The core part of a Transformer model is the attention module:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{\top}}{\tau}\right)V \quad (4)$$

where $Q$, $K$, $V$ are linear projections of the hidden states of a layer and $\tau$ is the temperature of the attention module, which is usually set to $\sqrt{d}$ ($d$ is the hidden dimension size of that attention head).

Our distillation method PLATE works as follows. Assume we have a teacher model trained with $\tau = \sqrt{d}$. When generating pseudo labels from the teacher with beam search, we use a higher attention temperature and set $\tau = \lambda\sqrt{d}$, where $\lambda > 1$ ($\lambda$ is the attention temperature coefficient).
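A minimal sketch of the re-scaled attention in Eq. (4) follows (the function name and interface are illustrative, not the paper's released implementation):

```python
import math
import torch.nn.functional as F

def plate_attention(q, k, v, lam=1.0, mask=None):
    """Scaled dot-product attention with temperature tau = lam * sqrt(d).

    lam = 1.0 recovers vanilla attention; lam > 1 (e.g., 1.5 or 2.0 as in the
    experiments) flattens the attention distributions, which PLATE applies in
    the teacher while decoding pseudo summaries with beam search.
    """
    d = q.size(-1)                        # hidden dimension of this attention head
    tau = lam * math.sqrt(d)
    scores = q @ k.transpose(-2, -1) / tau
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```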
Their student model settings are BART 12-6 on CNNDM and BART 12-3 on XSum.
Results of our BART 12-3 and BART 12-6 student models are in the third and fourth blocks.
We present results of students trained with gold labels (Gold), regular pseudo labels (Regular), and pseudo labels generated with higher or random attention temperatures; for example, PLATEB12-3 λ=1.5 means the student uses attention temperature coefficient λ = 1.5 with architecture setting BART 12-3. We use a random attention temperature in PLATEB12-3 rnd, with λ ∼ U[1.0, 2.0].
We observe that using pseudo-labeling methods with higher attention temperatures consistently improves over their counterpart with normal attention temperatures (Regular) across all three datasets, and the differences between them are almost always significant as measured with the ROUGE script (see details in Table 2).
It is also interesting to see that students distilled with pseudo-labeling do improve over gold-label based students when using randomly initialized Transformers, but not with pre-trained models (i.e., BART 12-6 and BART 12-3), which may be due to the strong modeling power of large pre-trained Transformers.
We convert the ranks to rank ratings (rank i to 5 − i) and further conduct a Student's t-test on these ratings.
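Concretely, the rank-to-rating conversion and the significance test could be run as follows (a sketch with made-up placeholder ranks; scipy is one way to perform the test):

```python
from scipy.stats import ttest_ind

def ranks_to_ratings(ranks):
    # Rank i (1 = best) becomes rating 5 - i, so higher ratings are better.
    return [5 - r for r in ranks]

# Hypothetical human-evaluation ranks for two systems (placeholders only).
plate_ranks = [1, 2, 1, 3, 1, 2]
regular_ranks = [2, 3, 2, 4, 2, 3]
t_stat, p_value = ttest_ind(ranks_to_ratings(plate_ranks),
                            ranks_to_ratings(regular_ranks))
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # significant if p < 0.05
```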
As shown in Table 4, PLATEB12-6 λ=2.0 obtains the best ranking score and the difference between PLATEB12-6 λ=2.0 and the regular pseudo-labeling based method Regular is significant (p < 0.05), which indicates our proposed method PLATE indeed produces better summaries.
Ablation study In a Seq2Seq Transformer model, there are three types of attention modules (i.e., encoder self-attention, decoder self-attention, and decoder cross-attention), and we can scale the attention temperatures for all of them or only some of them.
Let λenc denote the attention temperature coefficient for the encoder self-attention module, λcross the coefficient for the decoder cross-attention module, and λdec the coefficient for the decoder self-attention module.
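A short sketch of how the three coefficients translate into per-module temperatures (the helper and its names are our own illustration):

```python
import math

def module_temperatures(d, lam_enc=1.0, lam_cross=1.0, lam_dec=1.0):
    """Per-module attention temperatures; a coefficient of 1.0 keeps the
    vanilla temperature sqrt(d) for that module."""
    return {
        "encoder_self_attention": lam_enc * math.sqrt(d),
        "decoder_cross_attention": lam_cross * math.sqrt(d),
        "decoder_self_attention": lam_dec * math.sqrt(d),
    }

# E.g., re-scale only the decoder-side modules during pseudo-label generation
# (d = 64 is the per-head dimension of BART-large, used here as an example):
taus = module_temperatures(d=64, lam_cross=2.0, lam_dec=2.0)
```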
Besides, the attention temperature of the decoder self-attention is also crucial (see the fourth row).
4.5 Analysis

Why does our distillation method work?
To answer this question, we try to analyze the reasons from both the external characteristics of the summaries generated by the teacher model and the internal characteristics of the teacher’s attention mechanism.
The mean proportions of evident attentions for all bins are shown in Figure 2.
Compared to the teacher with normal attention temperature (pink bar), teachers with higher attention temperatures (blue and green bars) attend less to the leading parts of documents and more to the tail parts of documents.
Further empirical analysis shows that by using our method, teacher models can generate more concise and abstractive summaries.
As a result, summaries produced by student models also become more concise and abstractive.
In the future, we would like to apply our method to other generation tasks as well as self-training with unlabeled data.
We are also interested in extending our method for better teacher model training.
References

[1] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In NIPS, 2014.
[2] Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, et al. UniLMv2: Pseudo-masked language models for unified language model pre-training. In International Conference on Machine Learning, pages 642–652.
[3] Wojciech Marian Czarnecki, Simon Osindero, Max Jaderberg, Grzegorz Swirszcz, and Razvan Pascanu. Sobolev training for neural networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 4281–4290, 2017.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.
[5] Michael Denkowski and Graham Neubig. Stronger baselines for trustable results in neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 18–27, Vancouver, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-3203. URL https://www.aclweb.org/anthology/W17-3203.
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423.
[7] Greg Durrett, Taylor Berg-Kirkpatrick, and Dan Klein.
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China, November 2019.
Compared with the regular pseudo-labeling method ([Regular]), the summary generated by our method PLATEB12-6 λ=1.5 omits the modifiers "Nirvana frontman" and "Nirvana bassist" of the persons "Kurt Cobain" and "Krist Novoselic", respectively, and the resulting summary is shorter and more abstractive.
The summary generated by our method PLATEB12-6 λ=2.0 contains the text "will premiere on HBO on May 4", which is at the end of the source document and included in the reference (i.e., summary worthy), but is ignored by [Regular].
A companion book containing art and archival documents from Cobain is being produced to accompany the film.
[PLATEB12-6 λ=2.0]: "Montage of Heck" is directed by Brett Morgen and will premiere on HBO on May 4.
A companion book containing art and archival documents from Cobain is being produced to accompany the documentary.
The soundtrack will include "a mind-blowing 12minute acoustic Cobain unheard track," Morgen says.
Example 2 The second example is shown in Table 9 (outputs) and Figure 4 (attention visualization).
In this example, the source document is relatively long (over 700 words).
As shown in Figure 4, the summary generated with the regular pseudo-labeling method (Regular) mainly focuses on the leading part of the source document (around the first 150 words), but our method PLATEB12-6 λ=2.0 takes into account tokens in the front, middle, and tail of the source document.
In Table 9, the summary from PLATEB12-6 λ=2.0 contains the key sentence "Peter Bergen: Pilots are not different from other people, but they can be careless, lazy, inattentive and reckless", which is similar to the reference sentence "Peter Garrison: Pilots don’t exist on different moral plane than the rest of us".
The sentence "the human mind is the blackest of boxes" in the reference, which appears at the tail of the source document, is also included in the summary of PLATEB12-6 λ=2.0.
Peter Bergen: Pilots are not different from other people, but they can be careless, lazy, inattentive and reckless.
He says the human mind is the blackest of boxes; no one can peer inside it.
[Figure: three cross-attention heatmaps at attention temperatures 64, 96, and 128; axes are token index in summary and token index in document (0–200).]
Figure 4: Second example of the visualization of cross-attention weights when the student generates summaries with different attention temperatures.
[Three attention heatmaps at temperatures 64, 96, and 128; axes are token index in summary and token index in document (0–700).]