Lexical Repetitions Lead to Rote Learning: Unveiling the Impact of
Lexical Overlap in Train and Test Reference Summaries
- URL: http://arxiv.org/abs/2311.09458v1
- Date: Wed, 15 Nov 2023 23:47:53 GMT
- Title: Lexical Repetitions Lead to Rote Learning: Unveiling the Impact of
Lexical Overlap in Train and Test Reference Summaries
- Authors: Prafulla Kumar Choubey and Alexander R. Fabbri and Caiming Xiong and
Chien-Sheng Wu
- Abstract summary: Ideal summarization models should generalize to novel summary-worthy content without remembering reference training summaries by rote.
We propose a fine-grained evaluation protocol by partitioning a test set based on the lexical similarity of reference test summaries with training summaries.
- Score: 131.80860903537172
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ideal summarization models should generalize to novel summary-worthy content
without remembering reference training summaries by rote. However, a single
average performance score on the entire test set is inadequate in determining
such model competencies. We propose a fine-grained evaluation protocol by
partitioning a test set based on the lexical similarity of reference test
summaries with training summaries. We observe up to a 5x (1.2x) difference in
ROUGE-2 (entity recall) scores between the subsets with the lowest and highest
similarity. Next, we show that such training repetitions also make a model
vulnerable to rote learning, reproducing data artifacts such as factual errors,
especially when reference test summaries are lexically close to training
summaries. Consequently, we propose to limit lexical repetitions in training
summaries during both supervised fine-tuning and likelihood calibration stages
to improve the performance on novel test cases while retaining average
performance. Our automatic and human evaluations on novel test subsets and
recent news articles show that limiting lexical repetitions in training
summaries can prevent rote learning and improve generalization.
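
The evaluation protocol reduces to one primitive: score each test reference summary by its maximum lexical similarity to any training summary, then bucket the test set by that score. Below is a minimal sketch of that partitioning, using bigram-overlap F1 as a stand-in for ROUGE-2; the function names, bucket count, and whitespace tokenization are our assumptions, not the paper's.

```python
def bigrams(text):
    """Lowercased word-bigram set of a summary."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))

def max_overlap(test_summary, train_bigram_sets):
    """Highest bigram-overlap F1 of a test summary against any training summary."""
    test_bg = bigrams(test_summary)
    best = 0.0
    for train_bg in train_bigram_sets:
        inter = len(test_bg & train_bg)
        if inter == 0:
            continue
        f1 = 2 * inter / (len(test_bg) + len(train_bg))  # Dice == F1 over sets
        best = max(best, f1)
    return best

def partition_by_overlap(test_summaries, train_summaries, n_buckets=4):
    """Split test indices into n_buckets of increasing similarity to training data."""
    train_sets = [bigrams(s) for s in train_summaries]
    ranked = sorted(range(len(test_summaries)),
                    key=lambda i: max_overlap(test_summaries[i], train_sets))
    size = -(-len(ranked) // n_buckets)  # ceiling division
    return [ranked[b * size:(b + 1) * size] for b in range(n_buckets)]
```

Per-bucket ROUGE scores over these partitions are what expose the 5x gap reported above. The brute-force inner loop is O(|train|) per test summary, so a real run over a large training set would want an inverted index from bigrams to training summaries.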
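The mitigation side, limiting lexical repetitions in the training summaries used for fine-tuning and likelihood calibration, can be read as a data-filtering pass over the training set. The abstract does not spell out the exact criterion, so the greedy n-gram cap below is only one plausible instantiation; `max_repeats` and every name here are assumptions.

```python
from collections import Counter

def cap_lexical_repetitions(train_pairs, max_repeats=5, n=2):
    """Greedily keep (article, summary) pairs whose summary does not push
    any summary n-gram past the repetition cap seen so far."""
    counts = Counter()
    kept = []
    for article, summary in train_pairs:
        tokens = summary.lower().split()
        ngrams = set(zip(*(tokens[i:] for i in range(n))))
        # Drop the pair if any of its summary n-grams already hit the cap.
        if any(counts[g] >= max_repeats for g in ngrams):
            continue
        counts.update(ngrams)  # count each n-gram once per kept summary
        kept.append((article, summary))
    return kept
```

Filtering this way trades some training data for lexical diversity; per the abstract, the payoff is better performance on novel test cases while roughly retaining average performance.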
Related papers
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to collect in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
- Learning with Rejection for Abstractive Text Summarization [42.15551472507393]
We propose a training objective for abstractive summarization based on rejection learning.
We show that our method considerably improves the factuality of generated summaries in automatic and human evaluations.
arXiv Detail & Related papers (2023-02-16T19:07:08Z)
- Uncontrolled Lexical Exposure Leads to Overestimation of Compositional Generalization in Pretrained Models [31.573015421633155]
We argue that exposure to pretraining data may break distributional control.
We find that both of these setups lead to lower generalization performance in T5.
arXiv Detail & Related papers (2022-12-21T05:02:08Z)
- Correcting Diverse Factual Errors in Abstractive Summarization via Post-Editing and Language Model Infilling [56.70682379371534]
We show that our approach vastly outperforms prior methods in correcting erroneous summaries.
Our model -- FactEdit -- improves factuality scores by over 11 points on CNN/DM and over 31 points on XSum.
arXiv Detail & Related papers (2022-10-22T07:16:19Z)
- COLO: A Contrastive Learning based Re-ranking Framework for One-Stage Summarization [84.70895015194188]
We propose a Contrastive Learning based re-ranking framework for one-stage summarization called COLO.
COLO boosts the extractive and abstractive results of one-stage systems on the CNN/DailyMail benchmark to 44.58 and 46.33 ROUGE-1, respectively.
arXiv Detail & Related papers (2022-09-29T06:11:21Z)
- SNaC: Coherence Error Detection for Narrative Summarization [73.48220043216087]
We introduce SNaC, a narrative coherence evaluation framework rooted in fine-grained annotations for long summaries.
We develop a taxonomy of coherence errors in generated narrative summaries and collect span-level annotations for 6.6k sentences across 150 book and movie screenplay summaries.
Our work provides the first characterization of coherence errors generated by state-of-the-art summarization models and a protocol for eliciting coherence judgments from crowd annotators.
arXiv Detail & Related papers (2022-05-19T16:01:47Z)
- CLIFF: Contrastive Learning for Improving Faithfulness and Factuality in Abstractive Summarization [6.017006996402699]
We study generating abstractive summaries that are faithful and factually consistent with the given articles.
A novel contrastive learning formulation is presented that uses reference summaries as positive training data and automatically generated erroneous summaries as negative training data, training summarization systems to better distinguish between them.
arXiv Detail & Related papers (2021-09-19T20:05:21Z)
- Noisy Self-Knowledge Distillation for Text Summarization [83.49809205891496]
We apply self-knowledge distillation to text summarization which we argue can alleviate problems with maximum-likelihood training.
Our student summarization model is trained with guidance from a teacher which generates smoothed labels to help regularize training.
We demonstrate experimentally on three benchmarks that our framework boosts the performance of both pretrained and non-pretrained summarizers.
arXiv Detail & Related papers (2020-09-15T12:53:09Z)
- SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling [25.299937353444854]
We present a proof-of-concept study of a weakly supervised summary evaluation approach that does not require reference summaries.
Large amounts of data from existing summarization datasets are transformed for training by pairing documents with corrupted reference summaries.
arXiv Detail & Related papers (2020-05-13T15:40:13Z)
- Learning by Semantic Similarity Makes Abstractive Summarization Better [13.324006587838522]
We compare summaries generated by a recent LM, BART, with the reference summaries from a benchmark dataset, CNN/DM.
Interestingly, the model-generated summaries receive higher scores than the reference summaries.
arXiv Detail & Related papers (2020-02-18T17:59:02Z)