News Summarization and Evaluation in the Era of GPT-3
- URL: http://arxiv.org/abs/2209.12356v2
- Date: Tue, 23 May 2023 22:52:18 GMT
- Title: News Summarization and Evaluation in the Era of GPT-3
- Authors: Tanya Goyal, Junyi Jessy Li, Greg Durrett
- Abstract summary: We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
- Score: 73.48220043216087
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent success of prompting large language models like GPT-3 has led to a
paradigm shift in NLP research. In this paper, we study its impact on text
summarization, focusing on the classic benchmark domain of news summarization.
First, we investigate how GPT-3 compares against fine-tuned models trained on
large summarization datasets. We show that not only do humans overwhelmingly
prefer GPT-3 summaries, prompted using only a task description, but these also
do not suffer from common dataset-specific issues such as poor factuality.
Next, we study what this means for evaluation, particularly the role of gold
standard test sets. Our experiments show that both reference-based and
reference-free automatic metrics cannot reliably evaluate GPT-3 summaries.
Finally, we evaluate models on a setting beyond generic summarization,
specifically keyword-based summarization, and show how dominant fine-tuning
approaches compare to prompting.
To support further research, we release: (a) a corpus of 10K generated
summaries from fine-tuned and prompt-based models across 4 standard
summarization benchmarks, (b) 1K human preference judgments comparing different
systems for generic- and keyword-based summarization.
Related papers
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z) - SentimentGPT: Exploiting GPT for Advanced Sentiment Analysis and its
Departure from Current Machine Learning [5.177947445379688]
This study presents a thorough examination of various Generative Pretrained Transformer (GPT) methodologies in sentiment analysis.
Three primary strategies are employed: 1) prompt engineering using the advanced GPT-3.5 Turbo, 2) fine-tuning GPT models, and 3) an inventive approach to embedding classification.
The research yields detailed comparative insights among these strategies and individual GPT models, revealing their unique strengths and potential limitations.
arXiv Detail & Related papers (2023-07-16T05:33:35Z) - Cross-lingual Cross-temporal Summarization: Dataset, Models, Evaluation [20.675242617417677]
Cross-lingual cross-temporal summarization (CLCTS) is a largely unexplored area that has the potential to improve cross-cultural accessibility and understanding.
This paper comprehensively addresses the CLCTS task, including dataset creation, modeling, and evaluation.
arXiv Detail & Related papers (2023-06-22T14:31:18Z) - Large Language Models are not Fair Evaluators [60.27164804083752]
We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context.
This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other.
We propose a framework with three simple yet effective strategies to mitigate this issue.
arXiv Detail & Related papers (2023-05-29T07:41:03Z) - InheritSumm: A General, Versatile and Compact Summarizer by Distilling
from GPT [75.29359361404073]
InheritSumm is a versatile and compact summarization model derived from GPT-3.5 through distillation.
It achieves similar or superior performance to GPT-3.5 in zeroshot and fewshot settings.
arXiv Detail & Related papers (2023-05-22T14:52:32Z) - Summarizing, Simplifying, and Synthesizing Medical Evidence Using GPT-3
(with Varying Success) [36.646495151276326]
GPT-3 is able to produce high quality summaries of general domain news articles in few- and zero-shot settings.
We enlist domain experts (individuals with medical training) to evaluate summaries of biomedical articles generated by GPT-3, given zero supervision.
arXiv Detail & Related papers (2023-05-10T16:40:37Z) - Analyzing the Performance of GPT-3.5 and GPT-4 in Grammatical Error
Correction [28.58384091374763]
GPT-3 and GPT-4 models are powerful, achieving high performance on a variety of Natural Language Processing tasks.
We perform experiments testing the capabilities of a GPT-3.5 model (text-davinci-003) and a GPT-4 model (gpt-4-0314) on major GEC benchmarks.
We report the performance of our best prompt on the BEA-2019 and JFLEG datasets, finding that the GPT models can perform well in a sentence-level revision setting.
arXiv Detail & Related papers (2023-03-25T03:08:49Z) - Prompted Opinion Summarization with GPT-3.5 [115.95460650578678]
We show that GPT-3.5 models achieve very strong performance in human evaluation.
We argue that standard evaluation metrics do not reflect this, and introduce three new metrics targeting faithfulness, factuality, and genericity.
arXiv Detail & Related papers (2022-11-29T04:06:21Z) - SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.