Element-aware Summarization with Large Language Models: Expert-aligned
Evaluation and Chain-of-Thought Method
- URL: http://arxiv.org/abs/2305.13412v1
- Date: Mon, 22 May 2023 18:54:35 GMT
- Title: Element-aware Summarization with Large Language Models: Expert-aligned
Evaluation and Chain-of-Thought Method
- Authors: Yiming Wang, Zhuosheng Zhang, Rui Wang
- Abstract summary: Automatic summarization generates concise summaries that contain key ideas of source documents.
References from CNN/DailyMail and BBC XSum are noisy, mainly in terms of factual hallucination and information redundancy.
We propose a Summary Chain-of-Thought (SumCoT) technique to elicit LLMs to generate summaries step by step.
Experimental results show our method outperforms state-of-the-art fine-tuned PLMs and zero-shot LLMs by +4.33/+4.77 in ROUGE-L.
- Score: 35.181659789684545
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic summarization generates concise summaries that contain key ideas of
source documents. As the most mainstream datasets for the news sub-domain,
CNN/DailyMail and BBC XSum have been widely used for performance benchmarking.
However, the reference summaries of those datasets turn out to be noisy, mainly
in terms of factual hallucination and information redundancy. To address this
challenge, we first annotate new expert-writing Element-aware test sets
following the "Lasswell Communication Model" proposed by Lasswell (1948),
allowing reference summaries to focus on more fine-grained news elements
objectively and comprehensively. Utilizing the new test sets, we observe the
surprising zero-shot summary ability of LLMs, which addresses the issue of the
inconsistent results between human preference and automatic evaluation metrics
of LLMs' zero-shot summaries in prior work. Further, we propose a Summary
Chain-of-Thought (SumCoT) technique to elicit LLMs to generate summaries step
by step, which helps them integrate more fine-grained details of source
documents into the final summaries that correlate with the human writing
mindset. Experimental results show our method outperforms state-of-the-art
fine-tuned PLMs and zero-shot LLMs by +4.33/+4.77 in ROUGE-L on the two
datasets, respectively. Dataset and code are publicly available at
https://github.com/Alsace08/SumCoT.
Related papers
- Scaling Up Summarization: Leveraging Large Language Models for Long Text Extractive Summarization [0.27624021966289597]
This paper introduces EYEGLAXS, a framework that leverages Large Language Models (LLMs) for extractive summarization.
EYEGLAXS focuses on extractive summarization to ensure factual and grammatical integrity.
The system sets new performance benchmarks on well-known datasets like PubMed and ArXiv.
arXiv Detail & Related papers (2024-08-28T13:52:19Z) - Towards Enhancing Coherence in Extractive Summarization: Dataset and Experiments with LLMs [70.15262704746378]
We propose a systematically created human-annotated dataset consisting of coherent summaries for five publicly available datasets and natural language user feedback.
Preliminary experiments with Falcon-40B and Llama-2-13B show significant performance improvements (10% Rouge-L) in terms of producing coherent summaries.
arXiv Detail & Related papers (2024-07-05T20:25:04Z) - FineSurE: Fine-grained Summarization Evaluation using LLMs [22.62504593575933]
FineSurE is a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs)
It also employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment.
arXiv Detail & Related papers (2024-07-01T02:20:28Z) - LaMSUM: Creating Extractive Summaries of User Generated Content using LLMs [6.770555526416268]
Large Language Models (LLMs) have demonstrated impressive performance across a wide range of NLP tasks, including summarization.
We introduce LaMSUM, a novel framework designed to generate extractive summaries from large collections of user-generated text.
arXiv Detail & Related papers (2024-06-22T10:25:55Z) - Assessing LLMs for Zero-shot Abstractive Summarization Through the Lens of Relevance Paraphrasing [37.400757839157116]
Large Language Models (LLMs) have achieved state-of-the-art performance at zero-shot generation of abstractive summaries for given articles.
We propose relevance paraphrasing, a simple strategy that can be used to measure the robustness of LLMs as summarizers.
arXiv Detail & Related papers (2024-06-06T12:08:43Z) - On Context Utilization in Summarization with Large Language Models [83.84459732796302]
Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries.
Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens.
We conduct the first comprehensive study on context utilization and position bias in summarization.
arXiv Detail & Related papers (2023-10-16T16:45:12Z) - BooookScore: A systematic exploration of book-length summarization in the era of LLMs [53.42917858142565]
We develop an automatic metric, BooookScore, that measures the proportion of sentences in a summary that do not contain any of the identified error types.
We find that closed-source LLMs such as GPT-4 and 2 produce summaries with higher BooookScore than those generated by open-source models.
arXiv Detail & Related papers (2023-10-01T20:46:44Z) - Summarization is (Almost) Dead [49.360752383801305]
We develop new datasets and conduct human evaluation experiments to evaluate the zero-shot generation capability of large language models (LLMs)
Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models.
arXiv Detail & Related papers (2023-09-18T08:13:01Z) - LMGQS: A Large-scale Dataset for Query-focused Summarization [77.6179359525065]
We convert four generic summarization benchmarks into a new QFS benchmark dataset, LMGQS.
We establish baselines with state-of-the-art summarization models.
We achieve state-of-the-art zero-shot and supervised performance on multiple existing QFS benchmarks.
arXiv Detail & Related papers (2023-05-22T14:53:45Z) - Evaluating the Factual Consistency of Large Language Models Through News
Summarization [97.04685401448499]
We propose a new benchmark called FIB(Factual Inconsistency Benchmark) that focuses on the task of summarization.
For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent.
For factually inconsistent summaries, we generate summaries from a suite of summarization models that we have manually annotated as factually inconsistent.
arXiv Detail & Related papers (2022-11-15T18:50:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.