Related papers: Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method

Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method

URL: http://arxiv.org/abs/2305.13412v1
Date: Mon, 22 May 2023 18:54:35 GMT
Title: Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method
Authors: Yiming Wang, Zhuosheng Zhang, Rui Wang
Abstract summary: Automatic summarization generates concise summaries that contain key ideas of source documents. References from CNN/DailyMail and BBC XSum are noisy, mainly in terms of factual hallucination and information redundancy. We propose a Summary Chain-of-Thought (SumCoT) technique to elicit LLMs to generate summaries step by step. Experimental results show our method outperforms state-of-the-art fine-tuned PLMs and zero-shot LLMs by +4.33/+4.77 in ROUGE-L.
Score: 35.181659789684545
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automatic summarization generates concise summaries that contain key ideas of source documents. As the most mainstream datasets for the news sub-domain, CNN/DailyMail and BBC XSum have been widely used for performance benchmarking. However, the reference summaries of those datasets turn out to be noisy, mainly in terms of factual hallucination and information redundancy. To address this challenge, we first annotate new expert-writing Element-aware test sets following the "Lasswell Communication Model" proposed by Lasswell (1948), allowing reference summaries to focus on more fine-grained news elements objectively and comprehensively. Utilizing the new test sets, we observe the surprising zero-shot summary ability of LLMs, which addresses the issue of the inconsistent results between human preference and automatic evaluation metrics of LLMs' zero-shot summaries in prior work. Further, we propose a Summary Chain-of-Thought (SumCoT) technique to elicit LLMs to generate summaries step by step, which helps them integrate more fine-grained details of source documents into the final summaries that correlate with the human writing mindset. Experimental results show our method outperforms state-of-the-art fine-tuned PLMs and zero-shot LLMs by +4.33/+4.77 in ROUGE-L on the two datasets, respectively. Dataset and code are publicly available at https://github.com/Alsace08/SumCoT.

Related papers

Consistency Evaluation of News Article Summaries Generated by Large (and Small) Language Models [0.0]
Large Language Models (LLMs) have shown promise in generating fluent abstractive summaries but they can produce hallucinated details not grounded in the source text. This paper embarks on an exploration of text summarization with a diverse set of techniques, including TextRank, BART, Mistral-7B-Instruct, and OpenAI GPT-3.5-Turbo. We find that all summarization models produce consistent summaries when tested on the XL-Sum dataset.
arXiv Detail & Related papers (2025-02-28T01:58:17Z)
Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs) We find that fine-tuning existing text embedding models on LLM-generated texts yields excellent classification accuracy. We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z)
Scaling Up Summarization: Leveraging Large Language Models for Long Text Extractive Summarization [0.27624021966289597]
This paper introduces EYEGLAXS, a framework that leverages Large Language Models (LLMs) for extractive summarization. EYEGLAXS focuses on extractive summarization to ensure factual and grammatical integrity. The system sets new performance benchmarks on well-known datasets like PubMed and ArXiv.
arXiv Detail & Related papers (2024-08-28T13:52:19Z)
Towards Enhancing Coherence in Extractive Summarization: Dataset and Experiments with LLMs [70.15262704746378]
We propose a systematically created human-annotated dataset consisting of coherent summaries for five publicly available datasets and natural language user feedback. Preliminary experiments with Falcon-40B and Llama-2-13B show significant performance improvements (10% Rouge-L) in terms of producing coherent summaries.
arXiv Detail & Related papers (2024-07-05T20:25:04Z)
FineSurE: Fine-grained Summarization Evaluation using LLMs [22.62504593575933]
FineSurE is a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs) It also employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment.
arXiv Detail & Related papers (2024-07-01T02:20:28Z)
LaMSUM: Creating Extractive Summaries of User Generated Content using LLMs [6.770555526416268]
Large Language Models (LLMs) have demonstrated impressive performance across a wide range of NLP tasks, including summarization. We introduce LaMSUM, a novel framework designed to generate extractive summaries from large collections of user-generated text.
arXiv Detail & Related papers (2024-06-22T10:25:55Z)
Assessing LLMs for Zero-shot Abstractive Summarization Through the Lens of Relevance Paraphrasing [37.400757839157116]
Large Language Models (LLMs) have achieved state-of-the-art performance at zero-shot generation of abstractive summaries for given articles. We propose relevance paraphrasing, a simple strategy that can be used to measure the robustness of LLMs as summarizers.
arXiv Detail & Related papers (2024-06-06T12:08:43Z)
On Context Utilization in Summarization with Large Language Models [83.84459732796302]
Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries. Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens. We conduct the first comprehensive study on context utilization and position bias in summarization.
arXiv Detail & Related papers (2023-10-16T16:45:12Z)
BooookScore: A systematic exploration of book-length summarization in the era of LLMs [53.42917858142565]
We develop an automatic metric, BooookScore, that measures the proportion of sentences in a summary that do not contain any of the identified error types. We find that closed-source LLMs such as GPT-4 and 2 produce summaries with higher BooookScore than those generated by open-source models.
arXiv Detail & Related papers (2023-10-01T20:46:44Z)
Summarization is (Almost) Dead [49.360752383801305]
We develop new datasets and conduct human evaluation experiments to evaluate the zero-shot generation capability of large language models (LLMs) Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models.
arXiv Detail & Related papers (2023-09-18T08:13:01Z)
LMGQS: A Large-scale Dataset for Query-focused Summarization [77.6179359525065]
We convert four generic summarization benchmarks into a new QFS benchmark dataset, LMGQS. We establish baselines with state-of-the-art summarization models. We achieve state-of-the-art zero-shot and supervised performance on multiple existing QFS benchmarks.
arXiv Detail & Related papers (2023-05-22T14:53:45Z)
Evaluating the Factual Consistency of Large Language Models Through News Summarization [97.04685401448499]
We propose a new benchmark called FIB(Factual Inconsistency Benchmark) that focuses on the task of summarization. For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent. For factually inconsistent summaries, we generate summaries from a suite of summarization models that we have manually annotated as factually inconsistent.
arXiv Detail & Related papers (2022-11-15T18:50:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.