Less is More for Long Document Summary Evaluation by LLMs
- URL: http://arxiv.org/abs/2309.07382v2
- Date: Thu, 18 Jan 2024 18:23:37 GMT
- Title: Less is More for Long Document Summary Evaluation by LLMs
- Authors: Yunshu Wu, Hayate Iso, Pouya Pezeshkpour, Nikita Bhutani, Estevam Hruschka
- Abstract summary: This paper introduces a novel approach, Extract-then-Evaluate, which involves extracting key sentences from a long source document and then evaluating the summary by prompting LLMs.
The results reveal that the proposed method not only significantly reduces evaluation costs but also exhibits a higher correlation with human evaluations.
- Score: 8.329113698912572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have shown promising performance in summary
evaluation tasks, yet they face challenges such as high computational costs and
the Lost-in-the-Middle problem where important information in the middle of
long documents is often overlooked. To address these issues, this paper
introduces a novel approach, Extract-then-Evaluate, which involves extracting
key sentences from a long source document and then evaluating the summary by
prompting LLMs. The results reveal that the proposed method not only
significantly reduces evaluation costs but also exhibits a higher correlation
with human evaluations. Furthermore, we provide practical recommendations for
optimal document length and sentence extraction methods, contributing to the
development of cost-effective yet more accurate methods for LLM-based text
generation evaluation.
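The abstract describes a two-step pipeline: extract key sentences from the long source document, then prompt an LLM with the extracted text and the candidate summary to obtain a score. Below is a minimal sketch of that idea, assuming an OpenAI-style chat API; the LEAD-style extractor, word budget, prompt wording, and model name are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal sketch of the Extract-then-Evaluate idea from the abstract.
# The extraction heuristic (leading sentences up to a word budget), the prompt
# wording, and the OpenAI client usage are illustrative assumptions, not the
# paper's exact implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def extract_key_sentences(document: str, max_words: int = 1024) -> str:
    """Greedily keep leading sentences until the word budget is reached.
    The paper also considers other extractors; this LEAD-style variant is
    just the simplest stand-in."""
    selected, used = [], 0
    for sentence in document.split(". "):
        words = len(sentence.split())
        if used + words > max_words:
            break
        selected.append(sentence)
        used += words
    return ". ".join(selected)


def evaluate_summary(document: str, summary: str, model: str = "gpt-4") -> str:
    """Score a summary by prompting an LLM with the extracted source text."""
    source = extract_key_sentences(document)
    prompt = (
        "Source document (extracted key sentences):\n"
        f"{source}\n\n"
        f"Summary:\n{summary}\n\n"
        "Rate the summary's consistency with the source on a 1-5 scale "
        "and briefly justify the score."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```

The point of the extraction step, per the abstract, is that a shorter, well-chosen extract both cuts evaluation cost and sidesteps the Lost-in-the-Middle problem, so the extractor and the length budget are the components worth tuning.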
Related papers
- RevisEval: Improving LLM-as-a-Judge via Response-Adapted References [95.29800580588592]
RevisEval is a novel text generation evaluation paradigm that uses response-adapted references.
RevisEval is driven by the key observation that an ideal reference should maintain the necessary relevance to the response to be evaluated.
arXiv Detail & Related papers (2024-10-07T16:50:47Z)
- Scaling Up Summarization: Leveraging Large Language Models for Long Text Extractive Summarization [0.27624021966289597]
This paper introduces EYEGLAXS, a framework that leverages Large Language Models (LLMs) for extractive summarization.
EYEGLAXS focuses on extractive summarization to ensure factual and grammatical integrity.
The system sets new performance benchmarks on well-known datasets like PubMed and ArXiv.
arXiv Detail & Related papers (2024-08-28T13:52:19Z)
- $T^5Score$: A Methodology for Automatically Assessing the Quality of LLM Generated Multi-Document Topic Sets [16.516381474175986]
We introduce $T^5Score$, an evaluation methodology that decomposes the quality of a topic into quantifiable aspects.
This framing enables a convenient, manual or automatic, evaluation procedure resulting in a strong inter-annotator agreement score.
arXiv Detail & Related papers (2024-07-24T16:14:15Z)
- Cutting Through the Clutter: The Potential of LLMs for Efficient Filtration in Systematic Literature Reviews [7.355182982314533]
We evaluate Large Language Models (LLMs) for enhancing efficiency and accuracy in literature filtration.
The open-source tool LLMSurver provides a visual interface for using LLMs in literature filtration.
Findings show that recent LLM models can reduce filtering time from weeks to minutes.
arXiv Detail & Related papers (2024-07-15T12:13:53Z)
- A Comparative Study of Quality Evaluation Methods for Text Summarization [0.5512295869673147]
This paper proposes a novel method based on large language models (LLMs) for evaluating text summarization.
Our results show that LLM evaluation aligns closely with human evaluation, while widely used automatic metrics such as ROUGE-2, BERTScore, and SummaC do not and also lack consistency.
arXiv Detail & Related papers (2024-06-30T16:12:37Z)
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
How reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Model (LLM) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback [65.84061725174269]
Recent large language models (LLMs) leverage human feedback to improve their generation quality.
We propose LLMRefine, an inference-time optimization method to refine LLM outputs.
We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization.
LLMRefine consistently outperforms all baseline approaches, achieving improvements of up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, and 2.2 ROUGE-L on topical summarization.
arXiv Detail & Related papers (2023-11-15T19:52:11Z)
- ALLURE: Auditing and Improving LLM-based Evaluation of Text using Iterative In-Context-Learning [7.457517083017178]
Large language models (LLMs) are used to evaluate text generated by humans and AI alike.
Despite their utility, LLMs exhibit distinct failure modes, necessitating a thorough audit and improvement of their text evaluation capabilities.
Here we introduce ALLURE, a systematic approach to Auditing Large Language Models Understanding and Reasoning Errors.
arXiv Detail & Related papers (2023-09-24T17:15:58Z)
- Summarization is (Almost) Dead [49.360752383801305]
We develop new datasets and conduct human evaluation experiments to evaluate the zero-shot generation capability of large language models (LLMs).
Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models.
arXiv Detail & Related papers (2023-09-18T08:13:01Z)
- PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents [78.27865456183397]
We propose PEARL, a prompting framework to improve reasoning over long documents.
Each stage of PEARL is implemented via zero-shot or few-shot prompting with minimal human input.
We evaluate PEARL on a challenging subset of the QuALITY dataset, which contains questions that require complex reasoning over long narrative texts.
arXiv Detail & Related papers (2023-05-23T23:06:04Z)
- Improving Language Models via Plug-and-Play Retrieval Feedback [42.786225163763376]
Large language models (LLMs) exhibit remarkable performance across various NLP tasks.
However, they often generate incorrect or hallucinated information, which hinders their practical applicability in real-world scenarios.
We introduce ReFeed, a novel pipeline designed to enhance LLMs by providing automatic retrieval feedback in a plug-and-play framework.
arXiv Detail & Related papers (2023-05-23T12:29:44Z)