Related papers: Evaluate Summarization in Fine-Granularity: Auto Evaluation with LLM

URL: http://arxiv.org/abs/2412.19906v1
Date: Fri, 27 Dec 2024 19:42:25 GMT
Title: Evaluate Summarization in Fine-Granularity: Auto Evaluation with LLM
Authors: Dong Yuan, Eti Rastogi, Fen Zhao, Sagar Goyal, Gautam Naik, Sree Prasanna Rajagopal
Abstract summary: Evaluating summarization accurately and objectively presents significant challenges. Existing methods, such as ROUGE, often yield scores that have low correlation with human judgements. We introduce a novel evaluation methodology and tooling designed to address these challenges.
Score: 11.995534662701132
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Due to the exponential growth of information and the need for efficient information consumption, the task of summarization has gained paramount importance. Evaluating summarization accurately and objectively presents significant challenges, particularly when dealing with long and unstructured texts rich in content. Existing methods, such as ROUGE (Lin, 2004) and embedding similarities, often yield scores that have low correlation with human judgements and are also not intuitively understandable, making it difficult to gauge the true quality of the summaries. LLMs can mimic humans in giving subjective reviews, but subjective scores are hard to interpret and justify, and they can be easily manipulated by altering the models and the tones of the prompts. In this paper, we introduce a novel evaluation methodology and tooling designed to address these challenges, providing a more comprehensive, accurate and interpretable assessment of summarization outputs. Our method (SumAutoEval) proposes and evaluates metrics at varying granularity levels, giving objective scores on four key dimensions: completeness, correctness, alignment and readability. We empirically demonstrate that SumAutoEval enhances the understanding of output quality with better human correlation.

Related papers

Evaluation Under Imperfect Benchmarks and Ratings: A Case Study in Text Simplification [13.381644813030725]
We introduce a synthetic benchmark for text simplification featuring simplified sentences generated by models of varying sizes. We show that human ratings on our benchmark exhibit high inter-annotator agreement and reflect the expected trend. We also show that auto-evaluation with a panel of LLM judges (LLMs-as-a-jury) often suffices to obtain consistent ratings for the evaluation of text simplification.
arXiv Detail & Related papers (2025-04-13T01:36:47Z)
What's Wrong? Refining Meeting Summaries with LLM Feedback [6.532478490187084]
We introduce a multi-LLM correction approach for meeting summarization using a two-phase process that mimics the human review process. We release QMSum Mistake, a dataset of 200 automatically generated meeting summaries annotated by humans on nine error types. We transform identified mistakes into actionable feedback to improve the quality of a given summary measured by relevance, informativeness, conciseness, and coherence.
arXiv Detail & Related papers (2024-07-16T17:10:16Z)
FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE). FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z)
Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z)
Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback [57.816210168909286]
We leverage recent progress on textual entailment models to address this problem for abstractive summarization systems. We use reinforcement learning with reference-free, textual entailment rewards to optimize for factual consistency. Our results, according to both automatic metrics and human evaluation, show that our method considerably improves the faithfulness, salience, and conciseness of the generated summaries.
arXiv Detail & Related papers (2023-05-31T21:04:04Z)
Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks [45.550554287918885]
This paper focuses on evaluating the usefulness of text summaries with extrinsic methods. We design three different downstream tasks for extrinsic human evaluation of summaries, i.e., question answering, text classification and text similarity assessment. We find summaries are particularly useful in tasks that rely on an overall judgment of the text, while being less effective for question answering tasks.
arXiv Detail & Related papers (2023-05-24T11:34:39Z)
Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal. Most automatic evaluation methods, like BLEU/ROUGE, may not be able to adequately capture the above dimensions. We propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
Human-in-the-loop Abstractive Dialogue Summarization [61.4108097664697]
We propose to incorporate different levels of human feedback into the training process. This will enable us to guide the models to capture the behaviors humans care about for summaries.
arXiv Detail & Related papers (2022-12-19T19:11:27Z)
Improving Faithfulness of Abstractive Summarization by Controlling Confounding Effect of Irrelevant Sentences [38.919090721583075]
We show that factual inconsistency can be caused by irrelevant parts of the input text, which act as confounders. We design a simple multi-task model to control such confounding by leveraging human-annotated relevant sentences when available. Our approach improves faithfulness scores by 20% over strong baselines on the AnswerSumm dataset (Fabbri et al., 2021).
arXiv Detail & Related papers (2022-12-19T18:51:06Z)
Improving Factual Consistency of Abstractive Summarization via Question Answering [25.725873545789046]
We present an approach to address factual consistency in summarization. We first propose an efficient automatic evaluation metric to measure factual consistency. We then propose a novel learning algorithm that maximizes the proposed metric during model training.
arXiv Detail & Related papers (2021-05-10T19:07:21Z)
Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries. We argue that their scores largely cannot be interpreted as measuring information overlap.
arXiv Detail & Related papers (2020-10-23T15:55:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.