LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form
Summarization
- URL: http://arxiv.org/abs/2301.13298v1
- Date: Mon, 30 Jan 2023 21:31:48 GMT
- Title: LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form
Summarization
- Authors: Kalpesh Krishna, Erin Bransom, Bailey Kuehl, Mohit Iyyer, Pradeep
Dasigi, Arman Cohan, Kyle Lo
- Abstract summary: LongEval is a set of guidelines for human evaluation of faithfulness in long-form summaries.
We deploy LongEval in annotation studies on two long-form summarization datasets in different domains.
- Score: 48.02158981582502
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While human evaluation remains best practice for accurately judging the
faithfulness of automatically-generated summaries, few solutions exist to
address the increased difficulty and workload when evaluating long-form
summaries. Through a survey of 162 papers on long-form summarization, we first
shed light on current human evaluation practices surrounding long-form
summaries. We find that 73% of these papers do not perform any human evaluation
on model-generated summaries, while other works face new difficulties that
manifest when dealing with long documents (e.g., low inter-annotator
agreement). Motivated by our survey, we present LongEval, a set of guidelines
for human evaluation of faithfulness in long-form summaries that addresses the
following challenges: (1) How can we achieve high inter-annotator agreement on
faithfulness scores? (2) How can we minimize annotator workload while
maintaining accurate faithfulness scores? and (3) Do humans benefit from
automated alignment between summary and source snippets? We deploy LongEval in
annotation studies on two long-form summarization datasets in different domains
(SQuALITY and PubMed), and we find that switching to a finer granularity of
judgment (e.g., clause-level) reduces inter-annotator variance in faithfulness
scores (e.g., std-dev from 18.5 to 6.8). We also show that scores from a
partial annotation of fine-grained units correlate highly with scores from a
full annotation workload (0.89 Kendall's tau using 50% judgments). We release
our human judgments, annotation templates, and our software as a Python library
for future research.
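As a rough illustration of the partial-annotation finding above, the sketch below scores each summary from a random 50% sample of clause-level faithfulness judgments and compares the result against scores from the full set of judgments using Kendall's tau. The data layout and sampling scheme are assumptions for illustration only; this is not the released LongEval library.

```python
import random
from scipy.stats import kendalltau

# Hypothetical data: for each summary, a list of binary clause-level
# faithfulness judgments (1 = faithful, 0 = unfaithful).
judgments = {
    "summary_01": [1, 1, 0, 1, 1, 1, 0, 1],
    "summary_02": [1, 0, 0, 1, 1, 0, 1, 1],
    "summary_03": [1, 1, 1, 1, 0, 1, 1, 1],
    "summary_04": [0, 1, 1, 0, 1, 1, 0, 0],
}

def faithfulness_score(clause_labels):
    """Percentage of clauses judged faithful."""
    return 100.0 * sum(clause_labels) / len(clause_labels)

def sampled_score(clause_labels, fraction=0.5, seed=0):
    """Score a summary from a random subset of its clause judgments."""
    rng = random.Random(seed)
    k = max(1, int(len(clause_labels) * fraction))
    return faithfulness_score(rng.sample(clause_labels, k))

full = [faithfulness_score(v) for v in judgments.values()]
partial = [sampled_score(v) for v in judgments.values()]

tau, p_value = kendalltau(full, partial)
print(f"Kendall's tau between full and 50% annotation: {tau:.2f}")
```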
Related papers
- On Positional Bias of Faithfulness for Long-form Summarization [83.63283027830657]
Large Language Models (LLMs) often exhibit positional bias in long-context settings, under-attending to information in the middle of inputs.
We investigate the presence of this bias in long-form summarization, its impact on faithfulness, and various techniques to mitigate this bias.
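One way such a bias might be probed (a hedged sketch, not the paper's protocol): bin claim-level faithfulness judgments by where the supporting evidence falls in the source document and compare per-bin faithfulness rates. The record format below is assumed for illustration.

```python
from collections import defaultdict

# Hypothetical claim-level records: relative position (0.0-1.0) of the
# supporting source span, and whether the claim was judged faithful.
records = [
    {"source_position": 0.05, "faithful": True},
    {"source_position": 0.48, "faithful": False},
    {"source_position": 0.52, "faithful": False},
    {"source_position": 0.93, "faithful": True},
]

def faithfulness_by_position(records, num_bins=3):
    """Mean faithfulness of claims grouped by where their evidence sits in the source."""
    bins = defaultdict(list)
    for r in records:
        idx = min(int(r["source_position"] * num_bins), num_bins - 1)
        bins[idx].append(1.0 if r["faithful"] else 0.0)
    return {i: sum(v) / len(v) for i, v in sorted(bins.items())}

# A dip in the middle bin would be consistent with "lost in the middle" behavior.
print(faithfulness_by_position(records))
```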
arXiv Detail & Related papers (2024-10-31T03:50:15Z)
- STORYSUMM: Evaluating Faithfulness in Story Summarization [31.94902013480574]
We introduce a new dataset, STORYSUMM, comprising short stories with localized faithfulness labels and error explanations.
This benchmark is for evaluation methods, testing whether a given method can detect challenging inconsistencies.
arXiv Detail & Related papers (2024-07-09T02:06:30Z)
- FABLES: Evaluating faithfulness and content selection in book-length summarization [55.50680057160788]
In this paper, we conduct the first large-scale human evaluation of faithfulness and content selection on book-length documents.
We collect FABLES, a dataset of annotations on 3,158 claims made in LLM-generated summaries of 26 books, at a cost of $5.2K USD.
An analysis of the annotations reveals that most unfaithful claims relate to events and character states, and they generally require indirect reasoning over the narrative to invalidate.
arXiv Detail & Related papers (2024-04-01T17:33:38Z)
- Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontiguous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
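The paper finetunes a model to produce the LSS; as a purely lexical stand-in under that simplifying assumption, the sketch below recovers the longest in-order (possibly noncontiguous) sequence of claim tokens that also appears in the context, via standard longest-common-subsequence dynamic programming.

```python
def longest_supported_subsequence(claim: str, context: str) -> list[str]:
    """Longest in-order subsequence of claim tokens that also occurs in the context.

    A lexical approximation of LSS: the paper uses a finetuned model rather
    than exact token matching.
    """
    a, b = claim.split(), context.split()
    # Classic LCS dynamic-programming table.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack to recover the subsequence itself.
    i, j, out = len(a), len(b), []
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

claim = "the study enrolled 120 patients over two years"
context = "a total of 120 patients were enrolled in the study over two years"
lss = longest_supported_subsequence(claim, context)
print(lss, f"({len(lss)}/{len(claim.split())} claim tokens supported)")
```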
arXiv Detail & Related papers (2023-08-23T14:18:44Z)
- Hybrid Long Document Summarization using C2F-FAR and ChatGPT: A Practical Study [1.933681537640272]
ChatGPT is the latest breakthrough in the field of large language models (LLMs).
We propose a hybrid extraction and summarization pipeline for long documents such as business articles and books.
Our results show that the use of ChatGPT is a very promising but not yet mature approach for summarizing long documents.
arXiv Detail & Related papers (2023-06-01T21:58:33Z)
- How Far are We from Robust Long Abstractive Summarization? [39.34743996451813]
We evaluate long document abstractive summarization systems (i.e., models and metrics) with the aim of implementing them to generate reliable summaries.
For long document evaluation metrics, human evaluation results show that ROUGE remains the best at evaluating the relevancy of a summary.
We release our annotated long document dataset with the hope that it can contribute to the development of metrics across a broader range of summarization settings.
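For reference, ROUGE scores of the kind discussed here can be computed with the open-source rouge-score package; this is one common implementation, not necessarily the one used in the paper.

```python
from rouge_score import rouge_scorer

reference = "The drug reduced symptoms in most patients after eight weeks."
candidate = "Most patients saw reduced symptoms after eight weeks of the drug."

# ROUGE-1/2 measure n-gram overlap; ROUGE-L measures longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, score in scores.items():
    print(f"{name}: precision={score.precision:.2f} "
          f"recall={score.recall:.2f} f1={score.fmeasure:.2f}")
```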
arXiv Detail & Related papers (2022-10-30T03:19:50Z)
- SNaC: Coherence Error Detection for Narrative Summarization [73.48220043216087]
We introduce SNaC, a narrative coherence evaluation framework rooted in fine-grained annotations for long summaries.
We develop a taxonomy of coherence errors in generated narrative summaries and collect span-level annotations for 6.6k sentences across 150 book and movie screenplay summaries.
Our work provides the first characterization of coherence errors generated by state-of-the-art summarization models and a protocol for eliciting coherence judgments from crowd annotators.
arXiv Detail & Related papers (2022-05-19T16:01:47Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
- FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization [34.2456005415483]
We tackle the problem of evaluating faithfulness of a generated summary given its source document.
We find that current models exhibit a trade-off between abstractiveness and faithfulness.
We propose an automatic question answering (QA) based metric for faithfulness.
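A minimal sketch of the QA-based idea, under stated assumptions: questions are produced by a placeholder generator (FEQA uses a trained question-generation model), answered from both the summary and the source with an off-the-shelf extractive QA pipeline, and scored by average token-level F1 between the two answers.

```python
from collections import Counter
from transformers import pipeline

# Off-the-shelf extractive QA model; FEQA's actual QG/QA components differ.
qa = pipeline("question-answering")

def generate_questions(summary: str) -> list[str]:
    """Placeholder: FEQA derives questions from summary spans with a trained
    question-generation model. Hard-coded here purely for illustration."""
    return ["How many patients were enrolled?", "How long did the study run?"]

def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two answer strings (SQuAD-style)."""
    ta, tb = a.lower().split(), b.lower().split()
    common = sum((Counter(ta) & Counter(tb)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(ta), common / len(tb)
    return 2 * precision * recall / (precision + recall)

def qa_faithfulness(summary: str, source: str) -> float:
    """Average agreement between answers grounded in the summary vs. the source."""
    scores = []
    for question in generate_questions(summary):
        ans_summary = qa(question=question, context=summary)["answer"]
        ans_source = qa(question=question, context=source)["answer"]
        scores.append(token_f1(ans_summary, ans_source))
    return sum(scores) / len(scores)

summary = "The trial enrolled 120 patients and ran for two years."
source = "A total of 120 patients were enrolled; the study ran for roughly two years."
print(f"QA-based faithfulness: {qa_faithfulness(summary, source):.2f}")
```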
arXiv Detail & Related papers (2020-05-07T21:00:08Z)