HaRiM$^+$: Evaluating Summary Quality with Hallucination Risk
- URL: http://arxiv.org/abs/2211.12118v1
- Date: Tue, 22 Nov 2022 09:36:41 GMT
- Title: HaRiM$^+$: Evaluating Summary Quality with Hallucination Risk
- Authors: Seonil Son, Junsoo Park, Jeong-in Hwang, Junghwa Lee, Hyungjong Noh,
Yeonsoo Lee
- Abstract summary: We propose a reference-free metric, HaRiM+, which only requires an off-the-shelf summarization model to compute the hallucination risk based on token likelihoods.
For summary-quality estimation, HaRiM+ records state-of-the-art correlation with human judgment on three summary-quality annotation sets: FRANK, QAGS, and SummEval.
- Score: 0.6617666829632144
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the challenges in developing a summarization model is the
difficulty of measuring the factual inconsistency of the generated text. In this
study, we reinterpret the decoder overconfidence-regularizing objective suggested
in Miao et al. (2021) as a hallucination risk measurement to better estimate the
quality of generated summaries. We propose a reference-free metric, HaRiM+, which
only requires an off-the-shelf summarization model to compute the hallucination
risk from token likelihoods. Deploying it requires no additional model training
and no ad-hoc modules, which usually need alignment to human judgments. For
summary-quality estimation, HaRiM+ records state-of-the-art correlation with
human judgment on three summary-quality annotation sets: FRANK, QAGS, and
SummEval. We hope that our work, which demonstrates the merit of reusing
summarization models as evaluators, facilitates progress in both automated
evaluation and summary generation.
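To make the idea concrete, below is a minimal sketch of how token likelihoods from an off-the-shelf summarization model could be turned into a hallucination-risk style score. It is an illustration, not the exact HaRiM+ formula from the paper: the checkpoint name (facebook/bart-large-cnn), the use of an empty source as a crude stand-in for a source-free language-model term, and the final (1 - p_s2s) * (1 - margin) aggregation are all assumptions made for this sketch.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "facebook/bart-large-cnn"  # assumed off-the-shelf summarizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).eval()


@torch.no_grad()
def token_probs(source: str, summary: str) -> torch.Tensor:
    """Per-token probability the model assigns to `summary` given `source`."""
    enc = tokenizer(source, return_tensors="pt", truncation=True)
    labels = tokenizer(summary, return_tensors="pt", truncation=True).input_ids
    logits = model(**enc, labels=labels).logits          # (1, T, vocab)
    probs = logits.softmax(dim=-1)[0]                    # (T, vocab)
    return probs.gather(-1, labels[0].unsqueeze(-1)).squeeze(-1)  # (T,)


def hallucination_risk(source: str, summary: str) -> float:
    """Illustrative overconfidence-style risk; NOT the exact HaRiM+ formula."""
    p_s2s = token_probs(source, summary)   # conditioned on the article
    p_lm = token_probs("", summary)        # crude source-free proxy term (assumption)
    margin = p_s2s - p_lm                  # cf. the margin in Miao et al. (2021)
    risk = (1.0 - p_s2s) * (1.0 - margin)  # large when the source adds little support
    return risk.mean().item()


article = "The city council approved the new budget on Monday after a brief debate."
print(hallucination_risk(article, "The council approved the budget on Monday."))  # faithful
print(hallucination_risk(article, "The mayor vetoed the budget on Friday."))      # hallucinated
```

The paper's actual token-level definition and normalization should be consulted before using such a score to compare systems; this sketch only shows that no training or ad-hoc module is needed beyond the summarizer itself.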
Related papers
- What's Wrong? Refining Meeting Summaries with LLM Feedback [6.532478490187084]
We introduce a multi-LLM correction approach for meeting summarization using a two-phase process that mimics the human review process.
We release QMSum Mistake, a dataset of 200 automatically generated meeting summaries annotated by humans on nine error types.
We transform identified mistakes into actionable feedback to improve the quality of a given summary measured by relevance, informativeness, conciseness, and coherence.
arXiv Detail & Related papers (2024-07-16T17:10:16Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation [57.8363998797433]
We propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs).
Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing for coherent factually inconsistent summaries to be generated with high error-type coverage.
arXiv Detail & Related papers (2023-11-16T02:56:29Z)
- How Ready are Pre-trained Abstractive Models and LLMs for Legal Case Judgement Summarization? [4.721618284417204]
In recent years, abstractive summarization models have been gaining popularity.
Legal domain-specific pre-trained abstractive summarization models are now available.
General-domain pre-trained Large Language Models (LLMs) are known to generate high-quality text.
arXiv Detail & Related papers (2023-06-02T03:16:19Z)
- Towards Improving Faithfulness in Abstractive Summarization [37.19777407790153]
We propose a Faithfulness Enhanced Summarization model (FES) to improve fidelity in abstractive summarization.
Our model outperforms strong baselines in experiments on CNN/DM and XSum.
arXiv Detail & Related papers (2022-10-04T19:52:09Z)
- SNaC: Coherence Error Detection for Narrative Summarization [73.48220043216087]
We introduce SNaC, a narrative coherence evaluation framework rooted in fine-grained annotations for long summaries.
We develop a taxonomy of coherence errors in generated narrative summaries and collect span-level annotations for 6.6k sentences across 150 book and movie screenplay summaries.
Our work provides the first characterization of coherence errors generated by state-of-the-art summarization models and a protocol for eliciting coherence judgments from crowd annotators.
arXiv Detail & Related papers (2022-05-19T16:01:47Z)
- Multi-Fact Correction in Abstractive Text Summarization [98.27031108197944]
Span-Fact is a suite of two factual correction models that leverages knowledge learned from question answering models to make corrections in system-generated summaries via span selection.
Our models employ single or multi-masking strategies to either iteratively or auto-regressively replace entities in order to ensure semantic consistency w.r.t. the source text.
Experiments show that our models significantly boost the factual consistency of system-generated summaries without sacrificing summary quality in terms of both automatic metrics and human evaluation.
arXiv Detail & Related papers (2020-10-06T02:51:02Z)
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning [66.30909748400023]
We propose to evaluate summary quality without reference summaries via unsupervised contrastive learning.
Specifically, we design a new metric which covers both linguistic qualities and semantic informativeness based on BERT.
Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
arXiv Detail & Related papers (2020-10-05T05:04:14Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
- On Faithfulness and Factuality in Abstractive Summarization [17.261247316769484]
We analyze the limitations of neural text generation models for abstractive document summarization.
We find that these models are highly prone to hallucinating content that is unfaithful to the input document.
We show that textual entailment measures correlate better with faithfulness than standard metrics; a minimal entailment-scoring sketch follows this list.
arXiv Detail & Related papers (2020-05-02T00:09:16Z)
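The last entry's finding, that textual entailment aligns better with faithfulness than standard metrics, is straightforward to probe with an off-the-shelf NLI model. The sketch below scores a summary sentence by the probability that the source entails it; the checkpoint (roberta-large-mnli) and the use of the raw entailment probability as the faithfulness score are assumptions of this illustration, not the protocol of any of the papers above.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_MODEL = "roberta-large-mnli"  # assumed off-the-shelf NLI checkpoint

tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL).eval()


@torch.no_grad()
def entailment_score(source: str, summary_sentence: str) -> float:
    """Probability that `source` entails `summary_sentence`."""
    enc = tokenizer(source, summary_sentence, return_tensors="pt", truncation=True)
    probs = model(**enc).logits.softmax(dim=-1)[0]
    entail_idx = model.config.label2id["ENTAILMENT"]  # index 2 for this checkpoint
    return probs[entail_idx].item()


doc = "The city council approved the new budget on Monday after a brief debate."
print(entailment_score(doc, "The budget was approved on Monday."))  # faithful
print(entailment_score(doc, "The budget was rejected."))            # unfaithful
```

A document-level score could, for example, average these sentence-level entailment probabilities over the whole summary; that aggregation choice is again an assumption, not something prescribed by the papers above.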