ED-FAITH: Evaluating Dialogue Summarization on Faithfulness
- URL: http://arxiv.org/abs/2211.08464v1
- Date: Tue, 15 Nov 2022 19:33:50 GMT
- Title: ED-FAITH: Evaluating Dialogue Summarization on Faithfulness
- Authors: Sicong Huang, Asli Celikyilmaz, Haoran Li
- Abstract summary: We first present a systematic study of faithfulness metrics for dialogue summarization.
We observe that most metrics correlate poorly with human judgements despite performing well on news datasets.
We propose T0-Score -- a new metric for faithfulness evaluation.
- Score: 35.73012379398233
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Abstractive summarization models typically generate content unfaithful to the
input, thus highlighting the significance of evaluating the faithfulness of
generated summaries. Most faithfulness metrics are only evaluated on the news
domain; can they be transferred to other summarization tasks? In this work, we
first present a systematic study of faithfulness metrics for dialogue
summarization. We evaluate common faithfulness metrics on dialogue datasets and
observe that most metrics correlate poorly with human judgements despite
performing well on news datasets. Given these findings, to improve existing
metrics' performance on dialogue summarization, we first finetune them on an
in-domain dataset and then apply unlikelihood training on negative samples, and
show that both steps successfully improve metric performance on dialogue data.
Inspired by
the strong zero-shot performance of the T0 language model, we further propose
T0-Score -- a new metric for faithfulness evaluation, which shows consistent
improvement against baseline metrics across multiple domains.
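As a concrete illustration of the idea behind such a metric, below is a minimal sketch of a T0-based zero-shot faithfulness score. It assumes the score compares how strongly the model prefers answering "Yes" over "No" when asked whether a summary is consistent with its source dialogue; the checkpoint name, prompt wording, and scoring rule are illustrative assumptions, not the exact T0-Score formulation from the paper.

```python
# Hypothetical sketch of a T0-style zero-shot faithfulness score.
# Assumption: the score is log P("Yes") - log P("No") for a consistency question;
# the actual T0-Score formulation in the paper may differ.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "bigscience/T0_3B"  # any T0 checkpoint works for this sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
model.eval()

def answer_nll(prompt: str, answer: str) -> float:
    """Mean negative log-likelihood of `answer` given `prompt` under the seq2seq LM."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    labels = tokenizer(answer, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(**inputs, labels=labels)
    return out.loss.item()

def t0_style_faithfulness(dialogue: str, summary: str) -> float:
    """Higher values mean the model finds the summary more consistent with the dialogue."""
    prompt = (
        f"Dialogue:\n{dialogue}\n\n"
        f"Summary:\n{summary}\n\n"
        "Is the summary consistent with the dialogue? Answer Yes or No."
    )
    # Score = log P("Yes") - log P("No"); positive values favour "Yes".
    return answer_nll(prompt, "No") - answer_nll(prompt, "Yes")
```

A more positive value means the model leans toward "Yes"; in practice one would batch the forward passes and might average over several prompt templates.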
Related papers
- A Critical Look at Meta-evaluating Summarisation Evaluation Metrics [11.541368732416506]
We argue that the time is ripe to build more diverse benchmarks that enable the development of more robust evaluation metrics.
We call for research focusing on user-centric quality dimensions that consider the generated summary's communicative goal.
arXiv Detail & Related papers (2024-09-29T01:30:13Z)
- Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluating the faithfulness of machine-generated text by computing the longest non-contiguous subsequence of the claim that is supported by the context (a simplified lexical sketch of this idea appears after this list).
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z)
- SWING: Balancing Coverage and Faithfulness for Dialogue Summarization [67.76393867114923]
We propose to utilize natural language inference (NLI) models to improve coverage while avoiding factual inconsistencies.
We use NLI to compute fine-grained training signals that encourage the model to generate content from the reference summaries that has not yet been covered.
Experiments on the DialogSum and SAMSum datasets confirm the effectiveness of the proposed approach.
arXiv Detail & Related papers (2023-01-25T09:33:11Z)
- Evaluating and Improving Factuality in Multimodal Abstractive Summarization [91.46015013816083]
We propose CLIPBERTScore, a combination of CLIPScore and BERTScore, to leverage their robustness and strong factuality detection performance on image-summary and document-summary pairs, respectively.
We show that this simple combination of the two metrics in the zero-shot setting achieves higher correlations than existing factuality metrics for document summarization.
Our analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks.
arXiv Detail & Related papers (2022-11-04T16:50:40Z)
- Analyzing and Evaluating Faithfulness in Dialogue Summarization [67.07947198421421]
We first perform a fine-grained human analysis of the faithfulness of dialogue summaries and observe that over 35% of generated summaries are factually inconsistent with respect to the source dialogues.
We present a new model-level faithfulness evaluation method. It examines generation models with multi-choice questions created by rule-based transformations.
arXiv Detail & Related papers (2022-10-21T07:22:43Z)
- TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets, we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning [66.30909748400023]
We propose to evaluate summary quality without reference summaries via unsupervised contrastive learning.
Specifically, we design a new metric which covers both linguistic qualities and semantic informativeness based on BERT.
Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
arXiv Detail & Related papers (2020-10-05T05:04:14Z)
- SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling [25.299937353444854]
We present a proof-of-concept study of a weakly supervised summary evaluation approach that does not require reference summaries.
Massive amounts of data in existing summarization datasets are transformed into training data by pairing documents with corrupted reference summaries.
arXiv Detail & Related papers (2020-05-13T15:40:13Z)
- Learning by Semantic Similarity Makes Abstractive Summarization Better [13.324006587838522]
We compare summaries generated by a recent language model, BART, with the reference summaries from a benchmark dataset, CNN/DM.
Interestingly, model-generated summaries receive higher scores relative to reference summaries.
arXiv Detail & Related papers (2020-02-18T17:59:02Z)
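The sketch referenced in the Longest Supported Subsequence entry above: a purely lexical stand-in that scores a claim by the fraction of its tokens covered by the longest (possibly non-contiguous) common word subsequence shared with the context. The actual LSS metric finetunes a model on human-annotated data to generate the supported subsequence, so this is only an illustrative proxy.

```python
# Illustrative proxy for an LSS-style faithfulness score.
# Assumption: word-level longest common subsequence as a crude stand-in for the
# model-generated Longest Supported Subsequence used in the paper.
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            dp[i + 1][j + 1] = (
                dp[i][j] + 1 if tok_a == tok_b else max(dp[i][j + 1], dp[i + 1][j])
            )
    return dp[len(a)][len(b)]

def lss_proxy_score(claim: str, context: str) -> float:
    """Share of claim tokens that appear, in order, somewhere in the context."""
    claim_toks = claim.lower().split()
    context_toks = context.lower().split()
    if not claim_toks:
        return 0.0
    return lcs_length(claim_toks, context_toks) / len(claim_toks)

# Example: a partially supported claim scores between 0 and 1 (here 0.5).
print(lss_proxy_score("Alice will meet Bob at noon",
                      "Alice said she would meet Bob tomorrow"))
```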