SWING: Balancing Coverage and Faithfulness for Dialogue Summarization
- URL: http://arxiv.org/abs/2301.10483v1
- Date: Wed, 25 Jan 2023 09:33:11 GMT
- Title: SWING: Balancing Coverage and Faithfulness for Dialogue Summarization
- Authors: Kung-Hsiang Huang, Siffi Singh, Xiaofei Ma, Wei Xiao, Feng Nan,
Nicholas Dingwall, William Yang Wang and Kathleen McKeown
- Abstract summary: We propose to utilize natural language inference (NLI) models to improve coverage while avoiding factual inconsistencies.
We use NLI to compute fine-grained training signals that encourage the model to generate content from the reference summaries that has not yet been covered.
Experiments on the DialogSum and SAMSum datasets confirm the effectiveness of the proposed approach.
- Score: 67.76393867114923
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Missing information is a common issue of dialogue summarization where some
information in the reference summaries is not covered in the generated
summaries. To address this issue, we propose to utilize natural language
inference (NLI) models to improve coverage while avoiding the introduction of
factual inconsistencies. Specifically, we use NLI to compute fine-grained
training signals that encourage the model to generate content from the
reference summaries that has not yet been covered, as well as to distinguish
between factually
consistent and inconsistent generated sentences. Experiments on the DialogSum
and SAMSum datasets confirm the effectiveness of the proposed approach in
balancing coverage and faithfulness, validated with automatic metrics and human
evaluations. Additionally, we compute the correlation between commonly used
automatic metrics and human judgments along three dimensions regarding
coverage and factual consistency, to provide insight into the most
suitable metric for evaluating dialogue summaries.
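As a concrete illustration of the NLI signal described in the abstract, the sketch below scores each generated sentence for entailment against the dialogue (a faithfulness signal) and each reference sentence for entailment against the generated summary (a coverage signal). This is a minimal sketch, not the authors' implementation: the `roberta-large-mnli` checkpoint, the toy dialogue, and the per-sentence aggregation are all assumptions.
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Hypothetical checkpoint choice; SWING's actual NLI model may differ.
MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Return P(premise entails hypothesis) under the NLI model."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
    return torch.softmax(logits, dim=-1)[0, 2].item()

dialogue = "Amy: I baked cookies, want some? Ben: Sure, I'll drop by at 3."
generated = ["Amy baked cookies.", "Ben will drop by at 3."]
reference = ["Amy offers Ben her cookies.", "Ben agrees to come over at 3."]

# Faithfulness signal: each generated sentence should be entailed by the dialogue.
for sent in generated:
    print(f"faithful? {entailment_prob(dialogue, sent):.2f}  {sent}")

# Coverage signal: each reference sentence should be entailed by the summary.
summary = " ".join(generated)
for sent in reference:
    print(f"covered?  {entailment_prob(summary, sent):.2f}  {sent}")
```
In a training setup, such sentence-level scores could be turned into rewards or loss weights, which is the general shape of the fine-grained signal the abstract describes.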
Related papers
- Localizing Factual Inconsistencies in Attributable Text Generation [91.981439746404]
We introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation.
We first demonstrate the effectiveness of the QASemConsistency methodology for human annotation.
We then implement several methods for automatically detecting localized factual inconsistencies.
arXiv Detail & Related papers (2024-10-09T22:53:48Z)
- Using Similarity to Evaluate Factual Consistency in Summaries [2.7595794227140056]
Abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed.
We propose a new zero-shot factuality evaluation metric, Sentence-BERTScore (SBERTScore), which compares sentences between the summary and the source document.
Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries.
arXiv Detail & Related papers (2024-09-23T15:02:38Z)
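A minimal sketch of the sentence-level embedding comparison that the SBERTScore entry above describes, using the sentence-transformers library; the `all-MiniLM-L6-v2` model and the max-then-mean aggregation are assumptions, not the paper's exact recipe.
```python
from sentence_transformers import SentenceTransformer, util

# Model choice is an assumption, not necessarily the paper's.
model = SentenceTransformer("all-MiniLM-L6-v2")

source_sents = ["Amy baked cookies this morning.", "Ben said he would come over at 3."]
summary_sents = ["Amy baked cookies.", "Ben will come over at 3."]

src_emb = model.encode(source_sents, convert_to_tensor=True)
sum_emb = model.encode(summary_sents, convert_to_tensor=True)

# Cosine similarity between every summary sentence and every source sentence.
sim = util.cos_sim(sum_emb, src_emb)   # shape: (n_summary, n_source)
best = sim.max(dim=1).values           # best-matching source sentence per summary sentence
print("per-sentence:", [round(s, 3) for s in best.tolist()])
print("summary-level:", best.mean().item())
```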
- CASPR: Automated Evaluation Metric for Contrastive Summarization [4.310460539747285]
We propose an automated evaluation metric CASPR to better measure contrast between a pair of summaries.
Our results on a prior dataset CoCoTRIP demonstrate that CASPR can more reliably capture the contrastiveness of the summary pairs compared to the baselines.
arXiv Detail & Related papers (2024-04-23T23:27:29Z)
- Semi-Supervised Dialogue Abstractive Summarization via High-Quality Pseudolabel Selection [27.531083525683243]
Semi-supervised dialogue summarization (SSDS) leverages model-generated summaries to reduce reliance on human-labeled data.
We propose a novel scoring approach, SiCF, which encapsulates three primary dimensions of summarization model quality.
arXiv Detail & Related papers (2024-03-06T22:06:23Z)
- AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation [57.8363998797433]
We propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs).
Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing coherent yet factually inconsistent summaries to be generated with high error-type coverage.
arXiv Detail & Related papers (2023-11-16T02:56:29Z)
- ED-FAITH: Evaluating Dialogue Summarization on Faithfulness [35.73012379398233]
We first present a systematic study of faithfulness metrics for dialogue summarization.
We observe that most metrics correlate poorly with human judgements despite performing well on news datasets.
We propose T0-Score -- a new metric for faithfulness evaluation.
arXiv Detail & Related papers (2022-11-15T19:33:50Z)
- Evaluating the Factual Consistency of Large Language Models Through News Summarization [97.04685401448499]
We propose a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on the task of summarization.
For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent.
For factually inconsistent summaries, we generate summaries from a suite of summarization models that we have manually annotated as factually inconsistent.
arXiv Detail & Related papers (2022-11-15T18:50:34Z)
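The FIB entry above pairs a verified-consistent summary with a model-generated inconsistent one, so a natural evaluation is how often a scoring function prefers the consistent summary. The sketch below is a hypothetical illustration of that binary-choice setup; the toy lexical-overlap scorer merely stands in for a model-based score such as a length-normalized log-likelihood.
```python
import re
from typing import Callable, List, Tuple

def tokens(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())

def overlap_score(document: str, summary: str) -> float:
    """Toy scorer: fraction of summary tokens found in the document."""
    doc = set(tokens(document))
    toks = tokens(summary)
    return sum(t in doc for t in toks) / max(len(toks), 1)

def binary_choice_accuracy(pairs: List[Tuple[str, str, str]],
                           score: Callable[[str, str], float]) -> float:
    """Fraction of (document, consistent, inconsistent) triples where the
    scorer ranks the factually consistent summary higher."""
    wins = sum(score(d, good) > score(d, bad) for d, good, bad in pairs)
    return wins / len(pairs)

pairs = [("Amy baked cookies and invited Ben over at 3.",
          "Amy baked cookies and invited Ben.",       # consistent
          "Ben baked a cake and refused to visit.")]  # inconsistent
print(binary_choice_accuracy(pairs, overlap_score))   # 1.0 on this toy pair
```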
- SNaC: Coherence Error Detection for Narrative Summarization [73.48220043216087]
We introduce SNaC, a narrative coherence evaluation framework rooted in fine-grained annotations for long summaries.
We develop a taxonomy of coherence errors in generated narrative summaries and collect span-level annotations for 6.6k sentences across 150 book and movie screenplay summaries.
Our work provides the first characterization of coherence errors generated by state-of-the-art summarization models and a protocol for eliciting coherence judgments from crowd annotators.
arXiv Detail & Related papers (2022-05-19T16:01:47Z)
- CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning [5.389540975316299]
Factual inconsistencies in generated summaries severely limit the practical applications of abstractive dialogue summarization.
We provide a typology of factual errors with annotation data to highlight the types of errors and move away from a binary understanding of factuality.
We propose ConFiT, a training strategy that improves the factual consistency and overall quality of summaries via novel contrastive fine-tuning.
arXiv Detail & Related papers (2021-12-16T09:08:40Z)
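As a rough illustration of contrastive fine-tuning for factual consistency, the hinge-style loss below pushes a summarizer to score the factually consistent summary above perturbed negatives by a margin. This is a generic sketch, not ConFiT's exact objective; the margin formulation and the use of sequence log-likelihoods are assumptions.
```python
import torch
import torch.nn.functional as F

def contrastive_margin_loss(pos_logprob: torch.Tensor,
                            neg_logprobs: torch.Tensor,
                            margin: float = 1.0) -> torch.Tensor:
    """Hinge loss: the consistent summary's sequence log-likelihood should
    exceed each inconsistent summary's by at least `margin`."""
    return F.relu(margin - (pos_logprob - neg_logprobs)).mean()

# Toy stand-ins for log-likelihoods a summarization model would assign.
pos = torch.tensor(-12.3, requires_grad=True)   # reference summary
negs = torch.tensor([-11.9, -14.0, -13.1])      # perturbed negatives
loss = contrastive_margin_loss(pos, negs)
loss.backward()   # in practice the gradient flows back into the summarizer
print(loss.item())
```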
- Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries.
We argue that their scores largely cannot be interpreted as measuring information overlap.
arXiv Detail & Related papers (2020-10-23T15:55:15Z)