Evaluating and Improving Factuality in Multimodal Abstractive
Summarization
- URL: http://arxiv.org/abs/2211.02580v1
- Date: Fri, 4 Nov 2022 16:50:40 GMT
- Title: Evaluating and Improving Factuality in Multimodal Abstractive
Summarization
- Authors: David Wan and Mohit Bansal
- Abstract summary: We propose CLIPBERTScore, a simple weighted combination of CLIPScore and BERTScore, to leverage their robustness and strong factuality detection performance on image-summary and document-summary pairs, respectively.
We show that this simple combination of two metrics in the zero-shot setting achieves higher correlations than existing factuality metrics for document summarization.
Our analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks.
- Score: 91.46015013816083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current metrics for evaluating factuality for abstractive document
summarization have achieved high correlations with human judgment, but they do
not account for the vision modality and thus are not adequate for
vision-and-language summarization. We propose CLIPBERTScore, a simple weighted
combination of CLIPScore and BERTScore to leverage the robustness and strong
factuality detection performance between image-summary and document-summary,
respectively. Next, due to the lack of meta-evaluation benchmarks to evaluate
the quality of multimodal factuality metrics, we collect human judgments of
factuality with respect to documents and images. We show that this simple
combination of two metrics in the zero-shot setting achieves higher
correlations than existing factuality metrics for document summarization,
outperforms an existing multimodal summarization metric, and performs
competitively with strong multimodal factuality metrics specifically fine-tuned
for the task. Our thorough analysis demonstrates the robustness and high
correlation of CLIPBERTScore and its components on four factuality
metric-evaluation benchmarks. Finally, we demonstrate two practical downstream
applications of our CLIPBERTScore metric: for selecting important images to
focus on during training, and as a reward for reinforcement learning to improve
factuality of multimodal summary generation w.r.t. automatic and human
evaluation. Our data and code are publicly available at
https://github.com/meetdavidwan/faithful-multimodal-summ
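For reference, the abstract describes CLIPBERTScore as a simple weighted combination of CLIPScore (image-summary) and BERTScore (document-summary). Below is a minimal sketch of that combination, not the authors' reference implementation (see the linked repository for that): the weight `alpha`, the CLIP checkpoint, and the omission of CLIPScore's usual 2.5 rescaling factor are assumptions made for illustration.
```python
# Minimal sketch of a weighted CLIPScore + BERTScore combination.
# Illustrative only; see https://github.com/meetdavidwan/faithful-multimodal-summ
# for the authors' implementation. The weight `alpha` and the CLIP checkpoint
# are assumptions, and CLIPScore's standard 2.5 rescaling factor is omitted.
import torch
from PIL import Image
from bert_score import score as bertscore
from transformers import CLIPModel, CLIPProcessor

_CLIP = "openai/clip-vit-base-patch32"  # assumed checkpoint

def clip_score(image: Image.Image, summary: str, model, processor) -> float:
    """Cosine similarity between CLIP embeddings of the image and the summary."""
    inputs = processor(text=[summary], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return max(float((img * txt).sum()), 0.0)

def clipbertscore(document: str, image: Image.Image, summary: str,
                  alpha: float = 0.5) -> float:
    """alpha * CLIPScore(image, summary) + (1 - alpha) * BERTScore-F1(document, summary)."""
    model = CLIPModel.from_pretrained(_CLIP)
    processor = CLIPProcessor.from_pretrained(_CLIP)
    cs = clip_score(image, summary, model, processor)
    _, _, f1 = bertscore([summary], [document], lang="en")
    return alpha * cs + (1.0 - alpha) * float(f1[0])
```
In practice the weight would be tuned on a validation set, and a score of this form could also serve directly as the reinforcement-learning reward mentioned in the abstract.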
Related papers
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs)
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE)
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z)
- How to Find Strong Summary Coherence Measures? A Toolbox and a Comparative Study for Summary Coherence Measure Evaluation [3.434197496862117]
We conduct a large-scale investigation of various methods for summary coherence modelling on an even playing field.
We introduce two novel analysis measures, intra-system correlation and bias matrices, that help identify biases in coherence measures and provide robustness against system-level confounders.
While none of the currently available automatic coherence measures are able to assign reliable coherence scores to system summaries across all evaluation metrics, large-scale language models show promising results, as long as fine-tuning takes into account that they need to generalize across different summary lengths.
arXiv Detail & Related papers (2022-09-14T09:42:19Z)
- Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
- A Multi-Document Coverage Reward for RELAXed Multi-Document Summarization [11.02198476454955]
We propose fine-tuning an MDS baseline with a reward that balances a reference-based metric with coverage of the input documents.
Experimental results over the Multi-News and WCEP MDS datasets show significant improvements of up to +0.95 pp average ROUGE score and +3.17 pp METEOR score over the baseline.
arXiv Detail & Related papers (2022-03-06T07:33:01Z)
- A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy [60.419107377879925]
We propose a training-free and reference-free summarization evaluation metric.
Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score.
Our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation.
arXiv Detail & Related papers (2021-06-26T05:11:27Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.