Faithful, Unfaithful or Ambiguous? Multi-Agent Debate with Initial Stance for Summary Evaluation
- URL: http://arxiv.org/abs/2502.08514v2
- Date: Thu, 13 Feb 2025 14:34:29 GMT
- Title: Faithful, Unfaithful or Ambiguous? Multi-Agent Debate with Initial Stance for Summary Evaluation
- Authors: Mahnaz Koupaee, Jake W. Vincent, Saab Mansour, Igor Shalyminov, Han He, Hwanjun Song, Raphael Shu, Jianfeng He, Yi Nian, Amy Wing-mei Wong, Kyu J. Han, Hang Su,
- Abstract summary: We propose an approach to summary faithfulness evaluation in which multiple agents are assigned initial stances.
We introduce a new dimension, ambiguity, and a detailed taxonomy to identify such special cases.
Experiments demonstrate that our approach helps identify ambiguities and achieves even stronger performance on non-ambiguous summaries.
- Score: 29.44609627447293
- License:
- Abstract: Faithfulness evaluators based on large language models (LLMs) are often fooled by the fluency of the text and struggle to identify errors in the summaries. We propose an approach to summary faithfulness evaluation in which multiple LLM-based agents are assigned initial stances (regardless of what their belief might be) and forced to come up with a reason to justify the imposed belief, thus engaging in a multi-round debate to reach an agreement. The uniformly distributed initial assignments result in a greater diversity of stances, leading to more meaningful debates and ultimately more errors identified. Furthermore, by analyzing recent faithfulness evaluation datasets, we observe that a summary is not always clearly either faithful or unfaithful to its source document. We therefore introduce a new dimension, ambiguity, and a detailed taxonomy to identify such special cases. Experiments demonstrate that our approach helps identify ambiguities and achieves even stronger performance on non-ambiguous summaries.
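The debate loop described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the `evidence_check` stub stands in for an actual LLM judge, and the agent-update rule is deliberately simplified to keep the example runnable.

```python
from collections import Counter

def evidence_check(document: str, summary: str) -> str:
    """Toy stand-in for an LLM judge: flags a summary as 'unfaithful'
    if it introduces words absent from the source document."""
    extra = set(summary.lower().split()) - set(document.lower().split())
    return "unfaithful" if extra else "faithful"

def debate_evaluate(document: str, summary: str,
                    n_agents: int = 4, max_rounds: int = 3) -> str:
    """Stance-initialized multi-agent debate (simplified sketch).

    Agents receive uniformly distributed initial stances regardless of
    their own belief, then debate for up to `max_rounds` rounds; each
    round an agent may revise its stance after reviewing the evidence.
    A majority vote decides if no full agreement is reached.
    """
    # Imposed initial stances: half 'faithful', half 'unfaithful'.
    stances = ["faithful" if i % 2 == 0 else "unfaithful"
               for i in range(n_agents)]
    for _ in range(max_rounds):
        # In the real method each agent queries an LLM with the debate
        # transcript; here every agent consults the toy evidence check.
        stances = [evidence_check(document, summary) for _ in stances]
        if len(set(stances)) == 1:  # agreement reached, stop early
            break
    return Counter(stances).most_common(1)[0][0]
```

The key idea the sketch preserves is that the initial stances are imposed uniformly rather than elicited, which is what drives the stance diversity the paper argues leads to more productive debates.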
Related papers
- Estimating Commonsense Plausibility through Semantic Shifts [66.06254418551737]
We propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts.
Evaluations on two types of fine-grained commonsense plausibility estimation tasks show that ComPaSS consistently outperforms baselines.
arXiv Detail & Related papers (2025-02-19T06:31:06Z)
- On Positional Bias of Faithfulness for Long-form Summarization [83.63283027830657]
Large Language Models (LLMs) often exhibit positional bias in long-context settings, under-attending to information in the middle of inputs.
We investigate the presence of this bias in long-form summarization, its impact on faithfulness, and various techniques to mitigate this bias.
arXiv Detail & Related papers (2024-10-31T03:50:15Z)
- STORYSUMM: Evaluating Faithfulness in Story Summarization [31.94902013480574]
We introduce a new dataset, STORYSUMM, comprising short stories with localized faithfulness labels and error explanations.
This benchmark is for evaluation methods, testing whether a given method can detect challenging inconsistencies.
arXiv Detail & Related papers (2024-07-09T02:06:30Z)
- FABLES: Evaluating faithfulness and content selection in book-length summarization [55.50680057160788]
In this paper, we conduct the first large-scale human evaluation of faithfulness and content selection on book-length documents.
We collect FABLES, a dataset of annotations on 3,158 claims made in LLM-generated summaries of 26 books, at a cost of $5.2K USD.
An analysis of the annotations reveals that most unfaithful claims relate to events and character states, and they generally require indirect reasoning over the narrative to invalidate.
arXiv Detail & Related papers (2024-04-01T17:33:38Z)
- FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE)
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z)
- Fine-Grained Natural Language Inference Based Faithfulness Evaluation for Diverse Summarisation Tasks [14.319567507959759]
We study existing approaches to leverage off-the-shelf Natural Language Inference (NLI) models for the evaluation of summary faithfulness.
We propose a novel approach, namely InFusE, that uses a variable premise size and simplifies summary sentences into shorter hypotheses.
arXiv Detail & Related papers (2024-02-27T15:57:11Z)
- Semantic Consistency for Assuring Reliability of Large Language Models [9.876355290198639]
Large Language Models (LLMs) exhibit remarkable fluency and competence across various natural language tasks.
We introduce a general measure of semantic consistency, and formulate multiple versions of this metric to evaluate the performance of various LLMs.
We propose a novel prompting strategy, called Ask-to-Choose (A2C), to enhance semantic consistency.
arXiv Detail & Related papers (2023-08-17T18:11:33Z)
- Interpretable Automatic Fine-grained Inconsistency Detection in Text Summarization [56.94741578760294]
We propose the task of fine-grained inconsistency detection, the goal of which is to predict the fine-grained types of factual errors in a summary.
Motivated by how humans inspect factual inconsistency in summaries, we propose an interpretable fine-grained inconsistency detection model, FineGrainFact.
arXiv Detail & Related papers (2023-05-23T22:11:47Z)
- Analyzing and Evaluating Faithfulness in Dialogue Summarization [67.07947198421421]
We first perform a fine-grained human analysis of the faithfulness of dialogue summaries and observe that over 35% of generated summaries are faithfully inconsistent with respect to the source dialogues.
We present a new model-level faithfulness evaluation method that examines generation models with multiple-choice questions created by rule-based transformations.
arXiv Detail & Related papers (2022-10-21T07:22:43Z)
- Beyond Model Interpretability: On the Faithfulness and Adversarial Robustness of Contrastive Textual Explanations [2.543865489517869]
This work motivates textual counterfactuals by laying the ground for a novel evaluation scheme inspired by the faithfulness of explanations.
Experiments on sentiment analysis data show that the connectedness of counterfactuals to their original counterparts is not obvious in either model.
arXiv Detail & Related papers (2022-10-17T09:50:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.