SummExecEdit: A Factual Consistency Benchmark in Summarization with Executable Edits
- URL: http://arxiv.org/abs/2412.13378v2
- Date: Fri, 30 May 2025 19:47:02 GMT
- Title: SummExecEdit: A Factual Consistency Benchmark in Summarization with Executable Edits
- Authors: Onkar Thorat, Philippe Laban, Chien-Sheng Wu
- Abstract summary: We introduce SummExecEdit, a novel pipeline and benchmark to assess models on their ability to both detect factual errors and provide accurate explanations. The top-performing model, Claude3-Opus, achieves a joint detection and explanation score of only 0.49 in our benchmark. We identify four primary types of explanation errors, with 45.4% of them involving a focus on completely unrelated parts of the summary.
- Score: 31.98028879922584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Detecting factual inconsistencies in summarization is critical, yet existing benchmarks lack the necessary challenge and interpretability for robust evaluation. In this paper, we introduce SummExecEdit, a novel pipeline and benchmark leveraging executable edits to assess models on their ability to both detect factual errors and provide accurate explanations. The top-performing model, Claude3-Opus, achieves a joint detection and explanation score of only 0.49 in our benchmark, with individual scores of 0.67 for detection and 0.73 for explanation. We conduct detailed evaluations to assess the current state of models in this field and find that more than half of the 20+ LLMs in our study struggle with over 30% of the SummExecEdit benchmark. Additionally, we identify four primary types of explanation errors, with 45.4% of them involving a focus on completely unrelated parts of the summary.
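To make the setup concrete, here is a minimal, illustrative sketch of what an "executable edit" and a joint detection-and-explanation score could look like. The `ExecutableEdit` class, the toy texts, and the scoring rule (crediting an example only when the error is detected, weighted by explanation quality, which is numerically consistent with 0.67 x 0.73 ≈ 0.49) are assumptions for illustration, not the paper's exact implementation.

```python
from dataclasses import dataclass

@dataclass
class ExecutableEdit:
    """Hypothetical representation of an executable edit: a programmatic
    replacement that injects a factual inconsistency into a faithful summary."""
    find: str      # span in the faithful summary to replace
    replace: str   # replacement that contradicts the source document

    def apply(self, summary: str) -> str:
        # The edit is "executable" only if its target span actually occurs.
        assert self.find in summary, "edit target not found in summary"
        return summary.replace(self.find, self.replace, 1)

def joint_score(detections, explanation_scores):
    """Joint detection-and-explanation score over a benchmark (assumed rule:
    an example earns its explanation score, in [0, 1], only if the error was
    detected; otherwise it earns 0)."""
    assert len(detections) == len(explanation_scores)
    credited = [e if d else 0.0 for d, e in zip(detections, explanation_scores)]
    return sum(credited) / len(credited)

# Toy usage with made-up data (not from the paper):
edit = ExecutableEdit(find="in 2021", replace="in 2011")
print(edit.apply("The company was founded in 2021 by two engineers."))
print(joint_score([True, True, False], [0.9, 0.6, 0.0]))  # 0.5
```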
Related papers
- Iterative Prompt Refinement for Dyslexia-Friendly Text Summarization Using GPT-4o [1.4401311275746886]
This paper presents an empirical study on dyslexia-friendly text summarization using an iterative prompt-based refinement pipeline built on GPT-4o. We evaluate the pipeline on approximately 2,000 news article samples, applying a readability target of Flesch Reading Ease >= 90. Results show that the majority of summaries meet the readability threshold within four attempts, with many succeeding on the first try.
arXiv Detail & Related papers (2026-02-26T01:46:40Z)
- Solver-in-the-Loop: MDP-Based Benchmarks for Self-Correction and Behavioral Rationality in Operations Research [19.31559944205485]
Operations Research practitioners routinely debug infeasible models through an iterative process. We introduce two benchmarks that place the solver in the evaluation loop. We find that domain-specific RLVR training enables an 8B model to surpass frontier APIs.
arXiv Detail & Related papers (2026-01-28T20:02:44Z)
- Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements [78.87065404966002]
Existing benchmarks predominantly curate questions at the question level. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up.
arXiv Detail & Related papers (2025-12-31T13:55:54Z)
- Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline [58.832237984587664]
We develop VNLI-Critique, a model for automated sentence-level factuality classification and critique generation. We highlight three key applications: (1) VNLI-Critique demonstrates robust generalization, validated by state-of-the-art performance on the M-HalDetect benchmark; (2) the VNLI-Critique-driven AutoRater for DOCCI-Critique provides reliable VLM rankings, showing excellent alignment with human factuality judgments; and (3) an innovative Critic-and-Revise pipeline achieves substantial improvements in caption factuality.
arXiv Detail & Related papers (2025-06-09T10:57:26Z)
- VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation [0.8087612190556891]
VADER comprises 174 real-world software vulnerabilities, each carefully curated from GitHub and annotated by security experts. For each vulnerability case, models are tasked with identifying the flaw, classifying it using the Common Weakness Enumeration (CWE), explaining its underlying cause, proposing a patch, and formulating a test plan. Using a one-shot prompting strategy, we benchmark six state-of-the-art LLMs (Claude 3.7 Sonnet, Gemini 2.5 Pro, GPT-4.1, GPT-4.5, Grok 3 Beta, and o3) on VADER. Our results show that current state-of-the-...
arXiv Detail & Related papers (2025-05-26T01:20:44Z)
- Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications [0.7124971549479361]
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification. We determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability.
arXiv Detail & Related papers (2025-05-20T21:12:58Z)
- YourBench: Easy Custom Evaluation Sets for Everyone [12.995134931278056]
YourBench is a novel, open-source framework for evaluating large language models (LLMs). It generates reliable, up-to-date, and domain-tailored benchmarks cheaply and without manual annotation. We release the YourBench library, the Tempora-0325 dataset, 150k+ question-answer pairs based on Tempora, and all evaluation and inference traces.
arXiv Detail & Related papers (2025-04-02T15:40:24Z)
- STORYSUMM: Evaluating Faithfulness in Story Summarization [31.94902013480574]
We introduce a new dataset, STORYSUMM, comprising short stories with localized faithfulness labels and error explanations.
This benchmark is for evaluation methods, testing whether a given method can detect challenging inconsistencies.
arXiv Detail & Related papers (2024-07-09T02:06:30Z)
- Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual Errors [11.07539342949602]
We propose an end-to-end framework for detecting factual errors in text summarization.
Our framework uses a diverse set of LLM prompts to identify factual inconsistencies.
We calibrate the ensembled models to produce empirically accurate probabilities that a text is factually consistent or free of hallucination (an illustrative sketch of this ensemble-and-calibrate idea appears after this list).
arXiv Detail & Related papers (2024-06-18T18:59:37Z)
- Assessing the Efficacy of Grammar Error Correction: A Human Evaluation Approach in the Japanese Context [10.047123247001714]
We evaluate the performance of the state-of-the-art sequence tagging grammar error detection and correction model (SeqTagger).
With an automatic annotation toolkit, ERRANT, we first evaluated SeqTagger's performance on error correction with human expert correction as the benchmark.
Results indicated a precision of 63.66% and a recall of 20.19% for error correction in the full dataset.
arXiv Detail & Related papers (2024-02-28T06:43:43Z)
- AttributionBench: How Hard is Automatic Attribution Evaluation? [19.872081697282002]
We present AttributionBench, a comprehensive benchmark compiled from various existing attribution datasets.
Our experiments show that even a fine-tuned GPT-3.5 only achieves around 80% macro-F1 under a binary classification formulation.
A detailed analysis of more than 300 error cases indicates that a majority of failures stem from the model's inability to process nuanced information.
arXiv Detail & Related papers (2024-02-23T04:23:33Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Zero-shot Faithful Factual Error Correction [53.121642212060536]
Faithfully correcting factual errors is critical for maintaining the integrity of textual knowledge bases and preventing hallucinations in sequence-to-sequence models.
We present a zero-shot framework that formulates questions about input claims, looks for correct answers in the given evidence, and assesses the faithfulness of each correction based on its consistency with the evidence.
arXiv Detail & Related papers (2023-05-13T18:55:20Z)
- BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics [70.52570641514146]
We present a benchmark of unfaithful minimal pairs (BUMP).
BUMP is a dataset of 889 human-written, minimally different summary pairs.
Unlike non-pair-based datasets, BUMP can be used to measure the consistency of metrics.
arXiv Detail & Related papers (2022-12-20T02:17:30Z)
- Evaluating the Factual Consistency of Large Language Models Through News Summarization [97.04685401448499]
We propose a new benchmark called FIB (Factual Inconsistency Benchmark) that focuses on the task of summarization.
For factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent.
For factually inconsistent summaries, we generate summaries from a suite of summarization models that we have manually annotated as factually inconsistent.
arXiv Detail & Related papers (2022-11-15T18:50:34Z)
- Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations [58.442103936918805]
We show that Attention Mask Consistency (AMC) produces superior visual grounding results compared to previous methods.
AMC is effective, easy to implement, and is general as it can be adopted by any vision-language model.
arXiv Detail & Related papers (2022-06-30T17:55:12Z)
- Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors [105.12462629663757]
In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model.
We compare the performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models.
arXiv Detail & Related papers (2022-05-25T15:26:48Z)
- Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries [59.27273928454995]
Current pre-trained models applied to summarization are prone to factual inconsistencies which misrepresent the source text or introduce extraneous information.
We create a crowdsourcing evaluation framework for factual consistency using the rating-based Likert scale and ranking-based Best-Worst Scaling protocols.
We find that ranking-based protocols offer a more reliable measure of summary quality across datasets, while the reliability of Likert ratings depends on the target dataset and the evaluation design.
arXiv Detail & Related papers (2021-09-19T19:05:00Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
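Following up on the DEEP entry above, the sketch below illustrates one way prompt ensembling plus calibration could be wired together. The prompt wordings, the `ask_llm` stub, and the use of Platt scaling (logistic regression) for calibration are assumptions for illustration; the paper's actual prompts, models, and calibration procedure may differ.

```python
# Illustrative only: prompt ensembling + calibration for factual-consistency
# detection, in the spirit of the DEEP entry above. `ask_llm` is a hypothetical
# stub to be replaced with a real LLM call; Platt scaling via logistic
# regression is an assumed calibration choice, not necessarily DEEP's.
import numpy as np
from sklearn.linear_model import LogisticRegression

PROMPTS = [  # a small, diverse set of yes/no prompts (wording is illustrative)
    "Is every fact in the summary supported by the document? Answer yes or no.",
    "Is the summary free of contradictions with the document? Answer yes or no.",
    "Would a careful fact-checker accept this summary as faithful? Answer yes or no.",
]

def ask_llm(prompt: str, document: str, summary: str) -> bool:
    """Hypothetical stub: return True iff the LLM answers 'yes' (consistent)."""
    raise NotImplementedError("plug in an actual LLM client here")

def ensemble_votes(document: str, summary: str) -> np.ndarray:
    """One binary vote per prompt; the votes are the calibrator's features."""
    return np.array([float(ask_llm(p, document, summary)) for p in PROMPTS])

def fit_calibrator(val_examples):
    """Fit Platt scaling on labeled (document, summary, is_consistent) triples
    so the ensemble outputs an empirically meaningful probability."""
    X = np.stack([ensemble_votes(d, s) for d, s, _ in val_examples])
    y = np.array([int(label) for _, _, label in val_examples])
    return LogisticRegression().fit(X, y)

def consistency_probability(calibrator, document: str, summary: str) -> float:
    votes = ensemble_votes(document, summary).reshape(1, -1)
    return float(calibrator.predict_proba(votes)[0, 1])
```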