Related papers: PlainQAFact: Retrieval-augmented Factual Consistency Evaluation Metric for Biomedical Plain Language Summarization

PlainQAFact: Retrieval-augmented Factual Consistency Evaluation Metric for Biomedical Plain Language Summarization

URL: http://arxiv.org/abs/2503.08890v2
Date: Sun, 07 Sep 2025 23:40:32 GMT
Title: PlainQAFact: Retrieval-augmented Factual Consistency Evaluation Metric for Biomedical Plain Language Summarization
Authors: Zhiwen You, Yue Guo,
Abstract summary: Hallucinated outputs from large language models pose risks in the medical domain.<n>We introduce PlainQAFact, an automatic factual consistency evaluation metric trained on a fine-grained, human-annotated dataset PlainFact.
Score: 5.5899921245557
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hallucinated outputs from large language models (LLMs) pose risks in the medical domain, especially for lay audiences making health-related decisions. Existing automatic factual consistency evaluation methods, such as entailment- and question-answering (QA) -based, struggle with plain language summarization (PLS) due to elaborative explanation phenomenon, which introduces external content (e.g., definitions, background, examples) absent from the scientific abstract to enhance comprehension. To address this, we introduce PlainQAFact, an automatic factual consistency evaluation metric trained on a fine-grained, human-annotated dataset PlainFact, for evaluating factual consistency of both source-simplified and elaborately explained sentences. PlainQAFact first classifies sentence type, then applies a retrieval-augmented QA scoring method. Empirical results show that existing evaluation metrics fail to evaluate the factual consistency in PLS, especially for elaborative explanations, whereas PlainQAFact consistently outperforms them across all evaluation settings. We further analyze PlainQAFact's effectiveness across external knowledge sources, answer extraction strategies, answer overlap measures, and document granularity levels, refining its overall factual consistency assessment. Taken together, our work presents the first evaluation metric designed for PLS factual consistency evaluation, providing the community with both a robust benchmark and a practical tool to advance reliable and safe plain language communication in the medical domain. PlainQAFact and PlainFact are available at: https://github.com/zhiwenyou103/PlainQAFact

Related papers

AlignCheck: a Semantic Open-Domain Metric for Factual Consistency Assessment [0.0]
We propose an interpretable framework for factual consistency assessment for in-domain and open-domain texts.<n>Our approach decomposes text into atomic facts and introduces a flexible, schema-free methodology.<n>We benchmark our approach on popular general and clinical datasets and release our code to support fact-aware model training.
arXiv Detail & Related papers (2025-12-03T10:14:31Z)
MedScore: Generalizable Factuality Evaluation of Free-Form Medical Answers by Domain-adapted Claim Decomposition and Verification [51.82420076479152]
We propose MedScore, a new pipeline to decompose medical answers into condition-aware valid facts and verify against in-domain corpora.<n>Our method extracts up to three times more valid facts than existing methods, reducing hallucination and vague references, and retaining condition-dependency in facts.
arXiv Detail & Related papers (2025-05-24T01:23:09Z)
Noiser: Bounded Input Perturbations for Attributing Large Language Models [17.82404809465846]
We introduce Noiser, a perturbation-based FA method that imposes bounded noise on each input embedding. We demonstrate that Noiser consistently outperforms existing gradient-based, attention-based, and perturbation-based FA methods in terms of both faithfulness and answerability.
arXiv Detail & Related papers (2025-04-03T10:59:37Z)
Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales. We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score. Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE) FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z)
Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation.<n>Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process.<n>AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z)
FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence [46.71469172542448]
This paper presents FactPICO, a factuality benchmark for plain language summarization of medical texts. It consists of 345 plain language summaries of abstracts generated from three randomized controlled trials (RCTs) We assess the factuality of critical elements of RCTs in those summaries, as well as the reported findings concerning these.
arXiv Detail & Related papers (2024-02-18T04:45:01Z)
AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation [57.8363998797433]
We propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs) Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing for coherent factually inconsistent summaries to be generated with high error-type coverage.
arXiv Detail & Related papers (2023-11-16T02:56:29Z)
Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning. They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health. Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
Improving Faithfulness of Abstractive Summarization by Controlling Confounding Effect of Irrelevant Sentences [38.919090721583075]
We show that factual inconsistency can be caused by irrelevant parts of the input text, which act as confounders. We design a simple multi-task model to control such confounding by leveraging human-annotated relevant sentences when available. Our approach improves faithfulness scores by 20% over strong baselines on AnswerSumm citepfabbri 2021answersumm dataset.
arXiv Detail & Related papers (2022-12-19T18:51:06Z)
Factual Consistency Evaluation for Text Summarization via Counterfactual Estimation [42.63902468258758]
We propose a novel metric to evaluate the factual consistency in text summarization via counterfactual estimation. We conduct a series of experiments on three public abstractive text summarization datasets.
arXiv Detail & Related papers (2021-08-30T11:48:41Z)
GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics. Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation. It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.