Related papers: FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence

FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence

URL: http://arxiv.org/abs/2402.11456v2
Date: Wed, 5 Jun 2024 01:27:00 GMT
Title: FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence
Authors: Sebastian Antony Joseph, Lily Chen, Jan Trienes, Hannah Louisa Göke, Monika Coers, Wei Xu, Byron C Wallace, Junyi Jessy Li,
Abstract summary: This paper presents FactPICO, a factuality benchmark for plain language summarization of medical texts. It consists of 345 plain language summaries of abstracts generated from three randomized controlled trials (RCTs) We assess the factuality of critical elements of RCTs in those summaries, as well as the reported findings concerning these.
Score: 46.71469172542448
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Plain language summarization with LLMs can be useful for improving textual accessibility of technical content. But how factual are these summaries in a high-stakes domain like medicine? This paper presents FactPICO, a factuality benchmark for plain language summarization of medical texts describing randomized controlled trials (RCTs), which are the basis of evidence-based medicine and can directly inform patient treatment. FactPICO consists of 345 plain language summaries of RCT abstracts generated from three LLMs (i.e., GPT-4, Llama-2, and Alpaca), with fine-grained evaluation and natural language rationales from experts. We assess the factuality of critical elements of RCTs in those summaries: Populations, Interventions, Comparators, Outcomes (PICO), as well as the reported findings concerning these. We also evaluate the correctness of the extra information (e.g., explanations) added by LLMs. Using FactPICO, we benchmark a range of existing factuality metrics, including the newly devised ones based on LLMs. We find that plain language summarization of medical evidence is still challenging, especially when balancing between simplicity and factuality, and that existing metrics correlate poorly with expert judgments on the instance level.

Related papers

MedScore: Factuality Evaluation of Free-Form Medical Answers [54.722181966548895]
We propose MedScore, a new approach to decomposing medical answers into condition-aware valid facts.<n>Our method extracts up to three times more valid facts than existing methods.
arXiv Detail & Related papers (2025-05-24T01:23:09Z)
PlainQAFact: Automatic Factuality Evaluation Metric for Biomedical Plain Language Summaries Generation [3.8868752812726064]
We introduce PlainQAFact, a framework trained on a fine-grained, human-annotated dataset PlainFact. PlainQAFact first classifies factuality type and then assesses factuality using a retrieval-augmented QA-based scoring method.
arXiv Detail & Related papers (2025-03-11T20:59:53Z)
Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities. We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z)
Assessing the Limitations of Large Language Models in Clinical Fact Decomposition [3.919419934122265]
We present FactEHR, a dataset consisting of full document fact decompositions for 2,168 clinical notes spanning four types from three hospital systems. Our evaluation, including review by clinicians, highlights significant variability in the quality of fact decomposition for four commonly used LLMs. Results underscore the need for better LLM capabilities to support factual verification in clinical text.
arXiv Detail & Related papers (2024-12-17T00:07:05Z)
The current status of large language models in summarizing radiology report impressions [13.402769727597812]
The effectiveness of large language models (LLMs) in summarizing radiology report impressions remains unclear. Three types of radiology reports, i.e., CT, PET-CT, and Ultrasound reports, are collected from Peking University Cancer Hospital and Institute. We use the report findings to construct the zero-shot, one-shot, and three-shot prompts with complete example reports to generate the impressions.
arXiv Detail & Related papers (2024-06-04T09:23:30Z)
FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE) FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z)
Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [62.32403630651586]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z)
Zero-shot Causal Graph Extrapolation from Text via LLMs [50.596179963913045]
We evaluate the ability of large language models (LLMs) to infer causal relations from natural language. LLMs show competitive performance in a benchmark of pairwise relations without needing (explicit) training samples. We extend our approach to extrapolating causal graphs through iterated pairwise queries.
arXiv Detail & Related papers (2023-12-22T13:14:38Z)
Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization [8.456700096020601]
Large language models (LLMs) have shown promise in natural language processing (NLP), but their effectiveness on a diverse range of clinical summarization tasks remains unproven. In this study, we apply adaptation methods to eight LLMs, spanning four distinct clinical summarization tasks. A clinical reader study with ten physicians evaluates summary, completeness, correctness, and conciseness; in a majority of cases, summaries from our best adapted LLMs are either equivalent (45%) or superior (36%) compared to summaries from medical experts.
arXiv Detail & Related papers (2023-09-14T05:15:01Z)
Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports. We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM. We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z)
Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning. They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health. Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
Exploring Optimal Granularity for Extractive Summarization of Unstructured Health Records: Analysis of the Largest Multi-Institutional Archive of Health Records in Japan [25.195233641408233]
"Discharge summaries" are one promising application of the summarization. It remains unclear how the summaries should be generated from the unstructured source. This study aimed to identify the optimal granularity in summarization.
arXiv Detail & Related papers (2022-09-20T23:26:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.