MedScore: Generalizable Factuality Evaluation of Free-Form Medical Answers by Domain-adapted Claim Decomposition and Verification
- URL: http://arxiv.org/abs/2505.18452v2
- Date: Sun, 19 Oct 2025 02:13:41 GMT
- Title: MedScore: Generalizable Factuality Evaluation of Free-Form Medical Answers by Domain-adapted Claim Decomposition and Verification
- Authors: Heyuan Huang, Alexandra DeLucia, Vijay Murari Tiyyala, Mark Dredze
- Abstract summary: We propose MedScore, a new pipeline to decompose medical answers into condition-aware valid facts and verify them against in-domain corpora. Our method extracts up to three times more valid facts than existing methods, reducing hallucination and vague references while retaining condition-dependency in facts.
- Score: 51.82420076479152
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Large Language Models (LLMs) can generate fluent and convincing responses, they are not necessarily correct. This is especially apparent in the popular decompose-then-verify factuality evaluation pipeline, where LLMs evaluate generations by decomposing them into individual, valid claims. Factuality evaluation is especially important for medical answers, since incorrect medical information could seriously harm the patient. However, existing factuality systems are a poor match for the medical domain, as they are typically evaluated only on objective, entity-centric, formulaic texts such as biographies and historical topics. Medical answers, by contrast, are condition-dependent, conversational, hypothetical, structurally diverse, and subjective, which makes decomposition into valid facts challenging. We propose MedScore, a new pipeline that decomposes medical answers into condition-aware valid facts and verifies them against in-domain corpora. Our method extracts up to three times more valid facts than existing methods, reducing hallucination and vague references while retaining condition-dependency in facts. The resulting factuality score varies substantially by decomposition method, verification corpus, and backbone LLM, highlighting the importance of customizing each step for reliable factuality evaluation using our generalizable, modularized pipeline for domain adaptation.
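The decompose-then-verify pipeline described in the abstract can be sketched in a few lines of code. The sketch below is illustrative only and is not MedScore's actual implementation: in MedScore, decomposition and verification are performed by prompted LLMs against an in-domain corpus, whereas here simple stand-ins (sentence splitting and token-overlap matching) keep the control flow runnable.

```python
# Minimal sketch of a decompose-then-verify factuality pipeline.
# Decomposition and verification are stand-ins for the LLM-based steps:
# sentence splitting replaces claim decomposition, and token-overlap
# retrieval replaces corpus-grounded verification.
import re


def decompose(answer: str) -> list[str]:
    """Split an answer into candidate claims (stand-in for LLM decomposition)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def supported(claim: str, corpus: list[str], threshold: float = 0.5) -> bool:
    """Check whether any corpus passage covers most of the claim's tokens
    (stand-in for LLM- or NLI-based verification)."""
    claim_tokens = set(re.findall(r"\w+", claim.lower()))
    for passage in corpus:
        passage_tokens = set(re.findall(r"\w+", passage.lower()))
        if claim_tokens and len(claim_tokens & passage_tokens) / len(claim_tokens) >= threshold:
            return True
    return False


def factuality_score(answer: str, corpus: list[str]) -> float:
    """Fraction of decomposed claims supported by the verification corpus."""
    claims = decompose(answer)
    if not claims:
        return 0.0
    return sum(supported(c, corpus) for c in claims) / len(claims)


corpus = [
    "Ibuprofen is a nonsteroidal anti-inflammatory drug used to relieve pain.",
    "Aspirin can increase the risk of stomach bleeding.",
]
answer = "Ibuprofen relieves pain. Ibuprofen cures bacterial infections."
print(factuality_score(answer, corpus))  # 0.5: one of the two claims is supported
```

The paper's point is that each component of this pipeline (how claims are decomposed, which corpus is used, and which backbone model verifies) materially changes the resulting score, so each must be adapted to the domain.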
Related papers
- AlignCheck: a Semantic Open-Domain Metric for Factual Consistency Assessment [0.0]
We propose an interpretable framework for factual consistency assessment for in-domain and open-domain texts. Our approach decomposes text into atomic facts and introduces a flexible, schema-free methodology. We benchmark our approach on popular general and clinical datasets and release our code to support fact-aware model training.
arXiv Detail & Related papers (2025-12-03T10:14:31Z) - MedREK: Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts [70.64143198545031]
We propose MedREK, a retrieval-based editing framework that integrates a shared query-key module for precise matching with an attention-based prompt encoder for informative guidance. Our results on various medical benchmarks demonstrate that MedREK achieves superior performance across different core metrics.
arXiv Detail & Related papers (2025-10-15T12:50:33Z) - Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine [59.604255567812714]
We show how experts verify real claims from social media by synthesizing medical evidence. Experts face difficulty connecting claims in the wild to scientific evidence in the form of clinical trials. We argue that fact-checking should be approached and evaluated as an interactive communication problem.
arXiv Detail & Related papers (2025-06-25T22:58:08Z) - PlainQAFact: Retrieval-augmented Factual Consistency Evaluation Metric for Biomedical Plain Language Summarization [5.5899921245557]
Hallucinated outputs from large language models pose risks in the medical domain. We introduce PlainQAFact, an automatic factual consistency evaluation metric trained on a fine-grained, human-annotated dataset PlainFact.
arXiv Detail & Related papers (2025-03-11T20:59:53Z) - Medical Hallucinations in Foundation Models and Their Impact on Healthcare [53.97060824532454]
Foundation Models that are capable of processing and generating multi-modal data have transformed AI's role in medicine. We define medical hallucination as any instance in which a model generates misleading medical content. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies.
arXiv Detail & Related papers (2025-02-26T02:30:44Z) - Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities. We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z) - FactEHR: A Dataset for Evaluating Factuality in Clinical Notes Using LLMs [3.919419934122265]
We present FactEHR, an NLI dataset consisting of document fact decompositions for 2,168 clinical notes spanning four types from three hospital systems. We assess the generated facts on different axes, from entailment evaluation of LLMs to a qualitative analysis. The results underscore the need for better LLM capabilities to support factual verification in clinical text.
arXiv Detail & Related papers (2024-12-17T00:07:05Z) - OLAPH: Improving Factuality in Biomedical Long-form Question Answering [15.585833125854418]
We introduce MedLFQA, a benchmark dataset reconstructed using long-form question-answering datasets related to the biomedical domain.
We also propose OLAPH, a simple and novel framework that utilizes cost-effective and multifaceted automatic evaluation.
Our findings reveal that a 7B LLM trained with our OLAPH framework can provide long answers comparable to the medical experts' answers in terms of factuality.
arXiv Detail & Related papers (2024-05-21T11:50:16Z) - FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE)
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z) - Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z) - FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence [46.71469172542448]
This paper presents FactPICO, a factuality benchmark for plain language summarization of medical texts.
It consists of 345 plain language summaries of abstracts generated from three randomized controlled trials (RCTs).
We assess the factuality of critical elements of the RCTs in those summaries, as well as the reported findings concerning them.
arXiv Detail & Related papers (2024-02-18T04:45:01Z) - Merging Facts, Crafting Fallacies: Evaluating the Contradictory Nature of Aggregated Factual Claims in Long-Form Generations [63.90357081534995]
Long-form generations from large language models (LLMs) contain a mix of factual and non-factual claims.
We show that strong open-source models like Llama-chat can generate paragraphs that contain verifiable facts, but the facts are combined into a non-factual paragraph due to entity ambiguity.
We introduce an enhanced metric, D-FActScore, specifically designed for content with ambiguous entities.
arXiv Detail & Related papers (2024-02-08T12:36:29Z) - Extrinsically-Focused Evaluation of Omissions in Medical Summarization [9.847304366680772]
Large language models (LLMs) have shown promise in safety-critical applications such as healthcare, yet the ability to quantify performance has lagged.
We propose MED-OMIT as a metric to explore the challenge of evaluating a summary of a patient's medical record.
arXiv Detail & Related papers (2023-11-14T16:46:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.