MedScore: Factuality Evaluation of Free-Form Medical Answers
- URL: http://arxiv.org/abs/2505.18452v1
- Date: Sat, 24 May 2025 01:23:09 GMT
- Title: MedScore: Factuality Evaluation of Free-Form Medical Answers
- Authors: Heyuan Huang, Alexandra DeLucia, Vijay Murari Tiyyala, Mark Dredze
- Abstract summary: We propose MedScore, a new approach to decomposing medical answers into condition-aware valid facts. Our method extracts up to three times more valid facts than existing methods.
- Score: 54.722181966548895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Large Language Models (LLMs) can generate fluent and convincing responses, they are not necessarily correct. This is especially apparent in the popular decompose-then-verify factuality evaluation pipeline, where LLMs evaluate a generation by decomposing it into individual, verifiable claims and checking each one. Factuality evaluation is especially important for medical answers, since incorrect medical information could seriously harm the patient. However, existing factuality systems are a poor match for the medical domain: they are typically evaluated only on objective, entity-centric, formulaic texts such as biographies and historical topics. Medical answers, by contrast, are condition-dependent, conversational, hypothetical, structurally diverse, and subjective, which makes decomposition into valid facts challenging. We propose MedScore, a new approach to decomposing medical answers into condition-aware valid facts. Our method extracts up to three times more valid facts than existing methods, reduces hallucination and vague references, and retains condition-dependency in facts. The resulting factuality score varies significantly by decomposition method, verification corpus, and the backbone LLM used, highlighting the importance of customizing each step for reliable factuality evaluation.
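The decompose-then-verify pipeline the abstract describes is straightforward to sketch. Below is a minimal illustration in Python; it is not MedScore's released implementation. The `call_llm` and `retrieve_evidence` callables are hypothetical placeholders for an LLM API and a retriever over the verification corpus, and the prompt text only paraphrases the condition-aware decomposition idea from the abstract.

```python
# Minimal sketch of a decompose-then-verify factuality pipeline.
# `call_llm` and `retrieve_evidence` are hypothetical placeholders,
# not MedScore's actual code or prompts.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Claim:
    text: str
    supported: bool = False


def decompose(answer: str, call_llm: Callable[[str], str]) -> List[Claim]:
    """Split a free-form answer into self-contained claims.

    MedScore's contribution is condition-aware decomposition at this
    step: a fact like "if the patient is pregnant, avoid drug X" keeps
    its condition instead of being flattened to "avoid drug X".
    """
    prompt = (
        "Decompose the following medical answer into self-contained, "
        "verifiable facts, one per line. Preserve any conditions "
        "(patient state, dosage context) attached to each fact.\n\n"
        + answer
    )
    lines = call_llm(prompt).splitlines()
    return [Claim(text=ln.strip("- ").strip()) for ln in lines if ln.strip()]


def factuality_score(claims: List[Claim],
                     retrieve_evidence: Callable[[str], str],
                     call_llm: Callable[[str], str]) -> float:
    """Verify each claim against retrieved evidence and return the
    fraction of claims judged supported."""
    for claim in claims:
        evidence = retrieve_evidence(claim.text)
        verdict = call_llm(
            f"Evidence: {evidence}\nClaim: {claim.text}\n"
            "Answer 'supported' or 'unsupported'."
        )
        claim.supported = verdict.strip().lower().startswith("supported")
    return sum(c.supported for c in claims) / len(claims) if claims else 0.0
```

The final score depends on all three components (the decomposition prompt, the verification corpus behind `retrieve_evidence`, and the verifier LLM), which is exactly the sensitivity the abstract reports.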
Related papers
- Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine [59.604255567812714]
We show how experts verify real claims from social media by synthesizing medical evidence. Experts face difficulties connecting claims in the wild to scientific evidence in the form of clinical trials. We argue that fact-checking should be approached and evaluated as an interactive communication problem.
arXiv Detail & Related papers (2025-06-25T22:58:08Z) - Medical Hallucinations in Foundation Models and Their Impact on Healthcare [53.97060824532454]
Foundation models capable of processing and generating multi-modal data have transformed AI's role in medicine. We define medical hallucination as any instance in which a model generates misleading medical content. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search-Augmented Generation can effectively reduce hallucination rates. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies.
arXiv Detail & Related papers (2025-02-26T02:30:44Z) - Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities. We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z) - FactEHR: A Dataset for Evaluating Factuality in Clinical Notes Using LLMs [3.919419934122265]
We present FactEHR, an NLI dataset consisting of document fact decompositions for 2,168 clinical notes spanning four note types from three hospital systems. We assess the generated facts along several axes, from entailment evaluation with LLMs to a qualitative analysis. The results underscore the need for better LLM capabilities to support factual verification in clinical text.
arXiv Detail & Related papers (2024-12-17T00:07:05Z) - OLAPH: Improving Factuality in Biomedical Long-form Question Answering [15.585833125854418]
We introduce MedLFQA, a benchmark dataset reconstructed using long-form question-answering datasets related to the biomedical domain.
We also propose OLAPH, a simple and novel framework that utilizes cost-effective and multifaceted automatic evaluation.
Our findings reveal that a 7B LLM trained with our OLAPH framework can provide long answers comparable to the medical experts' answers in terms of factuality.
arXiv Detail & Related papers (2024-05-21T11:50:16Z) - FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence [46.71469172542448]
This paper presents FactPICO, a factuality benchmark for plain language summarization of medical texts.
It consists of 345 plain language summaries of abstracts generated from three randomized controlled trials (RCTs).
We assess the factuality of critical elements of the RCTs in those summaries, as well as the reported findings concerning them.
arXiv Detail & Related papers (2024-02-18T04:45:01Z) - Merging Facts, Crafting Fallacies: Evaluating the Contradictory Nature of Aggregated Factual Claims in Long-Form Generations [63.90357081534995]
Long-form generations from large language models (LLMs) contain a mix of factual and non-factual claims.
We show that strong open-source models like Llama-chat can generate paragraphs that contain verifiable facts, but the facts are combined into a non-factual paragraph due to entity ambiguity.
We introduce an enhanced metric, D-FActScore, specifically designed for content with ambiguous entities.
arXiv Detail & Related papers (2024-02-08T12:36:29Z) - Extrinsically-Focused Evaluation of Omissions in Medical Summarization [9.847304366680772]
Large language models (LLMs) have shown promise in safety-critical applications such as healthcare, yet the ability to quantify performance has lagged.
We propose MED-OMIT, a metric for evaluating omissions in summaries of a patient's medical record.
arXiv Detail & Related papers (2023-11-14T16:46:15Z)