Related papers: MedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language Models

MedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language Models

URL: http://arxiv.org/abs/2412.18947v4
Date: Fri, 28 Mar 2025 23:37:45 GMT
Title: MedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language Models
Authors: Kaiwen Zuo, Yirui Jiang,
Abstract summary: Medical Large Language Models (MLLMs) have demonstrated potential in healthcare applications.<n>Their propensity for hallucinations presents substantial risks to patient care.<n>This paper introduces MedHallBench, a comprehensive benchmark framework for evaluating and mitigating hallucinations in MLLMs.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Medical Large Language Models (MLLMs) have demonstrated potential in healthcare applications, yet their propensity for hallucinations -- generating medically implausible or inaccurate information -- presents substantial risks to patient care. This paper introduces MedHallBench, a comprehensive benchmark framework for evaluating and mitigating hallucinations in MLLMs. Our methodology integrates expert-validated medical case scenarios with established medical databases to create a robust evaluation dataset. The framework employs a sophisticated measurement system that combines automated ACHMI (Automatic Caption Hallucination Measurement in Medical Imaging) scoring with rigorous clinical expert evaluations and utilizes reinforcement learning methods to achieve automatic annotation. Through an optimized reinforcement learning from human feedback (RLHF) training pipeline specifically designed for medical applications, MedHallBench enables thorough evaluation of MLLMs across diverse clinical contexts while maintaining stringent accuracy standards. We conducted comparative experiments involving various models, utilizing the benchmark to establish a baseline for widely adopted large language models (LLMs). Our findings indicate that ACHMI provides a more nuanced understanding of the effects of hallucinations compared to traditional metrics, thereby highlighting its advantages in hallucination assessment. This research establishes a foundational framework for enhancing MLLMs' reliability in healthcare settings and presents actionable strategies for addressing the critical challenge of AI hallucinations in medical applications.

Related papers

A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine [59.78991974851707]
Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis.<n>Most medical LLMs are trained on data from a single institution, which faces limitations in generalizability and safety in heterogeneous systems.<n>We introduce the model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications.
arXiv Detail & Related papers (2026-01-29T18:48:21Z)
MedDialogRubrics: A Comprehensive Benchmark and Evaluation Framework for Multi-turn Medical Consultations in Large Language Models [15.91764739198419]
We present MedDialogRubrics, a novel benchmark comprising 5,200 synthetically constructed patient cases and over 60,000 fine-grained evaluation rubrics.<n>Our framework employs a multi-agent system to synthesize realistic patient records and chief complaints without accessing real-world electronic health records.
arXiv Detail & Related papers (2026-01-06T13:56:33Z)
Assessing Automated Fact-Checking for Medical LLM Responses with Knowledge Graphs [12.287636586297756]
The recent proliferation of large language models (LLMs) holds the potential to revolutionize healthcare.<n>This paper investigates the reliability and viability of using medical knowledge graphs (KGs) for the automated factuality evaluation of LLM-generated responses.<n>We introduce FAITH, a framework designed to probe the strengths and limitations of this KG-based approach.
arXiv Detail & Related papers (2025-11-16T22:58:22Z)
Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models [15.530083855947987]
We propose the first Multimodal Medical Reasoning-with-Retrieval framework, Med-RwR.<n>Med-RwR actively retrieves external knowledge by querying observed symptoms or domain-specific medical concepts during reasoning.<n> Evaluation on various public medical benchmarks demonstrates Med-RwR's significant improvements over baseline models.
arXiv Detail & Related papers (2025-10-21T05:18:18Z)
Automating Expert-Level Medical Reasoning Evaluation of Large Language Models [26.702477426812333]
We introduce MedThink-Bench, a benchmark for rigorous, explainable, and scalable assessment of large language models' medical reasoning.<n>We also propose LLM-w-Ref, a novel evaluation framework that leverages fine-grained rationales and LLM-as-a-Judge mechanisms.<n>Overall, MedThink-Bench offers a foundational tool for evaluating LLMs' medical reasoning, advancing their safe and responsible deployment in clinical practice.
arXiv Detail & Related papers (2025-07-10T17:58:26Z)
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation [58.25892575437433]
evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error.<n>We present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios.
arXiv Detail & Related papers (2025-06-04T15:43:14Z)
Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns. Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance. We propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
Fact or Guesswork? Evaluating Large Language Models' Medical Knowledge with Structured One-Hop Judgments [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their abilities to directly recall and apply factual medical knowledge remains under-explored.<n>We introduce the Medical Knowledge Judgment dataset (MKJ), a dataset derived from the Unified Medical Language System (UMLS), a comprehensive repository of standardized vocabularies and knowledge graphs.<n>Through a binary classification framework, MKJ evaluates LLMs' grasp of fundamental medical facts by having them assess the validity of concise, one-hop statements.
arXiv Detail & Related papers (2025-02-20T05:27:51Z)
Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs) We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets. Our experimental results reveals current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z)
Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval [61.70489848327436]
KARE is a novel framework that integrates knowledge graph (KG) community-level retrieval with large language models (LLMs) reasoning. Extensive experiments demonstrate that KARE outperforms leading models by up to 10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and readmission predictions.
arXiv Detail & Related papers (2024-10-06T18:46:28Z)
MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications [2.838746648891565]
We introduce MEDIC, a framework assessing Large Language Models (LLMs) across five critical dimensions of clinical competence. We apply MEDIC to evaluate LLMs on medical question-answering, safety, summarization, note generation, and other tasks. Results show performance disparities across model sizes, baseline vs medically finetuned models, and have implications on model selection for applications requiring specific model strengths.
arXiv Detail & Related papers (2024-09-11T14:44:51Z)
SemioLLM: Assessing Large Language Models for Semiological Analysis in Epilepsy Research [45.2233252981348]
Large Language Models have shown promising results in their ability to encode general medical knowledge. We test the ability of state-of-the-art LLMs to leverage their internal knowledge and reasoning for epilepsy diagnosis.
arXiv Detail & Related papers (2024-07-03T11:02:12Z)
MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models [55.215061531495984]
"MedBench" is a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLM. First, MedBench assembles the largest evaluation dataset (300,901 questions) to cover 43 clinical specialties. Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer remembering.
arXiv Detail & Related papers (2024-06-24T02:25:48Z)
Detecting and Evaluating Medical Hallucinations in Large Vision Language Models [22.30139330566514]
Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications. LVLMs inherit susceptibility to hallucinations-a significant concern in high-stakes medical contexts. We introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation. We also present MediHallDetector, a novel Medical LVLM engineered for precise hallucination detection.
arXiv Detail & Related papers (2024-06-14T17:14:22Z)
A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models [57.88111980149541]
We introduce Asclepius, a novel Med-MLLM benchmark that assesses Med-MLLMs in terms of distinct medical specialties and different diagnostic capacities.<n>Grounded in 3 proposed core principles, Asclepius ensures a comprehensive evaluation by encompassing 15 medical specialties.<n>We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 3 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z)
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce textbfAI Hospital, a framework simulating dynamic medical interactions between emphDoctor as player and NPCs. This setup allows for realistic assessments of LLMs in clinical scenarios. We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
Evaluation of General Large Language Models in Contextually Assessing Semantic Concepts Extracted from Adult Critical Care Electronic Health Record Notes [17.648021186810663]
The purpose of this study was to evaluate the performance of Large Language Models (LLMs) in understanding and processing real-world clinical notes. The GPT family models have demonstrated considerable efficiency, evidenced by their cost-effectiveness and time-saving capabilities.
arXiv Detail & Related papers (2024-01-24T16:52:37Z)
An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models [22.409334091186995]
Large language models (LLMs) often suffer from hallucinations, leading to overly confident but incorrect judgments. This paper introduces an automated evaluation framework that assesses the practical capabilities of LLMs as virtual doctors during multi-turn consultations.
arXiv Detail & Related papers (2023-09-05T09:24:48Z)
Large Language Models for Biomedical Knowledge Graph Construction: Information extraction from EMR notes [0.0]
We propose an end-to-end machine learning solution based on large language models (LLMs) The entities used in the KG construction process are diseases, factors, treatments, as well as manifestations that coexist with the patient while experiencing the disease. The application of the proposed methodology is demonstrated on age-related macular degeneration.
arXiv Detail & Related papers (2023-01-29T15:52:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.