Related papers: MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts

URL: http://arxiv.org/abs/2509.12440v1
Date: Mon, 15 Sep 2025 20:46:21 GMT
Title: MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts
Authors: Jiayi He, Yangmin Huang, Qianyun Du, Xiangying Zhou, Zhiyang He, Jiaxue Hu, Xiaodong Tao, Lixian Lai,
Abstract summary: We introduce MedFact, a new benchmark for Chinese medical fact-checking. It comprises 2,116 expert-annotated instances curated from diverse real-world texts. It employs a hybrid AI-human framework where expert feedback refines an AI-driven, multi-criteria filtering process.
Score: 4.809421212365958
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The increasing deployment of Large Language Models (LLMs) in healthcare necessitates a rigorous evaluation of their factual reliability. However, existing benchmarks are often limited by narrow domains of data, failing to capture the complexity of real-world medical information. To address this critical gap, we introduce MedFact, a new and challenging benchmark for Chinese medical fact-checking. MedFact comprises 2,116 expert-annotated instances curated from diverse real-world texts, spanning 13 medical specialties, 8 fine-grained error types, 4 writing styles, and multiple difficulty levels. Its construction employs a hybrid AI-human framework where iterative expert feedback refines an AI-driven, multi-criteria filtering process, ensuring both high data quality and difficulty. We conduct a comprehensive evaluation of 20 leading LLMs, benchmarking their performance on veracity classification and error localization against a human expert baseline. Our results reveal that while models can often determine if a text contains an error, precisely localizing it remains a substantial challenge, with even top-performing models falling short of human performance. Furthermore, our analysis uncovers a frequent "over-criticism" phenomenon, a tendency for models to misidentify correct information as erroneous, which is exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. By highlighting these critical challenges for deploying LLMs in medical applications, MedFact provides a robust resource to drive the development of more factually reliable and medically aware models.

Related papers

LLM-Bootstrapped Targeted Finding Guidance for Factual MLLM-based Medical Report Generation [23.74179903717012]
We introduce Fact-Flow, an innovative framework that separates the process of visual fact identification from the generation of reports. This is achieved by initially predicting clinical findings from the image, which subsequently directs the MLLM to produce a report that is factually precise. A pivotal advancement of our approach is a pipeline that leverages a Large Language Model (LLM) to autonomously create a dataset of labeled medical findings.
arXiv Detail & Related papers (2026-02-28T02:50:20Z)
When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation [18.338933046286257]
Large language models (LLMs) are increasingly employed to address diverse problems, including medical queries. LLMs often perform poorly in medical contexts, potentially leading to harmful misguidance for users. This paper focuses on fine-tuning Llama 2 7B, a transformer-based, decoder-only model, using transcripts from real patient-doctor interactions.
arXiv Detail & Related papers (2026-02-27T21:09:43Z)
MedMKEB: A Comprehensive Knowledge Editing Benchmark for Medical Multimodal Large Language Models [5.253788190589279]
We present MedMKEB, the first comprehensive benchmark designed to evaluate the reliability, generality, locality, portability, and robustness of knowledge editing. MedMKEB is built on a high-quality medical visual question-answering dataset and enriched with carefully constructed editing tasks. We incorporate human expert validation to ensure the accuracy and reliability of the benchmark.
arXiv Detail & Related papers (2025-08-07T07:09:26Z)
Towards Domain Specification of Embedding Models in Medicine [1.0713888959520208]
We propose a comprehensive benchmark suite of 51 tasks spanning classification, clustering, pair classification, and retrieval modeled on the Massive Text Embedding Benchmark (MTEB). Our results demonstrate that this combined approach establishes a robust evaluation framework and yields embeddings that consistently outperform state-of-the-art alternatives in different tasks.
arXiv Detail & Related papers (2025-07-25T16:15:00Z)
MIRA: A Novel Framework for Fusing Modalities in Medical RAG [6.044279952668295]
We introduce the Multimodal Intelligent Retrieval and Augmentation (MIRA) framework, designed to optimize factual accuracy in MLLM. MIRA consists of two key components: (1) a calibrated Rethinking and Rearrangement module that dynamically adjusts the number of retrieved contexts to manage factual risk, and (2) a medical RAG framework integrating image embeddings and a medical knowledge base with a query-rewrite module for efficient multimodal reasoning.
arXiv Detail & Related papers (2025-07-10T16:33:50Z)
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning [57.873833577058]
We build a multimodal dataset enriched with extensive medical knowledge. We then introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities.
arXiv Detail & Related papers (2025-06-08T08:47:30Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
Fact or Guesswork? Evaluating Large Language Models' Medical Knowledge with Structured One-Hop Judgments [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their abilities to directly recall and apply factual medical knowledge remain under-explored. We introduce the Medical Knowledge Judgment dataset (MKJ), a dataset derived from the Unified Medical Language System (UMLS), a comprehensive repository of standardized vocabularies and knowledge graphs. Through a binary classification framework, MKJ evaluates LLMs' grasp of fundamental medical facts by having them assess the validity of concise, one-hop statements.
arXiv Detail & Related papers (2025-02-20T05:27:51Z)
LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models [35.60385437194243]
Current Medical Large Vision Language Models (Med-LVLMs) frequently encounter factual issues. RAG, which utilizes external knowledge, can improve the factual accuracy of these models but introduces two major challenges. We propose RULE, which consists of two components. First, we introduce a provably effective strategy for controlling factuality risk through the selection of retrieved contexts. Second, based on samples where over-reliance on retrieved contexts led to errors, we curate a preference dataset to fine-tune the model.
arXiv Detail & Related papers (2024-07-06T16:45:07Z)
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between Doctor as player and NPCs. This setup allows for realistic assessments of LLMs in clinical scenarios. We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)