MedVAL: Toward Expert-Level Medical Text Validation with Language Models
- URL: http://arxiv.org/abs/2507.03152v4
- Date: Thu, 18 Sep 2025 04:11:49 GMT
- Title: MedVAL: Toward Expert-Level Medical Text Validation with Language Models
- Authors: Asad Aali, Vasiliki Bikia, Maya Varma, Nicole Chiou, Sophie Ostmeier, Arnav Singhvi, Magdalini Paschali, Ashwin Kumar, Andrew Johnston, Karimar Amador-Martinez, Eduardo Juan Perez Guerrero, Paola Naovi Cruz Rivera, Sergios Gatidis, Christian Bluethgen, Eduardo Pontes Reis, Eddy D. Zandee van Rilland, Poonam Laxmappa Hosamani, Kevin R Keet, Minjoung Go, Evelyn Ling, David B. Larson, Curtis Langlotz, Roxana Daneshjou, Jason Hom, Sanmi Koyejo, Emily Alsentzer, Akshay S. Chaudhari,
- Abstract summary: There is an immediate need to evaluate the accuracy and safety of LM-generated medical text.<n>Currently, such evaluation relies on solely manual physician review.<n>We propose MedVAL, a novel, self-supervised, data-efficient distillation method that leverages synthetic data to train evaluators.
- Score: 19.885282576644077
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LM-as-judge" paradigm (a LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a novel, self-supervised, data-efficient distillation method that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset of 840 physician-annotated outputs across 6 diverse medical tasks capturing real-world challenges. Across 10 state-of-the-art LMs spanning open-source and proprietary models, MedVAL distillation significantly improves (p < 0.001) alignment with physicians across seen and unseen tasks, increasing average F1 scores from 66% to 83%. Despite strong baseline performance, MedVAL improves the best-performing proprietary LM (GPT-4o) by 8% without training on physician-labeled data, demonstrating a performance statistically non-inferior to a single human expert (p < 0.001). To support a scalable, risk-aware pathway towards clinical integration, we open-source: 1) Codebase (https://github.com/StanfordMIMI/MedVAL), 2) MedVAL-Bench (https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench), 3) MedVAL-4B (https://huggingface.co/stanfordmimi/MedVAL-4B). Our benchmark provides evidence of LMs approaching expert-level ability in validating AI-generated medical text.
Related papers
- When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation [18.338933046286257]
Large language models (LLMs) are increasingly employed to address diverse problems, including medical queries.<n>LLMs often perform poorly in medical contexts, potentially leading to harmful misguidance for users.<n>This paper focuses on fine-tuning the Llama 2 7B, a transformer-based, decoder-only model, using transcripts from real patient-doctor interactions.
arXiv Detail & Related papers (2026-02-27T21:09:43Z) - A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine [59.78991974851707]
Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis.<n>Most medical LLMs are trained on data from a single institution, which faces limitations in generalizability and safety in heterogeneous systems.<n>We introduce the model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications.
arXiv Detail & Related papers (2026-01-29T18:48:21Z) - BioPulse-QA: A Dynamic Biomedical Question-Answering Benchmark for Evaluating Factuality, Robustness, and Bias in Large Language Models [7.8780007697387235]
We introduce BioPulse-QA, a benchmark that evaluates large language models (LLMs) on answering questions from newly published biomedical documents.<n>We evaluate four LLMs - GPT-o1, GPT-o1, Gemini-2.0-Flash, and LLaMA-3.1 8B Instruct - released prior to the publication dates of the benchmark documents.
arXiv Detail & Related papers (2026-01-19T00:38:33Z) - EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models [82.43729208063468]
Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety.<n>We study sycophancy -- models' tendency to uncritically echo user-provided information.<n>We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs.
arXiv Detail & Related papers (2025-09-24T14:09:55Z) - Arabic Large Language Models for Medical Text Generation [0.5483130283061118]
This study proposes an approach that fine-tunes large language models (LLMs) for Arabic medical text generation.<n>The system is designed to assist patients by providing accurate medical advice, diagnoses, drug recommendations, and treatment plans based on user input.
arXiv Detail & Related papers (2025-09-12T09:37:26Z) - Meta-Fair: AI-Assisted Fairness Testing of Large Language Models [2.9632404823837777]
Fairness is a core principle in the development of Artificial Intelligence (AI) systems.<n>Current approaches to fairness testing in large language models (LLMs) often rely on manual evaluation, fixed templates, deterministics, and curated datasets.<n>This work aims to lay the groundwork for a novel, automated method for testing fairness in LLMs.
arXiv Detail & Related papers (2025-07-03T11:20:59Z) - LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation [58.25892575437433]
evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error.<n>We present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios.
arXiv Detail & Related papers (2025-06-04T15:43:14Z) - Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns.<n>Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance.<n>We propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z) - Remedy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling [4.548755617115687]
We propose ReMedy, a novel MT metric framework that reformulates translation evaluation as a reward modeling task.<n>Instead of regressing on imperfect human ratings directly, ReMedy learns relative translation quality using pairwise preference data.<n>In extensive experiments across WMT22-24 shared tasks, ReMedy achieves state-of-the-art performance at both segment- and system-level evaluation.
arXiv Detail & Related papers (2025-04-18T11:11:14Z) - TheBlueScrubs-v1, a comprehensive curated medical dataset derived from the internet [1.4043931310479378]
TheBlueScrubs-v1 is a curated dataset of over 25 billion medical tokens drawn from a broad-scale internet corpus.<n>Each text is assigned three LLM-based quality scores encompassing medical relevance, precision and factual detail, and safety and ethical standards.<n>This Data Descriptor details the dataset's creation and validation, underscoring its potential utility for medical AI research.
arXiv Detail & Related papers (2025-04-01T22:25:19Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references.<n>We propose a framework encompassing three critical examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey.<n>Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking, etc.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.<n>We propose a novel approach utilizing structured medical reasoning.<n>Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z) - ACE-$M^3$: Automatic Capability Evaluator for Multimodal Medical Models [34.81544597731073]
We introduce ACE-$M3$, an open-sourced textbfAutomatic textbfCapability textbfEvaluator for textbfMultimodal textbfMedical textbfModels.<n>It first utilizes a branch-merge architecture to provide both detailed analysis and a concise final score based on standard medical evaluation criteria.
arXiv Detail & Related papers (2024-12-16T05:15:43Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce textbfAI Hospital, a framework simulating dynamic medical interactions between emphDoctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - MedAlign: A Clinician-Generated Dataset for Instruction Following with
Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z) - Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese
Medical Exam Dataset [31.047827145874844]
We introduce CMExam, sourced from the Chinese National Medical Licensing Examination.
CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluations, as well as solution explanations for model reasoning evaluation in an open-ended manner.
For in-depth analyses of LLMs, we invited medical professionals to label five additional question-wise annotations, including disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels.
arXiv Detail & Related papers (2023-06-05T16:48:41Z) - Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z) - Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of
Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named recognition (NER) and apply it to solve a low-resource and real-world challenge of code-mixed (Spanish-Catalan) clinical notes de-identification in the stroke.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.