Related papers: EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning

URL: http://arxiv.org/abs/2601.03471v1
Date: Tue, 06 Jan 2026 23:49:10 GMT
Title: EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning
Authors: Mingyang Wei, Dehai Min, Zewen Liu, Yuzhang Xie, Guanchen Wu, Carl Yang, Max S. Y. Lau, Qi He, Lu Cheng, Wei Jin
Abstract summary: EpiQAL is the first diagnostic benchmark for epidemiological question answering across diverse diseases. Construction combines expert-designed taxonomy guidance, multi-model verification, and retrieval-based difficulty control.
Score: 24.283535906312448
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention effects at the population level. Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference. We present EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets built from open-access literature. The subsets respectively evaluate text-grounded factual recall, multi-step inference linking document evidence with epidemiological principles, and conclusion reconstruction with the Discussion section withheld. Construction combines expert-designed taxonomy guidance, multi-model verification, and retrieval-based difficulty control. Experiments on ten open models reveal that current LLMs show limited performance on epidemiological reasoning, with multi-step inference posing the greatest challenge. Model rankings shift across subsets, and scale alone does not predict success. Chain-of-Thought prompting benefits multi-step inference but yields mixed results elsewhere. EpiQAL provides fine-grained diagnostic signals for evidence grounding, inferential reasoning, and conclusion reconstruction.

Related papers

RE-MCDF: Closed-Loop Multi-Expert LLM Reasoning for Knowledge-Grounded Clinical Diagnosis [11.973474883672282]
We propose RE-MCDF, a relation-enhanced multi-expert clinical diagnosis framework. We show that RE-MCDF consistently outperforms state-of-the-art baselines in complex diagnostic scenarios.
arXiv Detail & Related papers (2026-02-01T15:53:27Z)
Making medical vision-language models think causally across modalities with retrieval-augmented cross-modal reasoning [16.243806723551454]
Medical vision-language models (VLMs) achieve strong performance in diagnostic reporting and image-text alignment. Their underlying reasoning mechanisms remain fundamentally correlational, exhibiting reliance on superficial statistical associations. We propose Multimodal Causal Retrieval-Augmented Generation, a framework that integrates causal inference principles with multimodal retrieval.
arXiv Detail & Related papers (2026-01-26T11:03:00Z)
Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models [48.95516224614331]
We introduce MedGaze-Bench, the first benchmark leveraging clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation. Our benchmark addresses three fundamental challenges: visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical, and implicit adherence to safety protocols.
arXiv Detail & Related papers (2026-01-11T02:20:40Z)
PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis [13.503111478218434]
PathFound is an agentic multimodal model designed to support evidence-seeking inference in pathological diagnosis. PathFound achieves state-of-the-art diagnostic performance across diverse clinical scenarios.
arXiv Detail & Related papers (2025-12-29T15:34:27Z)
Anatomy-R1: Enhancing Anatomy Reasoning in Multimodal Large Language Models via Anatomical Similarity Curriculum and Group Diversity Augmentation [52.7583577508452]
Multimodal Large Language Models (MLLMs) have achieved impressive progress in natural image reasoning. Their potential in medical imaging remains underexplored, especially in clinical anatomical surgical images. These challenges limit the effectiveness of conventional Supervised Fine-Tuning strategies.
arXiv Detail & Related papers (2025-12-22T16:06:36Z)
Skin-R1: Toward Trustworthy Clinical Reasoning for Dermatological Diagnosis [27.666376727163073]
SkinR1 is a novel dermatological vision-language model (VLM) that combines deep, textbook-based reasoning with the broad generalization capabilities of reinforcement learning (RL). SkinR1 systematically resolves the key challenges through a unified, end-to-end framework. First, we design a textbook-based reasoning generator that synthesizes high-fidelity, hierarchy-aware, and differential-diagnosis (DDx)-informed trajectories. Second, we leverage the constructed trajectories for supervised fine-tuning (SFT), empowering the model with grounded reasoning ability. Third, we develop a novel RL paradigm that, by incorporating the
arXiv Detail & Related papers (2025-11-18T20:38:36Z)
Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models [51.91760712805404]
We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs). Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a viva voce (oral) examination in medical training. Our analysis identified several failure modes that mirror common cognitive errors in clinical practice.
arXiv Detail & Related papers (2025-10-11T16:24:35Z)
RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis [56.373297358647655]
Retrieval-Augmented Diagnosis (RAD) is a novel framework that injects external knowledge into multimodal models directly on downstream tasks. RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss transformer, and a dual decoder.
arXiv Detail & Related papers (2025-09-24T10:36:14Z)
Revealing Multimodal Causality with Large Language Models [80.95511545591107]
We propose MLLM-CD, a novel framework for multimodal causal discovery from unstructured data. It consists of three key components: (1) a novel contrastive factor discovery module to identify genuine multimodal factors; (2) a statistical causal structure discovery module to infer causal relationships among discovered factors; and (3) an iterative multimodal counterfactual reasoning module to refine the discovery outcomes. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed MLLM-CD.
arXiv Detail & Related papers (2025-09-22T13:45:17Z)
Bridging the Gap in Ophthalmic AI: MM-Retinal-Reason Dataset and OphthaReason Model toward Dynamic Multimodal Reasoning [15.73558614478585]
We introduce MM-Retinal-Reason, the first ophthalmic multimodal dataset with the full spectrum of perception and reasoning. Building upon MM-Retinal-Reason, we propose OphthaReason, the first ophthalmology-specific multimodal reasoning model with step-by-step reasoning traces. Our model achieves state-of-the-art performance on both basic and complex reasoning tasks.
arXiv Detail & Related papers (2025-08-22T06:47:30Z)
Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
Large Language Models (LLMs) in medicine have enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies and test-time mechanisms.
arXiv Detail & Related papers (2025-08-01T14:41:31Z)
O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning [27.827761004918106]
This work explores the potential of inference-time scaling in large language models (LLMs) for medical reasoning tasks. With a modest training set of 500 samples, our model yields substantial performance improvements of 6%-11%.
arXiv Detail & Related papers (2025-01-11T07:10:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.