MIMIC-RD: Can LLMs differentially diagnose rare diseases in real-world clinical settings?
- URL: http://arxiv.org/abs/2601.11559v1
- Date: Thu, 18 Dec 2025 05:03:49 GMT
- Title: MIMIC-RD: Can LLMs differentially diagnose rare diseases in real-world clinical settings?
- Authors: Zilal Eiz AlDin, John Wu, Jeffrey Paul Fung, Jennifer King, Mya Watts, Lauren ONeill, Adam Richard Cross, Jimeng Sun,
- Abstract summary: MIMIC-RD is a rare disease differential diagnosis benchmark constructed by directly mapping clinical text entities to Orphanet.<n>We evaluated various models on our dataset of 145 patients and found that current state-of-the-art LLMs perform poorly on rare disease differential diagnosis.
- Score: 11.16221457206157
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite rare diseases affecting 1 in 10 Americans, their differential diagnosis remains challenging. Due to their impressive recall abilities, large language models (LLMs) have been recently explored for differential diagnosis. Existing approaches to evaluating LLM-based rare disease diagnosis suffer from two critical limitations: they rely on idealized clinical case studies that fail to capture real-world clinical complexity, or they use ICD codes as disease labels, which significantly undercounts rare diseases since many lack direct mappings to comprehensive rare disease databases like Orphanet. To address these limitations, we explore MIMIC-RD, a rare disease differential diagnosis benchmark constructed by directly mapping clinical text entities to Orphanet. Our methodology involved an initial LLM-based mining process followed by validation from four medical annotators to confirm identified entities were genuine rare diseases. We evaluated various models on our dataset of 145 patients and found that current state-of-the-art LLMs perform poorly on rare disease differential diagnosis, highlighting the substantial gap between existing capabilities and clinical needs. From our findings, we outline several future steps towards improving differential diagnosis of rare diseases.
Related papers
- ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning [58.01333341218153]
We propose ClinDEF, a dynamic framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues.<n>Our method generates patient cases and facilitates multi-turn interactions between an LLM-based doctor and an automated patient agent.<n>Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs.
arXiv Detail & Related papers (2025-12-29T12:58:58Z) - Evaluating Large Language Models on Rare Disease Diagnosis: A Case Study using House M.D [6.480805458549629]
We introduce a novel dataset of 176 symptom-diagnosis pairs extracted from House M.D.<n>We evaluate four state-of-the-art large language models (LLMs) on narrative-based diagnostic reasoning tasks.<n>Results show significant variation in performance, ranging from 16.48% to 38.64% accuracy, with newer model generations demonstrating a 2.3 times improvement.
arXiv Detail & Related papers (2025-11-14T02:54:58Z) - Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models [51.91760712805404]
We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs)<n>Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a (oral) examination in medical training.<n>Our analysis identified several failure modes that mirror common cognitive errors in clinical practice.
arXiv Detail & Related papers (2025-10-11T16:24:35Z) - An Agentic System for Rare Disease Diagnosis with Traceable Reasoning [69.46279475491164]
We introduce DeepRare, the first rare disease diagnosis agentic system powered by a large language model (LLM)<n>DeepRare generates ranked diagnostic hypotheses for rare diseases, each accompanied by a transparent chain of reasoning.<n>The system demonstrates exceptional diagnostic performance among 2,919 diseases, achieving 100% accuracy for 1013 diseases.
arXiv Detail & Related papers (2025-06-25T13:42:26Z) - Decoding Rarity: Large Language Models in the Diagnosis of Rare Diseases [1.9662978733004604]
Large language models (LLMs) have shown promising capabilities in transforming rare disease research.<n>This paper explores the integration of LLMs in the analysis of rare diseases, highlighting significant strides and pivotal studies.
arXiv Detail & Related papers (2025-05-18T15:42:15Z) - Improving Interactive Diagnostic Ability of a Large Language Model Agent Through Clinical Experience Learning [17.647875658030006]
This study investigates the underlying mechanisms behind the performance degradation phenomenon.<n>We developed a plug-and-play method enhanced (PPME) LLM agent, leveraging over 3.5 million electronic medical records from Chinese and American healthcare facilities.<n>Our approach integrates specialized models for initial disease diagnosis and inquiry into the history of the present illness, trained through supervised and reinforcement learning techniques.
arXiv Detail & Related papers (2025-02-24T06:24:20Z) - RareAgents: Autonomous Multi-disciplinary Team for Rare Disease Diagnosis and Treatment [17.58261171394619]
Rare diseases collectively impact around 300 million people worldwide due to the vast number of diseases.<n>Recent agents powered by large language models (LLMs) have demonstrated notable applications across various domains.<n>RareAgents integrates advanced Multidisciplinary Team (MDT) coordination, memory mechanisms, and medical tools utilization, leveraging Llama-3.1-8B/70B as the base model.
arXiv Detail & Related papers (2024-12-17T02:22:24Z) - Assessing and Enhancing Large Language Models in Rare Disease Question-answering [64.32570472692187]
We introduce a rare disease question-answering (ReDis-QA) dataset to evaluate the performance of Large Language Models (LLMs) in diagnosing rare diseases.
We collected 1360 high-quality question-answer pairs within the ReDis-QA dataset, covering 205 rare diseases.
We then benchmarked several open-source LLMs, revealing that diagnosing rare diseases remains a significant challenge for these models.
Experiment results demonstrate that ReCOP can effectively improve the accuracy of LLMs on the ReDis-QA dataset by an average of 8%.
arXiv Detail & Related papers (2024-08-15T21:09:09Z) - RareBench: Can LLMs Serve as Rare Diseases Specialists? [11.828142771893443]
Generalist Large Language Models (LLMs) have shown considerable promise in various domains, including medical diagnosis.
Rare diseases, affecting approximately 300 million people worldwide, often have unsatisfactory clinical diagnosis rates.
RareBench is a pioneering benchmark designed to evaluate the capabilities of LLMs on 4 critical dimensions within the realm of rare diseases.
We present an exhaustive comparative study of GPT-4's diagnostic capabilities against those of specialist physicians.
arXiv Detail & Related papers (2024-02-09T11:34:16Z) - Expert Uncertainty and Severity Aware Chest X-Ray Classification by
Multi-Relationship Graph Learning [48.29204631769816]
We re-extract disease labels from CXR reports to make them more realistic by considering disease severity and uncertainty in classification.
Our experimental results show that models considering disease severity and uncertainty outperform previous state-of-the-art methods.
arXiv Detail & Related papers (2023-09-06T19:19:41Z) - Inheritance-guided Hierarchical Assignment for Clinical Automatic
Diagnosis [50.15205065710629]
Clinical diagnosis, which aims to assign diagnosis codes for a patient based on the clinical note, plays an essential role in clinical decision-making.
We propose a novel framework to combine the inheritance-guided hierarchical assignment and co-occurrence graph propagation for clinical automatic diagnosis.
arXiv Detail & Related papers (2021-01-27T13:16:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.