CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset
- URL: http://arxiv.org/abs/2503.06204v1
- Date: Sat, 08 Mar 2025 13:21:44 GMT
- Title: CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset
- Authors: Oriel Perets, Ofir Ben Shoham, Nir Grinberg, Nadav Rappoport
- Abstract summary: General-purpose GPT-4o attains the best performance in both the multiple-choice task and the open-ended task.
- Score: 0.807662398486908
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical benchmark datasets significantly contribute to developing Large Language Models (LLMs) for medical knowledge extraction, diagnosis, summarization, and other uses. Yet, current benchmarks are mainly derived from exam questions given to medical students or cases described in the medical literature, lacking the complexity of real-world patient cases that deviate from classic textbook abstractions. These include rare diseases, uncommon presentations of common diseases, and unexpected treatment responses. Here, we construct the Clinically Uncommon Patient Cases and Diagnoses Dataset (CUPCase) based on 3,562 real-world case reports from BMC, including diagnoses in open-ended textual format and as multiple-choice options with distractors. Using this dataset, we evaluate the ability of state-of-the-art LLMs, including both general-purpose and clinical LLMs, to identify and correctly diagnose a patient case, and test models' performance when only partial information about cases is available. Our findings show that general-purpose GPT-4o attains the best performance in both the multiple-choice task (average accuracy of 87.9%) and the open-ended task (BERTScore F1 of 0.764), outperforming several LLMs with a focus on the medical domain such as Meditron-70B and MedLM-Large. Moreover, GPT-4o was able to maintain 87% and 88% of its performance with only the first 20% of tokens of the case presentation in multiple-choice and free text, respectively, highlighting the potential of LLMs to aid in early diagnosis in real-world cases. CUPCase expands our ability to evaluate LLMs for clinical decision support in an open and reproducible manner.
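The abstract describes two evaluation settings, multiple-choice accuracy over distractor options and open-ended diagnosis scored with BERTScore F1, plus a partial-information variant that keeps only the first 20% of case tokens. Below is a minimal sketch of how such an evaluation loop could be wired up; the dataset field names (case_text, options, correct_idx, diagnosis) and the query_llm helper are hypothetical placeholders, not the authors' released code.

```python
# Hedged sketch of a CUPCase-style evaluation loop (not the authors' pipeline).
# Field names (case_text, options, correct_idx, diagnosis) and query_llm()
# are hypothetical placeholders for whatever dataset/model interface is used.
from bert_score import score as bert_score  # pip install bert-score


def truncate_fraction(text: str, fraction: float = 0.2) -> str:
    """Keep only the first `fraction` of whitespace tokens of a case presentation."""
    tokens = text.split()
    cutoff = max(1, int(len(tokens) * fraction))
    return " ".join(tokens[:cutoff])


def evaluate(cases, query_llm, partial_fraction=None):
    """Return (multiple-choice accuracy, mean open-ended BERTScore F1) over `cases`."""
    correct = 0
    predictions, references = [], []
    for case in cases:
        text = case["case_text"]
        if partial_fraction is not None:
            text = truncate_fraction(text, partial_fraction)

        # Multiple-choice task: choose among the correct diagnosis and distractors.
        mc_prompt = (
            f"Case: {text}\nOptions: {case['options']}\n"
            "Answer with the index of the most likely diagnosis."
        )
        if query_llm(mc_prompt).strip() == str(case["correct_idx"]):
            correct += 1

        # Open-ended task: free-text diagnosis scored against the reference diagnosis.
        predictions.append(query_llm(f"Case: {text}\nWhat is the most likely diagnosis?"))
        references.append(case["diagnosis"])

    _, _, f1 = bert_score(predictions, references, lang="en")
    return correct / len(cases), f1.mean().item()
```

Calling evaluate(cases, query_llm, partial_fraction=0.2) would correspond to the first-20%-of-tokens setting reported in the abstract, under these assumptions.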
Related papers
- ClinBench-HPB: A Clinical Benchmark for Evaluating LLMs in Hepato-Pancreato-Biliary Diseases [22.49042417871628]
Hepato-pancreato-biliary (HPB) disorders represent a global public health challenge due to their high morbidity and mortality. Although large language models (LLMs) have shown promising performance in general medical question-answering tasks, the current evaluation benchmarks are mostly derived from standardized examinations or manually designed questions. We systematically establish an HPB disease evaluation benchmark comprising 3,535 closed-ended multiple-choice questions and 337 open-ended real diagnosis cases.
arXiv Detail & Related papers (2025-05-30T11:35:05Z) - CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports [4.477840500181267]
We introduce CaseReportBench, an expert-annotated dataset for dense information extraction of case reports, focusing on IEMs. We assess various models and prompting strategies, introducing novel approaches such as category-specific prompting and subheading-filtered data integration. Our clinician evaluations show that LLMs can extract clinically relevant details from case reports, supporting rare disease diagnosis and management.
arXiv Detail & Related papers (2025-05-22T20:21:32Z) - MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports [49.00805568780791]
We introduce MedCaseReasoning, the first open-access dataset for evaluating Large Language Models (LLMs) on their ability to align with clinician-authored diagnostic reasoning. The dataset includes 14,489 diagnostic question-and-answer cases, each paired with detailed reasoning statements. We evaluate state-of-the-art reasoning LLMs on MedCaseReasoning and find significant shortcomings in their diagnoses and reasoning.
arXiv Detail & Related papers (2025-05-16T22:34:36Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references.
We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey.
Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.
We propose a novel approach utilizing structured medical reasoning.
Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z) - Language Models And A Second Opinion Use Case: The Pocket Professional [0.0]
This research tests the role of Large Language Models (LLMs) as formal second opinion tools in professional decision-making.
The work analyzed 183 challenging medical cases from Medscape over a 20-month period, testing multiple LLMs' performance against crowd-sourced physician responses.
arXiv Detail & Related papers (2024-10-27T23:48:47Z) - CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios [50.032101237019205]
CliMedBench is a comprehensive benchmark with 14 expert-guided core clinical scenarios.
The reliability of this benchmark has been confirmed in several ways.
arXiv Detail & Related papers (2024-10-04T15:15:36Z) - Assessing and Enhancing Large Language Models in Rare Disease Question-answering [64.32570472692187]
We introduce a rare disease question-answering (ReDis-QA) dataset to evaluate the performance of Large Language Models (LLMs) in diagnosing rare diseases.
We collected 1,360 high-quality question-answer pairs within the ReDis-QA dataset, covering 205 rare diseases.
We then benchmarked several open-source LLMs, revealing that diagnosing rare diseases remains a significant challenge for these models.
Experiment results demonstrate that ReCOP can effectively improve the accuracy of LLMs on the ReDis-QA dataset by an average of 8%.
arXiv Detail & Related papers (2024-08-15T21:09:09Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - SemioLLM: Assessing Large Language Models for Semiological Analysis in Epilepsy Research [45.2233252981348]
Large Language Models have shown promising results in their ability to encode general medical knowledge.
We test the ability of state-of-the-art LLMs to leverage their internal knowledge and reasoning for epilepsy diagnosis.
arXiv Detail & Related papers (2024-07-03T11:02:12Z) - Large Language Models in the Clinic: A Comprehensive Benchmark [63.21278434331952]
We build a benchmark, ClinicBench, to better understand large language models (LLMs) in the clinic.
We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks.
We then construct six novel datasets and clinical tasks that are complex but common in real-world practice.
We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings.
arXiv Detail & Related papers (2024-04-25T15:51:06Z) - PRISM: Patient Records Interpretation for Semantic Clinical Trial Matching using Large Language Models [4.438101430231511]
We present the first end-to-end, large-scale empirical evaluation of clinical trial matching using real-world EHRs.
Our study showcases the capability of LLMs to accurately match patients with appropriate clinical trials.
arXiv Detail & Related papers (2024-04-23T22:33:19Z) - LongHealth: A Question Answering Benchmark with Long Clinical Documents [36.05587855811346]
We present the LongHealth benchmark, comprising 20 detailed fictional patient cases across various diseases.
The benchmark challenges LLMs with 400 multiple-choice questions in three categories: information extraction, negation, and sorting.
We evaluated nine open-source LLMs with a minimum context length of 16,000 tokens and also included OpenAI's proprietary and cost-efficient GPT-3.5 Turbo for comparison.
arXiv Detail & Related papers (2024-01-25T19:57:00Z) - Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization [8.456700096020601]
Large language models (LLMs) have shown promise in natural language processing (NLP), but their effectiveness on a diverse range of clinical summarization tasks remains unproven.
In this study, we apply adaptation methods to eight LLMs, spanning four distinct clinical summarization tasks.
A clinical reader study with ten physicians evaluates summary completeness, correctness, and conciseness; in a majority of cases, summaries from our best adapted LLMs are deemed either equivalent (45%) or superior (36%) to summaries from medical experts.
arXiv Detail & Related papers (2023-09-14T05:15:01Z)