Leveraging LLMs for Predicting Unknown Diagnoses from Clinical Notes
- URL: http://arxiv.org/abs/2503.22092v1
- Date: Fri, 28 Mar 2025 02:15:57 GMT
- Title: Leveraging LLMs for Predicting Unknown Diagnoses from Clinical Notes
- Authors: Dina Albassam, Adam Cross, Chengxiang Zhai
- Abstract summary: Discharge summaries tend to provide more complete information, which can help infer accurate diagnoses. This study investigates whether large language models (LLMs) can predict implicitly mentioned diagnoses from clinical notes and link them to corresponding medications.
- Score: 21.43498764977656
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Electronic Health Records (EHRs) often lack explicit links between medications and diagnoses, making clinical decision-making and research more difficult. Even when links exist, diagnosis lists may be incomplete, especially during early patient visits. Discharge summaries tend to provide more complete information, which can help infer accurate diagnoses, especially with the help of large language models (LLMs). This study investigates whether LLMs can predict implicitly mentioned diagnoses from clinical notes and link them to corresponding medications. We address two research questions: (1) Does majority voting across diverse LLM configurations outperform the best single configuration in diagnosis prediction? (2) How sensitive is majority-voting accuracy to LLM hyperparameters such as temperature, top-p, and summary length? To evaluate, we created a new dataset of 240 expert-annotated medication-diagnosis pairs from 20 MIMIC-IV notes. Using GPT-3.5 Turbo, we ran 18 prompting configurations across short and long summary lengths, generating 8568 test cases. Results show that majority voting achieved 75 percent accuracy, outperforming the best single configuration at 66 percent. No single hyperparameter setting dominated, but combining deterministic, balanced, and exploratory strategies improved performance. Shorter summaries generally led to higher accuracy. In conclusion, ensemble-style majority voting with diverse LLM configurations improves diagnosis prediction in EHRs and offers a promising method to link medications and diagnoses in clinical texts.
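As a rough illustration of the voting scheme described in the abstract (a minimal sketch, not the authors' code), the snippet below queries one hypothetical LLM call per sampling configuration and returns the most frequent diagnosis. The three configurations mirror the deterministic, balanced, and exploratory strategies mentioned above; the parameter values and the `query_llm` interface are illustrative assumptions, not values from the paper.

```python
from collections import Counter
from typing import Callable

# Illustrative sampling configurations mirroring the deterministic,
# balanced, and exploratory strategies described in the abstract;
# the exact values used in the paper are not reproduced here.
CONFIGS = [
    {"temperature": 0.0, "top_p": 1.0},  # deterministic
    {"temperature": 0.7, "top_p": 0.9},  # balanced
    {"temperature": 1.0, "top_p": 1.0},  # exploratory
]


def predict_diagnosis(
    note: str,
    medication: str,
    query_llm: Callable[..., str],  # stand-in for a GPT-3.5 Turbo call
) -> str:
    """Query every configuration and return the majority-vote diagnosis."""
    prompt = (
        f"Clinical note:\n{note}\n\n"
        f"Which diagnosis is this medication treating: {medication}?"
    )
    votes = [query_llm(prompt, **cfg) for cfg in CONFIGS]
    # most_common(1) yields [(answer, count)] for the top answer;
    # ties resolve to the answer encountered first.
    return Counter(votes).most_common(1)[0][0]
```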
Related papers
- ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification [57.22053411719822]
ChestX-Reasoner is a radiology diagnosis MLLM designed to leverage process supervision mined directly from clinical reports.
Our two-stage training framework combines supervised fine-tuning and reinforcement learning guided by process rewards to better align model reasoning with clinical standards.
arXiv Detail & Related papers (2025-04-29T16:48:23Z) - Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.
We propose a novel approach utilizing structured medical reasoning.
Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z) - The Reliability of LLMs for Medical Diagnosis: An Examination of Consistency, Manipulation, and Contextual Awareness [0.0]
Large Language Models (LLMs) offer promise for democratizing healthcare with advanced diagnostics.
This study assesses their diagnostic reliability focusing on consistency, manipulation resilience, and contextual integration.
LLMs' vulnerability to manipulation and limited contextual awareness pose challenges in clinical use.
arXiv Detail & Related papers (2025-03-02T11:50:16Z) - Language Models And A Second Opinion Use Case: The Pocket Professional [0.0]
This research tests the role of Large Language Models (LLMs) as formal second opinion tools in professional decision-making.
The work analyzed 183 challenging medical cases from Medscape over a 20-month period, testing multiple LLMs' performance against crowd-sourced physician responses.
arXiv Detail & Related papers (2024-10-27T23:48:47Z) - Edinburgh Clinical NLP at MEDIQA-CORR 2024: Guiding Large Language Models with Hints [8.547853819087043]
We evaluate the capability of general LLMs to identify and correct medical errors with multiple prompting strategies.
We propose incorporating error-span predictions from a smaller, fine-tuned model in two ways.
Our best-performing solution with 8-shot + CoT + hints ranked sixth on the shared task leaderboard.
arXiv Detail & Related papers (2024-05-28T10:20:29Z) - Large Language Models in the Clinic: A Comprehensive Benchmark [63.21278434331952]
We build a benchmark, ClinicBench, to better understand large language models (LLMs) in the clinic.
We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks.
We then construct six novel datasets and clinical tasks that are complex but common in real-world practice.
We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings.
arXiv Detail & Related papers (2024-04-25T15:51:06Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between a Doctor, as the player, and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - Combining Insights From Multiple Large Language Models Improves Diagnostic Accuracy [0.0]
Large language models (LLMs) are proposed as viable diagnostic support tools or even spoken of as replacements for "curbside consults".
We assessed and compared the accuracy of differential diagnoses obtained by asking individual commercial LLMs against the accuracy of differential diagnoses synthesized by aggregating responses from combinations of the same LLMs.
arXiv Detail & Related papers (2024-02-13T21:24:21Z) - Enhancing Diagnostic Accuracy through Multi-Agent Conversations: Using Large Language Models to Mitigate Cognitive Bias [5.421033429862095]
Cognitive biases in clinical decision-making significantly contribute to errors in diagnosis and suboptimal patient outcomes.
This study explores the role of large language models in mitigating these biases through the utilization of a multi-agent framework.
arXiv Detail & Related papers (2024-01-26T01:35:50Z) - MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z) - SPeC: A Soft Prompt-Based Calibration on Performance Variability of Large Language Model in Clinical Notes Summarization [50.01382938451978]
We introduce a model-agnostic pipeline that employs soft prompts to diminish variance while preserving the advantages of prompt-based summarization.
Experimental findings indicate that our method not only bolsters performance but also effectively curbs variance for various language models.
arXiv Detail & Related papers (2023-03-23T04:47:46Z)