COPE: Chain-Of-Thought Prediction Engine for Open-Source Large Language Model Based Stroke Outcome Prediction from Clinical Notes
- URL: http://arxiv.org/abs/2512.02499v1
- Date: Tue, 02 Dec 2025 07:44:20 GMT
- Title: COPE: Chain-Of-Thought Prediction Engine for Open-Source Large Language Model Based Stroke Outcome Prediction from Clinical Notes
- Authors: Yongkai Liu, Helena Feng, Bin Jiang, Yixin Wang, Max Wintermark, David S. Liebeskind, Michael Moseley, Maarten Lansberg, Gregory Albers, Jeremy Heit, Greg Zaharchuk
- Abstract summary: Chain-of-Thought (CoT) Outcome Prediction Engine (COPE) is a reasoning-enhanced large language model framework for predicting outcomes from unstructured clinical notes. This study included 464 acute ischemic stroke (AIS) patients with discharge summaries and 90-day modified Rankin Scale (mRS) scores. COPE achieved an MAE of 1.01 (95% CI 0.92-1.11), +/-1 accuracy of 74.4% (69.9, 78.8%), and exact accuracy of 32.8% (28.0, 37.6%).
- Score: 23.044580867637105
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predicting outcomes in acute ischemic stroke (AIS) guides clinical decision-making, patient counseling, and resource allocation. Clinical notes contain rich contextual information, but their unstructured nature limits their use in traditional predictive models. We developed and evaluated the Chain-of-Thought (CoT) Outcome Prediction Engine (COPE), a reasoning-enhanced large language model framework, for predicting 90-day functional outcomes after AIS from unstructured clinical notes. This study included 464 AIS patients with discharge summaries and 90-day modified Rankin Scale (mRS) scores. COPE uses a two-step CoT framework based on sequential open-source LLaMA-3-8B models: the first generates clinical reasoning, and the second outputs an mRS prediction. We compared COPE with GPT-4.1, ClinicalBERT, a structured variable-based machine learning model (Clinical ML), and a single-step LLM without CoT. Performance was evaluated using mean absolute error (MAE), accuracy within +/-1 mRS point, and exact accuracy. COPE achieved an MAE of 1.01 (95% CI 0.92-1.11), +/-1 accuracy of 74.4% (69.9, 78.8%), and exact accuracy of 32.8% (28.0, 37.6%), comparable to GPT-4.1 and superior to ClinicalBERT [MAE 1.24 (1.13-1.36)], Clinical ML [1.28 (1.18-1.39)], and the single-step LLM [1.20 (1.09-1.33)]. Subgroup analyses showed consistent performance across sex and age, with slightly higher error among older patients, those undergoing thrombectomy, and those with longer summaries. These findings demonstrate that COPE, a lightweight, interpretable, and privacy-preserving open-source framework, provides an accurate and practical solution for outcome prediction from unstructured clinical text.
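The three evaluation metrics named in the abstract (MAE, accuracy within +/-1 mRS point, and exact accuracy) are straightforward to compute from integer mRS scores (0-6). The sketch below is illustrative only; the scores used are invented, not the paper's data.

```python
def mrs_metrics(y_true, y_pred):
    """Compute MAE, +/-1 accuracy, and exact accuracy for mRS predictions.

    y_true, y_pred: equal-length sequences of integer mRS scores (0-6).
    Returns (mae, within_1_accuracy, exact_accuracy).
    """
    n = len(y_true)
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    mae = sum(abs_err) / n                 # mean absolute error
    within_1 = sum(e <= 1 for e in abs_err) / n  # off by at most 1 point
    exact = sum(e == 0 for e in abs_err) / n     # exact match rate
    return mae, within_1, exact

if __name__ == "__main__":
    # Hypothetical example: 5 patients.
    y_true = [0, 2, 3, 4, 6]
    y_pred = [1, 2, 5, 4, 5]
    print(mrs_metrics(y_true, y_pred))  # (0.8, 0.8, 0.4)
```

Note that +/-1 accuracy is always at least as large as exact accuracy, since an exact match is also within one point; this matches the ordering of the reported results (74.4% vs 32.8%).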
Related papers
- Explainable Admission-Level Predictive Modeling for Prolonged Hospital Stay in Elderly Populations: Challenges in Low- and Middle-Income Countries [65.4286079244589]
Prolonged length of stay (pLoS) is a significant factor associated with the risk of adverse in-hospital events. We develop and explain a predictive model for pLoS using admission-level patient and hospital administrative data.
arXiv Detail & Related papers (2026-01-07T23:35:24Z) - Knowledge Graph Augmented Large Language Models for Disease Prediction [24.992170033802537]
Knowledge graph (KG)-guided chain-of-thought (CoT) framework generates clinically grounded reasoning for visit-level disease prediction in MIMIC-III. ICD-9 codes are mapped to PrimeKG, from which disease-relevant nodes and multi-hop reasoning paths are extracted and used as scaffolds for CoT generation. KG-guided models outperform strong classical baselines, achieving AUROC values of 0.66 to 0.70 and macro-AUPR values of 0.40 to 0.47. A blinded clinician evaluation shows consistent preference for KG-guided CoT explanations in clarity, relevance, and clinical correctness.
arXiv Detail & Related papers (2025-12-01T02:49:17Z) - Cost-Aware Prediction (CAP): An LLM-Enhanced Machine Learning Pipeline and Decision Support System for Heart Failure Mortality Prediction [2.6045133848979214]
This paper introduces a cost-aware prediction (CAP) framework that integrates cost-benefit analysis assisted by large language model (LLM) agents. We developed an ML model predicting 1-year mortality in patients with heart failure to identify those eligible for home care.
arXiv Detail & Related papers (2025-11-19T11:34:47Z) - Enhancing mortality prediction in cardiac arrest ICU patients through meta-modeling of structured clinical data from MIMIC-IV [0.0]
This study develops and evaluates machine learning models that integrate structured clinical data and unstructured information. We used LASSO and XGBoost for feature selection, followed by a logistic regression trained on the top features identified by both models. The final logistic regression model, which combined structured and textual input, achieved an AUC of 0.918, compared to 0.753 when using structured data alone, a relative improvement of 22%.
arXiv Detail & Related papers (2025-10-20T20:56:45Z) - AUTOCT: Automating Interpretable Clinical Trial Prediction with LLM Agents [47.640779069547534]
AutoCT is a novel framework that combines the reasoning capabilities of large language models with the explainability of classical machine learning. We show that AutoCT performs on par with or better than SOTA methods on clinical trial prediction tasks within only a limited number of self-refinement iterations.
arXiv Detail & Related papers (2025-06-04T11:50:55Z) - Predicting Length of Stay in Neurological ICU Patients Using Classical Machine Learning and Neural Network Models: A Benchmark Study on MIMIC-IV [49.1574468325115]
This study explores multiple ML approaches for predicting LOS in the ICU, specifically for patients with neurological diseases, based on the MIMIC-IV dataset. The evaluated models include classic ML algorithms (K-Nearest Neighbors, Random Forest, XGBoost, and CatBoost) and neural networks (LSTM, BERT, and Temporal Fusion Transformer).
arXiv Detail & Related papers (2025-05-23T14:06:42Z) - ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification [57.22053411719822]
ChestX-Reasoner is a radiology diagnosis MLLM designed to leverage process supervision mined directly from clinical reports. Our two-stage training framework combines supervised fine-tuning and reinforcement learning guided by process rewards to better align model reasoning with clinical standards.
arXiv Detail & Related papers (2025-04-29T16:48:23Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - Explainable AI for Mental Health Emergency Returns: Integrating LLMs with Predictive Modeling [2.466324275447403]
Emergency department (ED) returns for mental health conditions pose a major healthcare burden, with 24-27% of patients returning within 30 days. The goal is to assess whether integrating large language models (LLMs) with machine learning improves the predictive accuracy and clinical interpretability of ED mental health return risk models.
arXiv Detail & Related papers (2025-01-21T15:41:20Z) - Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval [61.70489848327436]
KARE is a novel framework that integrates knowledge graph (KG) community-level retrieval with large language model (LLM) reasoning. Extensive experiments demonstrate that KARE outperforms leading models by up to 10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and readmission predictions.
arXiv Detail & Related papers (2024-10-06T18:46:28Z) - SemioLLM: Evaluating Large Language Models for Diagnostic Reasoning from Unstructured Clinical Narratives in Epilepsy [45.2233252981348]
Large Language Models (LLMs) have been shown to encode clinical knowledge. We present SemioLLM, an evaluation framework that benchmarks 6 state-of-the-art models. We show that most LLMs are able to accurately and confidently generate probabilistic predictions of seizure onset zones in the brain.
arXiv Detail & Related papers (2024-07-03T11:02:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.