Evaluating Hierarchical Clinical Document Classification Using Reasoning-Based LLMs
- URL: http://arxiv.org/abs/2507.03001v1
- Date: Wed, 02 Jul 2025 00:53:54 GMT
- Title: Evaluating Hierarchical Clinical Document Classification Using Reasoning-Based LLMs
- Authors: Akram Mustafa, Usman Naseem, Mostafa Rahimi Azghadi
- Abstract summary: This study evaluates how well large language models (LLMs) can classify ICD-10 codes from hospital discharge summaries. Reasoning-based models generally outperformed non-reasoning ones, with Gemini 2.5 Pro performing best overall.
- Score: 7.026393789313748
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study evaluates how well large language models (LLMs) can classify ICD-10 codes from hospital discharge summaries, a critical but error-prone task in healthcare. Using 1,500 summaries from the MIMIC-IV dataset and focusing on the 10 most frequent ICD-10 codes, the study tested 11 LLMs, including models with and without structured reasoning capabilities. Medical terms were extracted using a clinical NLP tool (cTAKES), and models were prompted in a consistent, coder-like format. None of the models achieved an F1 score above 57%, with performance dropping as code specificity increased. Reasoning-based models generally outperformed non-reasoning ones, with Gemini 2.5 Pro performing best overall. Some codes, such as those related to chronic heart disease, were classified more accurately than others. The findings suggest that while LLMs can assist human coders, they are not yet reliable enough for full automation. Future work should explore hybrid methods, domain-specific model training, and the use of structured clinical data.
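The abstract describes prompting models "in a consistent, coder-like format" using terms extracted by cTAKES. A minimal sketch of what such a prompt builder might look like; the wording, candidate code list, and function names are illustrative assumptions, not the paper's actual template:

```python
# Hypothetical sketch of a "consistent, coder-like" prompt built from
# clinically extracted terms; the exact format used in the study is not given.

CANDIDATE_CODES = {
    "I25.10": "Atherosclerotic heart disease of native coronary artery",
    "E11.9": "Type 2 diabetes mellitus without complications",
    "I10": "Essential (primary) hypertension",
}

def build_prompt(extracted_terms: list[str]) -> str:
    """Assemble a prompt asking the model to select applicable ICD-10 codes."""
    code_lines = "\n".join(f"- {c}: {d}" for c, d in CANDIDATE_CODES.items())
    terms = ", ".join(extracted_terms)
    return (
        "You are a clinical coder. Given the extracted medical terms below,\n"
        "return every applicable ICD-10 code from the candidate list.\n\n"
        f"Candidate codes:\n{code_lines}\n\n"
        f"Extracted terms: {terms}\n"
        "Answer with a comma-separated list of codes."
    )

prompt = build_prompt(["coronary atherosclerosis", "hypertension"])
print(prompt)
```

Holding the prompt structure fixed across all 11 models, as the study does, keeps the comparison about the models' reasoning rather than prompt engineering.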
Related papers
- In-Context Learning for Label-Efficient Cancer Image Classification in Oncology [1.741659712094955]
In-context learning (ICL) is a pragmatic alternative to model retraining for domain-specific diagnostic tasks. We evaluated the performance of four vision-language models (VLMs): Paligemma, CLIP, ALIGN, and GPT-4o. ICL demonstrated competitive gains despite the models' smaller size, suggesting feasibility for deployment in compute-constrained clinical environments.
arXiv Detail & Related papers (2025-05-08T20:49:01Z)
- Can Reasoning LLMs Enhance Clinical Document Classification? [7.026393789313748]
Large Language Models (LLMs) offer promising improvements in accuracy and efficiency for this task. This study evaluates the performance and consistency of eight LLMs: four reasoning models (Qwen QWQ, Deepseek Reasoner, GPT o3 Mini, Gemini 2.0 Flash Thinking) and four non-reasoning models (Llama 3.3, GPT 4o Mini, Gemini 2.0 Flash, Deepseek Chat). Results showed that reasoning models outperformed non-reasoning models in accuracy (71% vs 68%) and F1 score (67% vs 60%).
arXiv Detail & Related papers (2025-04-10T18:00:27Z)
- Can GPT-3.5 Generate and Code Discharge Summaries? [45.633849969788315]
We generated and coded 9,606 discharge summaries based on lists of ICD-10 code descriptions.
Neural coding models were trained on baseline and augmented data.
We report micro- and macro-F1 scores on the full codeset, generation codes, and their families.
arXiv Detail & Related papers (2024-01-24T15:10:13Z)
- Automated clinical coding using off-the-shelf large language models [10.365958121087305]
The task of assigning diagnostic ICD codes to patient hospital admissions is typically performed by expert human coders.
Efforts towards automated ICD coding are dominated by supervised deep learning models.
In this work, we leverage off-the-shelf pre-trained generative large language models to develop a practical solution.
arXiv Detail & Related papers (2023-10-10T11:56:48Z)
- Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study [60.56194508762205]
We reproduce, compare, and analyze state-of-the-art automated medical coding machine learning models.
We show that several models underperform due to weak configurations, poorly sampled train-test splits, and insufficient evaluation.
We present the first comprehensive results on the newly released MIMIC-IV dataset using the reproduced models.
arXiv Detail & Related papers (2023-04-21T11:54:44Z)
- Do We Still Need Clinical Language Models? [15.023633270864675]
We show that relatively small specialized clinical models substantially outperform all in-context learning approaches.
We release the code and the models used under the PhysioNet Credentialed Health Data license and data use agreement.
arXiv Detail & Related papers (2023-02-16T05:08:34Z)
- ICDBigBird: A Contextual Embedding Model for ICD Code Classification [71.58299917476195]
Contextual word embedding models have achieved state-of-the-art results in multiple NLP tasks.
ICDBigBird is a BigBird-based model which can integrate a Graph Convolutional Network (GCN).
Our experiments on a real-world clinical dataset demonstrate the effectiveness of our BigBird-based model on the ICD classification task.
arXiv Detail & Related papers (2022-04-21T20:59:56Z)
- A Meta-embedding-based Ensemble Approach for ICD Coding Prediction [64.42386426730695]
International Classification of Diseases (ICD) codes are the de facto standard used globally for clinical coding.
These codes enable healthcare providers to claim reimbursement and facilitate efficient storage and retrieval of diagnostic information.
Our proposed approach enhances the performance of neural models by effectively training word vectors using routine medical data as well as external knowledge from scientific articles.
arXiv Detail & Related papers (2021-02-26T17:49:58Z)
- Collaborative residual learners for automatic icd10 prediction using prescribed medications [45.82374977939355]
We propose a novel collaborative residual learning based model to automatically predict ICD10 codes using only prescription data.
We obtain multi-label classification performance of 0.71 and 0.57 average precision, 0.57 and 0.38 F1-score, and 0.73 and 0.44 accuracy in predicting the principal diagnosis for the inpatient and outpatient datasets, respectively.
arXiv Detail & Related papers (2020-12-16T07:07:27Z)
- Ensemble model for pre-discharge icd10 coding prediction [45.82374977939355]
We propose an ensemble model incorporating multiple clinical data sources for accurate code predictions.
We obtain multi-label classification performance of 0.73 and 0.58 average precision, 0.56 and 0.35 F1-score, and 0.71 and 0.40 accuracy in predicting the principal diagnosis for the inpatient and outpatient datasets, respectively.
arXiv Detail & Related papers (2020-12-16T07:02:56Z)
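Several entries above report micro- and macro-averaged F1 over multi-label code sets. A toy, stdlib-only sketch of how the two averages differ; the code set and predictions are invented for illustration, not taken from any of the papers:

```python
# Micro-F1 pools true/false positives and false negatives across all codes;
# macro-F1 averages per-code F1, so rare codes weigh as much as frequent ones.

def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Gold and predicted ICD-10 code sets per discharge summary (invented data).
gold = [{"I10", "E11.9"}, {"I10"},          {"I25.10"}]
pred = [{"I10"},          {"I10", "E11.9"}, {"I25.10"}]

codes = set().union(*gold, *pred)
counts = {c: [0, 0, 0] for c in codes}  # per-code [tp, fp, fn]
for g, p in zip(gold, pred):
    for c in codes:
        if c in g and c in p:
            counts[c][0] += 1  # true positive
        elif c in p:
            counts[c][1] += 1  # false positive
        elif c in g:
            counts[c][2] += 1  # false negative

micro_f1 = f1(*map(sum, zip(*counts.values())))
macro_f1 = sum(f1(*counts[c]) for c in codes) / len(codes)
print(f"micro-F1 = {micro_f1:.4f}, macro-F1 = {macro_f1:.4f}")
```

On this toy data the rarely predicted E11.9 drags macro-F1 well below micro-F1, which is why papers on skewed ICD code distributions typically report both.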
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.