Practical Design and Benchmarking of Generative AI Applications for Surgical Billing and Coding
- URL: http://arxiv.org/abs/2501.05479v1
- Date: Tue, 07 Jan 2025 17:11:12 GMT
- Title: Practical Design and Benchmarking of Generative AI Applications for Surgical Billing and Coding
- Authors: John C. Rollman, Bruce Rogers, Hamed Zaribafzadeh, Daniel Buckland, Ursula Rogers, Jennifer Gagnon, Ozanan Meireles, Lindsay Jennings, Jim Bennett, Jennifer Nicholson, Nandan Lad, Linda Cendales, Andreas Seas, Alessandro Martinino, E. Shelley Hwang, Allan D. Kirk
- Abstract summary: We present a strategy for developing generative AI tools for medical billing and coding.
Our study shows that a small model that is fine-tuned on domain-specific data performs as well as the larger contemporary consumer models.
- Abstract: Background: Healthcare has many manual processes that can benefit from automation and augmentation with Generative Artificial Intelligence (AI); one such process is medical billing and coding. However, current foundational Large Language Models (LLMs) perform poorly when tasked with generating accurate International Classification of Diseases, 10th edition, Clinical Modification (ICD-10-CM) and Current Procedural Terminology (CPT) codes. Additionally, applying generative AI to healthcare raises many security and financial challenges. We present a strategy for developing generative AI tools in healthcare, specifically for medical billing and coding, that balances accuracy, accessibility, and patient privacy.
Methods: We fine-tune the Phi-3 Mini and Phi-3 Medium LLMs on institutional data and compare the results against the Phi-3 base model, a Phi-3 RAG application, and GPT-4o. We use the postoperative surgical report as input and the patient's billing claim, with its associated ICD-10, CPT, and Modifier codes, as the target output. Performance is measured by the accuracy of code generation, the proportion of invalid codes, and the fidelity of the billing claim format.
Results: Both fine-tuned models performed as well as or better than GPT-4o. The fine-tuned Phi-3 Medium model showed the best performance (ICD-10 Recall and Precision: 72%, 72%; CPT Recall and Precision: 77%, 79%; Modifier Recall and Precision: 63%, 64%). It fabricated only 1% of the ICD-10 codes and 0.6% of the CPT codes it generated.
Conclusions: Our study shows that a small model fine-tuned on domain-specific data for a specific task, using a simple set of open-source tools and minimal technological and monetary resources, performs as well as larger contemporary consumer models.
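As a concrete reading of the evaluation described above, the following is a minimal sketch of set-based recall/precision per code family together with an invalid-code ("fabrication") rate. The function and variable names are illustrative assumptions rather than the authors' code, and `valid_codes` stands in for an official ICD-10-CM or CPT code table.

```python
# Illustrative sketch: set-based recall/precision for one claim's codes,
# plus the fraction of generated codes absent from the official code table.

def code_metrics(predicted: list[str], actual: list[str], valid_codes: set[str]) -> dict:
    pred, gold = set(predicted), set(actual)
    tp = len(pred & gold)                                   # codes both generated and billed
    recall = tp / len(gold) if gold else 0.0
    precision = tp / len(pred) if pred else 0.0
    # "Fabricated" codes: generated strings that are not valid codes at all.
    invalid_rate = len(pred - valid_codes) / len(pred) if pred else 0.0
    return {"recall": recall, "precision": precision, "invalid_rate": invalid_rate}

# Hypothetical example for a single claim's ICD-10 codes:
print(code_metrics(["C50.911", "Z17.0", "XXX.99"],
                   ["C50.911", "Z17.0", "Z80.3"],
                   valid_codes={"C50.911", "Z17.0", "Z80.3"}))
```

In practice these per-claim values would be aggregated across the test set before being reported as the recall/precision figures quoted in the results.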
Related papers
- Unlocking Historical Clinical Trial Data with ALIGN: A Compositional Large Language Model System for Medical Coding [44.01429184037945]
We introduce ALIGN, a novel compositional LLM-based system for automated, zero-shot medical coding.
We evaluate ALIGN on harmonizing medication terms into Anatomical Therapeutic Chemical (ATC) and medical history terms into Medical Dictionary for Regulatory Activities (MedDRA) codes.
arXiv Detail & Related papers (2024-11-20T09:59:12Z)
- Improving ICD coding using Chapter based Named Entities and Attentional Models [0.0]
We introduce an enhanced approach to ICD coding that improves F1 scores by using chapter-based named entities and attentional models.
This method categorizes discharge summaries into ICD-9 Chapters and develops attentional models with chapter-specific data.
For categorization, we use Chapter-IV to de-bias and influence key entities and weights without neural networks.
arXiv Detail & Related papers (2024-07-24T12:34:23Z)
- Large language models are good medical coders, if provided with tools [0.0]
This study presents a novel two-stage Retrieve-Rank system for automated ICD-10-CM medical coding.
Both systems are evaluated on a dataset of 100 single-term medical conditions.
The Retrieve-Rank system achieved 100% accuracy in predicting correct ICD-10-CM codes.
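The summary above describes a two-stage pipeline; below is a hedged sketch of the general retrieve-then-rank pattern, with the embedding vectors, index contents, and LLM call (`llm_choose`) as placeholders rather than details from the paper.

```python
# Illustrative retrieve-then-rank pipeline for ICD-10-CM coding.
# Stage 1 retrieves candidate codes by embedding similarity;
# stage 2 asks an LLM (a placeholder callable here) to pick the best one.

import numpy as np

def retrieve(term_vec: np.ndarray, code_vecs: np.ndarray,
             codes: list[str], k: int = 10) -> list[str]:
    # Cosine similarity against a pre-embedded index of code descriptions.
    sims = code_vecs @ term_vec / (
        np.linalg.norm(code_vecs, axis=1) * np.linalg.norm(term_vec) + 1e-9)
    return [codes[i] for i in np.argsort(-sims)[:k]]

def rank(condition: str, candidates: list[str], llm_choose) -> str:
    # Stage 2: the LLM selects the single best code from the candidates.
    prompt = (f"Condition: {condition}\n"
              f"Candidates: {', '.join(candidates)}\n"
              "Best ICD-10-CM code:")
    return llm_choose(prompt)
```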
arXiv Detail & Related papers (2024-07-06T06:58:51Z)
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [54.05511925104712]
We propose a simple, effective, and data-efficient method called Step-DPO.
Step-DPO treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically.
Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters.
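For readers unfamiliar with the objective, here is a toy sketch of a DPO-style loss applied at the level of a single reasoning step. The log-probabilities are assumed to come from the policy and a frozen reference model; the formulation is the standard DPO loss, not the authors' exact implementation.

```python
import math

def step_dpo_loss(logp_win_policy: float, logp_lose_policy: float,
                  logp_win_ref: float, logp_lose_ref: float,
                  beta: float = 0.1) -> float:
    # DPO objective applied at the level of one reasoning step: the
    # preferred and dispreferred continuations differ only in that step.
    margin = (logp_win_policy - logp_win_ref) - (logp_lose_policy - logp_lose_ref)
    # -log(sigmoid(beta * margin)): small when the policy prefers the winner.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```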
arXiv Detail & Related papers (2024-06-26T17:43:06Z)
- Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
LLaVA-Rad inference is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z)
- Can GPT-3.5 Generate and Code Discharge Summaries? [45.633849969788315]
We generated and coded 9,606 discharge summaries based on lists of ICD-10 code descriptions.
Neural coding models were trained on baseline and augmented data.
We report micro- and macro-F1 scores on the full codeset, generation codes, and their families.
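Micro- and macro-F1 over a multi-label code set can be computed as in the following sketch (illustrative code, not from the paper): micro-F1 pools true/false positives across all codes, while macro-F1 averages per-code F1 scores.

```python
# Illustrative micro-/macro-F1 for multi-label code prediction.
from collections import Counter

def micro_macro_f1(preds: list[set[str]], golds: list[set[str]],
                   codeset: set[str]) -> tuple[float, float]:
    tp, fp, fn = Counter(), Counter(), Counter()
    for p, g in zip(preds, golds):
        for c in codeset:
            tp[c] += int(c in p and c in g)
            fp[c] += int(c in p and c not in g)
            fn[c] += int(c not in p and c in g)
    def f1(t: int, f_p: int, f_n: int) -> float:
        return 2 * t / (2 * t + f_p + f_n) if (2 * t + f_p + f_n) else 0.0
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in codeset) / len(codeset)
    return micro, macro
```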
arXiv Detail & Related papers (2024-01-24T15:10:13Z)
- Automated Medical Coding on MIMIC-III and MIMIC-IV: A Critical Review and Replicability Study [60.56194508762205]
We reproduce, compare, and analyze state-of-the-art automated medical coding machine learning models.
We show that several models underperform due to weak configurations, poorly sampled train-test splits, and insufficient evaluation.
We present the first comprehensive results on the newly released MIMIC-IV dataset using the reproduced models.
arXiv Detail & Related papers (2023-04-21T11:54:44Z)
- ICDBigBird: A Contextual Embedding Model for ICD Code Classification [71.58299917476195]
Contextual word embedding models have achieved state-of-the-art results in multiple NLP tasks.
ICDBigBird is a BigBird-based model that integrates a Graph Convolutional Network (GCN) to exploit relations among ICD codes.
Our experiments on a real-world clinical dataset demonstrate the effectiveness of our BigBird-based model on the ICD classification task.
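As background, a single graph-convolution layer of the kind ICDBigBird integrates can be sketched as follows (numpy, illustrative only; the paper's architectural details are not reproduced here), where nodes are ICD codes and edges encode relations between them.

```python
import numpy as np

def gcn_layer(adj: np.ndarray, features: np.ndarray,
              weights: np.ndarray) -> np.ndarray:
    # One GCN propagation step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).
    a_hat = adj + np.eye(adj.shape[0])                     # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))  # normalize degrees
    return np.maximum(0, d_inv_sqrt @ a_hat @ d_inv_sqrt @ features @ weights)
```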
arXiv Detail & Related papers (2022-04-21T20:59:56Z)
- Collaborative residual learners for automatic icd10 prediction using prescribed medications [45.82374977939355]
We propose a novel collaborative residual learning-based model that automatically predicts ICD10 codes using only prescription data.
For multi-label classification, we obtain an average precision of 0.71 and 0.57, an F1-score of 0.57 and 0.38, and an accuracy of 0.73 and 0.44 when predicting the principal diagnosis on the inpatient and outpatient datasets, respectively.
arXiv Detail & Related papers (2020-12-16T07:07:27Z)
- Multi-label natural language processing to identify diagnosis and procedure codes from MIMIC-III inpatient notes [0.0]
In the United States, administrative costs involving medical coding and billing account for 25% of hospital spending, more than 200 billion dollars.
Natural language processing can automate the extraction of codes/labels from unstructured clinical notes.
Our model achieved an overall accuracy of 87.08%, an F1 score of 85.82%, and an AUC of 91.76% for top-10 codes.
arXiv Detail & Related papers (2020-03-17T02:56:27Z)
- Natural language processing of MIMIC-III clinical notes for identifying diagnosis and procedures with neural networks [0.0]
We report the performance of a natural language processing model that can map clinical notes to medical codes.
We employed ULMFiT, a state-of-the-art deep learning method, on MIMIC-III, the largest available dataset of emergency department clinical notes.
Our models predicted the top-10 diagnoses and procedures with 80.3% and 80.5% accuracy, while the top-50 ICD-9 diagnosis and procedure codes were predicted with 70.7% and 63.9% accuracy.
arXiv Detail & Related papers (2019-12-28T04:05:15Z)