Related papers: Fine-Tuning In-House Large Language Models to Infer Differential Diagnosis from Radiology Reports

Fine-Tuning In-House Large Language Models to Infer Differential Diagnosis from Radiology Reports

URL: http://arxiv.org/abs/2410.09234v1
Date: Fri, 11 Oct 2024 20:16:25 GMT
Title: Fine-Tuning In-House Large Language Models to Infer Differential Diagnosis from Radiology Reports
Authors: Luoyao Chen, Revant Teotia, Antonio Verdone, Aidan Cardall, Lakshay Tyagi, Yiqiu Shen, Sumit Chopra,
Abstract summary: This study introduces a pipeline for developing in-house LLMs tailored to identify differential diagnoses from radiology reports. evaluated on a set of 1,067 reports annotated by clinicians, the proposed model achieves an average F1 score of 92.1%, which is on par with GPT-4.
Score: 1.5972172622800358
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Radiology reports summarize key findings and differential diagnoses derived from medical imaging examinations. The extraction of differential diagnoses is crucial for downstream tasks, including patient management and treatment planning. However, the unstructured nature of these reports, characterized by diverse linguistic styles and inconsistent formatting, presents significant challenges. Although proprietary large language models (LLMs) such as GPT-4 can effectively retrieve clinical information, their use is limited in practice by high costs and concerns over the privacy of protected health information (PHI). This study introduces a pipeline for developing in-house LLMs tailored to identify differential diagnoses from radiology reports. We first utilize GPT-4 to create 31,056 labeled reports, then fine-tune open source LLM using this dataset. Evaluated on a set of 1,067 reports annotated by clinicians, the proposed model achieves an average F1 score of 92.1\%, which is on par with GPT-4 (90.8\%). Through this study, we provide a methodology for constructing in-house LLMs that: match the performance of GPT, reduce dependence on expensive proprietary models, and enhance the privacy and security of PHI.

Related papers

Paging Dr. GPT: Extracting Information from Clinical Notes to Enhance Patient Predictions [0.25165775267615204]
We investigate how answers generated by GPT-4o-mini (ChatGPT) to simple clinical questions about patients can support patient-level mortality prediction. Using data from 14,011 first-time admissions to the Coronary Care or Cardiovascular Intensive Care Units in the MIMIC-IV Note dataset, we implement a transparent framework that uses GPT responses as input features in logistic regression models.
arXiv Detail & Related papers (2025-04-14T17:41:45Z)
Leveraging Multimodal Models for Enhanced Neuroimaging Diagnostics in Alzheimer's Disease [0.7696359453385685]
This paper generates synthetic diagnostic reports using GPT-4o-mini on structured data from the OASIS-4 dataset. Using the synthetic reports as ground truth for training and validation, we then generated neurological reports directly from the images in the dataset. Our proposed method achieved a BLEU-4 score of 0.1827, ROUGE-L score of 0.3719, and METEOR score of 0.4163, revealing its potential in generating clinically relevant and accurate diagnostic reports.
arXiv Detail & Related papers (2024-11-12T15:28:06Z)
BURExtract-Llama: An LLM for Clinical Concept Extraction in Breast Ultrasound Reports [9.739220217225435]
This study presents a pipeline for developing an in-house LLM to extract clinical information from radiology reports. We first use GPT-4 to create a small labeled dataset, then fine-tune a Llama3-8B model on it. Our findings demonstrate the feasibility of developing an in-house LLM that not only matches GPT-4's performance but also offers cost reductions and enhanced data privacy.
arXiv Detail & Related papers (2024-08-21T04:33:05Z)
Assessing and Enhancing Large Language Models in Rare Disease Question-answering [64.32570472692187]
We introduce a rare disease question-answering (ReDis-QA) dataset to evaluate the performance of Large Language Models (LLMs) in diagnosing rare diseases. We collected 1360 high-quality question-answer pairs within the ReDis-QA dataset, covering 205 rare diseases. We then benchmarked several open-source LLMs, revealing that diagnosing rare diseases remains a significant challenge for these models. Experiment results demonstrate that ReCOP can effectively improve the accuracy of LLMs on the ReDis-QA dataset by an average of 8%.
arXiv Detail & Related papers (2024-08-15T21:09:09Z)
MGH Radiology Llama: A Llama 3 70B Model for Radiology [50.42811030970618]
This paper presents an advanced radiology-focused large language model: MGH Radiology Llama. It is developed using the Llama 3 70B model, building upon previous domain-specific models like Radiology-GPT and Radiology-Llama2. Our evaluation, incorporating both traditional metrics and a GPT-4-based assessment, highlights the enhanced performance of this work over general-purpose LLMs.
arXiv Detail & Related papers (2024-08-13T01:30:03Z)
Classifying Cancer Stage with Open-Source Clinical Large Language Models [0.35998666903987897]
Open-source clinical large language models (LLMs) can extract pathologic tumor-node-metastasis (pTNM) staging information from real-world pathology reports. Our findings suggest that while LLMs still exhibit subpar performance in Tumor (T) classification, with the appropriate adoption of prompting strategies, they can achieve comparable performance on Metastasis (M) and improved performance on Node (N) classification.
arXiv Detail & Related papers (2024-04-02T02:30:47Z)
Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
Training open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology. For training, we assemble a large dataset of over 697 thousand radiology image-text pairs. For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation. The inference of LlaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z)
Leveraging Professional Radiologists' Expertise to Enhance LLMs' Evaluation for Radiology Reports [22.599250713630333]
Our proposed method synergizes the expertise of professional radiologists with Large Language Models (LLMs) Our approach aligns LLM evaluations with radiologist standards, enabling detailed comparisons between human and AI generated reports. Experimental results show that our "Detailed GPT-4 (5-shot)" model achieves a 0.48 score, outperforming the METEOR metric by 0.19.
arXiv Detail & Related papers (2024-01-29T21:24:43Z)
Distilling Large Language Models for Matching Patients to Clinical Trials [3.4068841624198942]
The recent success of large language models (LLMs) has paved the way for their adoption in the high-stakes domain of healthcare. This study presents the first systematic examination of the efficacy of both proprietary (GPT-3.5, and GPT-4) and open-source LLMs (LLAMA 7B,13B, and 70B) for the task of patient-trial matching. Our findings reveal that open-source LLMs, when fine-tuned on this limited and synthetic dataset, demonstrate performance parity with their proprietary counterparts.
arXiv Detail & Related papers (2023-12-15T17:11:07Z)
ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations. The clinical dataset utilized in this study encompasses a remarkable total of textbf332,673 observations. ChatRadio-Valuer consistently outperforms state-of-the-art models, especially ChatGPT (GPT-3.5-Turbo) and GPT-4 et al.
arXiv Detail & Related papers (2023-10-08T17:23:17Z)
Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites: A Federated Learning Approach with Noise-Resilient Training [75.40980802817349]
Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area. We introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions. We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites.
arXiv Detail & Related papers (2023-08-31T00:36:10Z)
Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning. They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health. Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain. We annotated a corpus of clinical documents according to 12 types of identifying entities. We build a hybrid system, merging the results of a deep learning model as well as manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.