Leveraging Fine-Tuned Large Language Models for Interpretable Pancreatic Cystic Lesion Feature Extraction and Risk Categorization
- URL: http://arxiv.org/abs/2507.19973v1
- Date: Sat, 26 Jul 2025 15:02:32 GMT
- Title: Leveraging Fine-Tuned Large Language Models for Interpretable Pancreatic Cystic Lesion Feature Extraction and Risk Categorization
- Authors: Ebrahim Rasromani, Stella K. Kang, Yanqi Xu, Beisong Liu, Garvit Luhadia, Wan Fung Chui, Felicia L. Pasadyn, Yu Chih Hung, Julie Y. An, Edwin Mathieu, Zehui Gu, Carlos Fernandez-Granda, Ammar A. Javed, Greg D. Sacks, Tamas Gonda, Chenchan Huang, Yiqiu Shen
- Abstract summary: Manual extraction of pancreatic cystic lesion (PCL) features from radiology reports is labor-intensive. This work develops and evaluates large language models (LLMs) that automatically extract PCL features from MRI/CT reports.
- Score: 9.840625513935343
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background: Manual extraction of pancreatic cystic lesion (PCL) features from radiology reports is labor-intensive, limiting large-scale studies needed to advance PCL research. Purpose: To develop and evaluate large language models (LLMs) that automatically extract PCL features from MRI/CT reports and assign risk categories based on guidelines. Materials and Methods: We curated a training dataset of 6,000 abdominal MRI/CT reports (2005-2024) from 5,134 patients that described PCLs. Labels were generated by GPT-4o using chain-of-thought (CoT) prompting to extract PCL and main pancreatic duct features. Two open-source LLMs were fine-tuned using QLoRA on GPT-4o-generated CoT data. Features were mapped to risk categories per institutional guideline based on the 2017 ACR White Paper. Evaluation was performed on 285 held-out human-annotated reports. Model outputs for 100 cases were independently reviewed by three radiologists. Feature extraction was evaluated using exact match accuracy, risk categorization with macro-averaged F1 score, and radiologist-model agreement with Fleiss' Kappa. Results: CoT fine-tuning improved feature extraction accuracy for LLaMA (80% to 97%) and DeepSeek (79% to 98%), matching GPT-4o (97%). Risk categorization F1 scores also improved (LLaMA: 0.95; DeepSeek: 0.94), closely matching GPT-4o (0.97), with no statistically significant differences. Radiologist inter-reader agreement was high (Fleiss' Kappa = 0.888) and showed no statistically significant difference with the addition of DeepSeek-FT-CoT (Fleiss' Kappa = 0.893) or GPT-CoT (Fleiss' Kappa = 0.897), indicating that both models achieved agreement levels on par with radiologists. Conclusion: Fine-tuned open-source LLMs with CoT supervision enable accurate, interpretable, and efficient phenotyping for large-scale PCL research, achieving performance comparable to GPT-4o.
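To make the fine-tuning setup concrete, below is a minimal sketch of QLoRA-style training on chain-of-thought extraction data using the Hugging Face transformers and peft libraries. The checkpoint name, hyperparameters, data file, and prompt format are illustrative assumptions, not the authors' configuration.

```python
# Minimal QLoRA fine-tuning sketch (illustrative; not the authors' code).
# Assumes a JSONL file of {"report": ..., "cot_label": ...} pairs produced
# by a GPT-4o chain-of-thought labeling step, as the abstract describes.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint

# 4-bit NF4 quantization of the frozen base model: the "Q" in QLoRA.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters on the attention projections; only these are trained.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

def format_example(ex):
    # Supervise on the CoT rationale plus the structured feature output.
    return {"text": f"Report:\n{ex['report']}\n\nReasoning and features:\n{ex['cot_label']}"}

train = load_dataset("json", data_files="pcl_cot_train.jsonl")["train"].map(format_example)
# From here, any causal-LM trainer (e.g., trl's SFTTrainer) can fit the adapters.
```

Because only the LoRA weights (a small fraction of the parameters) are updated over a 4-bit base model, this style of fine-tuning is feasible on a single GPU, which is part of what makes open-source LLMs practical for report phenotyping at scale.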
Related papers
- Handcrafted vs. Deep Radiomics vs. Fusion vs. Deep Learning: A Comprehensive Review of Machine Learning-Based Cancer Outcome Prediction in PET and SPECT Imaging [0.7573820776203027]
This systematic review analyzed 226 studies published from 2020 to 2025 that applied machine learning to PET or SPECT imaging for outcome prediction. PET-based studies generally outperformed those using SPECT, likely due to higher spatial resolution and sensitivity. Common limitations included inadequate handling of class imbalance, missing data, and low population diversity.
arXiv Detail & Related papers (2025-07-21T21:03:12Z)
- Evaluating Large Language Models for Zero-Shot Disease Labeling in CT Radiology Reports Across Organ Systems [1.1373722549440357]
We compare a rule-based algorithm (RBA), RadBERT, and three lightweight open-weight LLMs for multi-disease labeling of chest, abdomen, and pelvis CT reports. Performance was evaluated using Cohen's Kappa and micro/macro-averaged F1 scores; a toy computation of these metrics follows this entry.
arXiv Detail & Related papers (2025-06-03T18:00:08Z)
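Both the entry above and the main paper lean on chance-corrected agreement and averaged F1 metrics (the main paper reports Fleiss' Kappa = 0.888 for radiologist agreement). As a reference point, here is a minimal sketch of how Cohen's Kappa, micro/macro-averaged F1, and Fleiss' Kappa can be computed with scikit-learn and statsmodels; the labels are fabricated for illustration.

```python
# Toy illustration of the agreement metrics named above (fabricated labels).
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

y_true = ["low", "intermediate", "high", "low", "high", "low"]
y_pred = ["low", "intermediate", "low", "low", "high", "low"]

print(cohen_kappa_score(y_true, y_pred))          # pairwise, chance-corrected agreement
print(f1_score(y_true, y_pred, average="micro"))  # pooled over all decisions
print(f1_score(y_true, y_pred, average="macro"))  # unweighted mean over classes

# Fleiss' Kappa generalizes agreement to >2 raters: rows are cases, columns raters.
ratings = np.array([
    ["low", "low", "low"],
    ["high", "high", "intermediate"],
    ["low", "intermediate", "low"],
    ["high", "high", "high"],
])
table, _ = aggregate_raters(ratings)  # case x category count matrix
print(fleiss_kappa(table, method="fleiss"))
```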
- ECTIL: Label-efficient Computational Tumour Infiltrating Lymphocyte (TIL) assessment in breast cancer: Multicentre validation in 2,340 patients with breast cancer [17.91294880294883]
The level of tumour-infiltrating lymphocytes (TILs) is a prognostic factor for patients with (triple-negative) breast cancer. Current computational TIL assessment (CTA) models rely heavily on many detailed annotations. We propose a fundamentally simpler deep-learning-based model that can be trained in only ten minutes on hundredfold fewer pathologist annotations.
arXiv Detail & Related papers (2025-01-24T10:28:05Z)
- Deep Radiomics Detection of Clinically Significant Prostate Cancer on Multicenter MRI: Initial Comparison to PI-RADS Assessment [0.0]
This study analyzed biparametric (T2W and DW) prostate MRI sequences of 615 patients (mean age, 63.1 ± 7 years) from four datasets acquired between 2010 and 2020.
The deep radiomics machine learning model achieved performance comparable to PI-RADS assessment in csPCa detection at the patient level but not at the lesion level.
arXiv Detail & Related papers (2024-10-21T17:41:58Z)
- CIMIL-CRC: a clinically-informed multiple instance learning framework for patient-level colorectal cancer molecular subtypes classification from H&E stained images [42.771819949806655]
We introduce CIMIL-CRC, a framework that solves the MSI/MSS MIL problem by efficiently combining a pre-trained feature extraction model with principal component analysis (PCA) to aggregate information from all patches; a schematic sketch of this pattern follows this entry.
We assessed our CIMIL-CRC method using the average area under the curve (AUC) from a 5-fold cross-validation experimental setup for model development on the TCGA-CRC-DX cohort.
arXiv Detail & Related papers (2024-01-29T12:56:11Z)
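The CIMIL-CRC entry above aggregates pre-trained patch features with PCA before patient-level classification. A minimal sketch of that general pattern follows; the random patch embeddings, the mean aggregation, and the classifier choice are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch: reduce per-patch embeddings with PCA, pool into one patient vector,
# then classify at the patient level (illustrative, not the CIMIL-CRC code).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Assume each patient slide yields many patch embeddings (e.g., from a CNN).
patients = [rng.normal(size=(rng.integers(50, 200), 512)) for _ in range(20)]
labels = rng.integers(0, 2, size=20)  # e.g., MSI vs. MSS

pca = PCA(n_components=32).fit(np.vstack(patients))              # fit on all patches
X = np.stack([pca.transform(p).mean(axis=0) for p in patients])  # per-patient mean

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.score(X, labels))
```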
- Attention-based Saliency Maps Improve Interpretability of Pneumothorax Classification [52.77024349608834]
This work investigates the chest radiograph (CXR) classification performance of vision transformers (ViTs) and the interpretability of attention-based saliency maps.
ViTs were fine-tuned for lung disease classification using four public data sets: CheXpert, Chest X-Ray 14, MIMIC CXR, and VinBigData.
ViTs achieved CXR classification AUCs comparable to those of state-of-the-art CNNs.
arXiv Detail & Related papers (2023-03-03T12:05:41Z)
- Learning to diagnose cirrhosis from radiological and histological labels with joint self and weakly-supervised pretraining strategies [62.840338941861134]
We propose to leverage transfer learning from large datasets annotated by radiologists to predict the histological score available on a small annex dataset.
We compare different pretraining methods, namely weakly-supervised and self-supervised ones, to improve cirrhosis prediction.
This method outperforms the baseline classification of the METAVIR score, reaching an AUC of 0.84 and a balanced accuracy of 0.75.
arXiv Detail & Related papers (2023-02-16T17:06:23Z)
- DeepCOVID-Fuse: A Multi-modality Deep Learning Model Fusing Chest X-Radiographs and Clinical Variables to Predict COVID-19 Risk Levels [8.593516170110203]
DeepCOVID-Fuse is a deep learning fusion model that predicts risk levels in coronavirus patients.
The accuracy of DeepCOVID-Fuse trained on CXRs and clinical variables is 0.658, with an AUC of 0.842.
arXiv Detail & Related papers (2023-01-20T20:54:25Z)
- WSSS4LUAD: Grand Challenge on Weakly-supervised Tissue Semantic Segmentation for Lung Adenocarcinoma [51.50991881342181]
This challenge includes 10,091 patch-level annotations and over 130 million labeled pixels.
The first-place team achieved an mIoU of 0.8413 (tumor: 0.8389, stroma: 0.7931, normal: 0.8919); a minimal mIoU computation follows this entry.
arXiv Detail & Related papers (2022-04-13T15:27:05Z)
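For reference, the mIoU reported above is the unweighted mean of per-class intersection-over-union: (0.8389 + 0.7931 + 0.8919) / 3 = 0.8413. A small numpy sketch over fabricated masks:

```python
# Per-class IoU and mean IoU over a label mask (fabricated example).
import numpy as np

def mean_iou(pred, gt, classes):
    ious = []
    for c in classes:
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(inter / union if union else float("nan"))
    return ious, np.nanmean(ious)

pred = np.array([[0, 0, 1], [1, 2, 2], [2, 2, 0]])
gt   = np.array([[0, 1, 1], [1, 2, 2], [2, 0, 0]])
# Class indices are arbitrary here, e.g., 0=tumor, 1=stroma, 2=normal.
per_class, miou = mean_iou(pred, gt, classes=[0, 1, 2])
print(per_class, miou)
```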
- Machine Learning Automatically Detects COVID-19 using Chest CTs in a Large Multicenter Cohort [43.99203831722203]
Our retrospective study obtained 2096 chest CTs from 16 institutions.
A metric-based approach classified COVID-19 using interpretable features.
A deep learning-based classifier differentiated COVID-19 via 3D features extracted from CT attenuation and probability distribution of airspace opacities.
arXiv Detail & Related papers (2020-06-09T00:40:35Z)
- Automated Quantification of CT Patterns Associated with COVID-19 from Chest CT [48.785596536318884]
The proposed method takes as input a non-contrasted chest CT and segments the lesions, lungs, and lobes in three dimensions.
The method outputs two combined measures of the severity of lung and lobe involvement, quantifying both the extent of COVID-19 abnormalities and presence of high opacities.
Evaluation of the algorithm is reported on CTs of 200 participants (100 COVID-19 confirmed patients and 100 healthy controls) from institutions in Canada, Europe, and the United States.
arXiv Detail & Related papers (2020-04-02T21:49:14Z)
- Machine-Learning-Based Multiple Abnormality Prediction with Large-Scale Chest Computed Tomography Volumes [64.21642241351857]
We curated and analyzed a chest computed tomography (CT) data set of 36,316 volumes from 19,993 unique patients.
We developed a rule-based method for automatically extracting abnormality labels from free-text radiology reports; a toy example of this style of extraction follows this list.
We also developed a model for multi-organ, multi-disease classification of chest CT volumes.
arXiv Detail & Related papers (2020-02-12T00:59:23Z)
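The last entry extracts labels from free-text reports with hand-written rules rather than an LLM, the approach the main paper's fine-tuned models are positioned against. Here is a toy sketch of what such rule-based extraction typically looks like; the patterns, label set, and negation window are invented for illustration.

```python
# Toy rule-based report labeler (invented patterns; not the paper's rules).
import re

RULES = {
    "atelectasis": re.compile(r"\batelectasis\b", re.I),
    "nodule": re.compile(r"\b(nodule|nodular opacity)\b", re.I),
    "pleural_effusion": re.compile(r"\bpleural effusions?\b", re.I),
}
NEGATION = re.compile(r"\b(no|without|negative for)\b", re.I)

def label_report(text: str) -> dict:
    labels = {}
    for name, pattern in RULES.items():
        # Count a finding only if no simple negation cue appears just before it.
        labels[name] = any(
            not NEGATION.search(text[max(0, m.start() - 40):m.start()])
            for m in pattern.finditer(text)
        )
    return labels

print(label_report("Small right pleural effusion. No pulmonary nodule identified."))
# {'atelectasis': False, 'nodule': False, 'pleural_effusion': True}
```

Hand-written rules like these are brittle to phrasing and negation scope, which is precisely the gap the fine-tuned, chain-of-thought-supervised LLMs above aim to close.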