Related papers: Identifying Imaging Follow-Up in Radiology Reports: A Comparative Analysis of Traditional ML and LLM Approaches

URL: http://arxiv.org/abs/2511.11867v1
Date: Fri, 14 Nov 2025 20:55:44 GMT
Title: Identifying Imaging Follow-Up in Radiology Reports: A Comparative Analysis of Traditional ML and LLM Approaches
Authors: Namu Park, Giridhar Kaushik Ramachandran, Kevin Lybarger, Fei Xia, Ozlem Uzuner, Meliha Yetisgen, Martin Gunn
Abstract summary: We introduce an annotated corpus of 6,393 radiology reports from 586 patients, each labeled for follow-up imaging status. We compare traditional machine-learning classifiers, including logistic regression (LR), support vector machines (SVM), Longformer, and a fully fine-tuned Llama3-8B-Instruct. To evaluate generative LLMs, we tested GPT-4o and the open-source GPT-OSS-20B under two configurations.
Score: 8.864020712680976
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have shown considerable promise in clinical natural language processing, yet few domain-specific datasets exist to rigorously evaluate their performance on radiology tasks. In this work, we introduce an annotated corpus of 6,393 radiology reports from 586 patients, each labeled for follow-up imaging status, to support the development and benchmarking of follow-up adherence detection systems. Using this corpus, we systematically compared traditional machine-learning classifiers, including logistic regression (LR), support vector machines (SVM), Longformer, and a fully fine-tuned Llama3-8B-Instruct, with recent generative LLMs. To evaluate generative LLMs, we tested GPT-4o and the open-source GPT-OSS-20B under two configurations: a baseline (Base) and a task-optimized (Advanced) setting that focused inputs on metadata, recommendation sentences, and their surrounding context. A refined prompt for GPT-OSS-20B further improved reasoning accuracy. Performance was assessed using precision, recall, and F1 scores with 95% confidence intervals estimated via non-parametric bootstrapping. Inter-annotator agreement was high (F1 = 0.846). GPT-4o (Advanced) achieved the best performance (F1 = 0.832), followed closely by GPT-OSS-20B (Advanced; F1 = 0.828). LR and SVM also performed strongly (F1 = 0.776 and 0.775), underscoring that while LLMs approach human-level agreement through prompt optimization, interpretable and resource-efficient models remain valuable baselines.
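The abstract reports F1 scores with 95% confidence intervals from non-parametric bootstrapping but does not give the procedure. The following is a minimal sketch of one common variant, the percentile bootstrap, and not necessarily what the authors did. It assumes binary labels (1 = follow-up imaging present) and resampling at the report level; the function names `prf1` and `bootstrap_f1_ci` and the toy labels are hypothetical.

```python
import random

def prf1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def bootstrap_f1_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1 (assumed variant)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Draw n report indices with replacement and rescore the resample.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(prf1([y_true[i] for i in idx],
                           [y_pred[i] for i in idx])[2])
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper

# Toy labels for illustration only; not data from the paper.
y_true = [1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]
precision, recall, f1 = prf1(y_true, y_pred)
ci_low, ci_high = bootstrap_f1_ci(y_true, y_pred)
```

Because the corpus holds several reports per patient, resampling whole patients rather than individual reports would be a reasonable alternative when reports from the same patient are correlated.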

Related papers

Automated Identification of Incidentalomas Requiring Follow-Up: A Multi-Anatomy Evaluation of LLM-Based and Supervised Approaches [5.958100741754613]
We evaluated large language models (LLMs) against supervised baselines for fine-grained, lesion-level detection of incidentalomas. We introduced a novel inference strategy using lesion-tagged inputs and anatomy-aware prompting to ground model reasoning. The anatomy-informed GPT-OSS-20b model achieved the highest performance, yielding an incidentaloma-positive macro-F1 of 0.79.
arXiv Detail & Related papers (2025-12-05T08:49:57Z)
Automated Analysis of Learning Outcomes and Exam Questions Based on Bloom's Taxonomy [0.0]
This paper explores the automatic classification of exam questions and learning outcomes according to Bloom's taxonomy. A small dataset of 600 sentences labeled with six cognitive categories was processed using traditional machine learning (ML) models.
arXiv Detail & Related papers (2025-11-14T02:31:12Z)
Adapting General-Purpose Foundation Models for X-ray Ptychography in Low-Data Regimes [8.748610895973075]
We introduce PtychoBench, a new benchmark for ptychographic analysis. We compare two specialization strategies: Supervised Fine-Tuning (SFT) and In-Context Learning (ICL). Our findings reveal that the optimal specialization pathway is task-dependent.
arXiv Detail & Related papers (2025-11-04T11:43:05Z)
MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.74675519953898]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems. We build a benchmark consisting of 1,260 samples of 42 challenging synthetic tasks. We generate post-training data and explore learning paradigms for exploiting such data.
arXiv Detail & Related papers (2025-10-09T17:53:58Z)
When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs [55.20230501807337]
We present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from the Llama, Qwen and Gemma families across 52 tasks from the Natural Instructions dataset.
arXiv Detail & Related papers (2025-08-15T10:32:50Z)
Large Language Models for Automating Clinical Data Standardization: HL7 FHIR Use Case [0.2516393111664279]
We introduce a semi-automated approach to convert structured clinical datasets into HL7 FHIR format. In an initial benchmark, resource identification achieved a perfect F1-score, with GPT-4o outperforming Llama 3.2. Error analysis revealed occasional hallucinations of non-existent attributes and mismatches in granularity, which more detailed prompts can mitigate.
arXiv Detail & Related papers (2025-07-03T17:32:57Z)
Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning [77.120955854093]
We show that data diversity can be a strong predictor of generalization in language models. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. We present Prismatic Synthesis, a framework for generating diverse synthetic data.
arXiv Detail & Related papers (2025-05-26T16:05:10Z)
Scalable Unit Harmonization in Medical Informatics via Bayesian-Optimized Retrieval and Transformer-Based Re-ranking [0.0]
We develop a scalable methodology for harmonizing inconsistent units in large-scale clinical datasets. We implement a multi-stage pipeline: filtering, identification, harmonization proposal generation, automated re-ranking, and manual validation. The system achieved 83.39% precision at rank 1 and 94.66% recall at rank 5.
arXiv Detail & Related papers (2025-05-01T19:09:15Z)
LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs. We introduce LLM2, a novel framework that combines an LLM with a process-based verifier. The LLM is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z)
Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset. We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6. Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z)
LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
