Standardizing Longitudinal Radiology Report Evaluation via Large Language Model Annotation
- URL: http://arxiv.org/abs/2601.16753v1
- Date: Fri, 23 Jan 2026 13:57:09 GMT
- Title: Standardizing Longitudinal Radiology Report Evaluation via Large Language Model Annotation
- Authors: Xinyi Wang, Grazziela Figueredo, Ruizhe Li, Xin Chen
- Abstract summary: Longitudinal information in radiology reports refers to the sequential tracking of findings across multiple examinations over time. There is no proper tool to consistently label temporal changes in both ground-truth and model-generated texts. Existing annotation methods are typically labor-intensive, relying on manually curated lexicons and rules.
- Score: 10.771534459008699
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Longitudinal information in radiology reports refers to the sequential tracking of findings across multiple examinations over time, which is crucial for monitoring disease progression and guiding clinical decisions. Many recent automated radiology report generation methods are designed to capture longitudinal information; however, validating their performance is challenging. There is no proper tool to consistently label temporal changes in both ground-truth and model-generated texts for meaningful comparisons. Existing annotation methods are typically labor-intensive, relying on manually curated lexicons and rules. Complex rules are closed-source, domain-specific, and hard to adapt, whereas overly simple ones tend to miss essential specialised information. Large language models (LLMs) offer a promising annotation alternative, as they are capable of capturing nuanced linguistic patterns and semantic similarities without extensive manual intervention. They also adapt well to new contexts. In this study, we therefore propose an LLM-based pipeline to automatically annotate longitudinal information in radiology reports. The pipeline first identifies sentences containing relevant information and then extracts the progression of diseases. We evaluate and compare five mainstream LLMs on these two tasks using 500 manually annotated reports. Considering both efficiency and performance, Qwen2.5-32B was subsequently selected and used to annotate another 95,169 reports from the public MIMIC-CXR dataset. Our Qwen2.5-32B-annotated dataset provides a standardized benchmark for evaluating report generation models. Using this new benchmark, we assessed seven state-of-the-art report generation models. Our LLM-based annotation method outperforms existing annotation solutions, achieving 11.3% and 5.3% higher F1-scores for longitudinal information detection and disease tracking, respectively.
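To make the two-stage design concrete, here is a minimal sketch of how such an annotation pipeline could be driven, assuming an OpenAI-compatible endpoint serving Qwen2.5-32B (e.g. via vLLM). The prompt wording, label set, and client setup are illustrative assumptions, not the authors' exact configuration:

```python
from openai import OpenAI

# Hypothetical local deployment; any OpenAI-compatible server works here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "Qwen/Qwen2.5-32B-Instruct"

DETECT_PROMPT = (
    "Does the following radiology report sentence compare the current exam "
    "with a prior exam? Answer only 'yes' or 'no'.\nSentence: {sent}"
)
TRACK_PROMPT = (
    "For the comparison sentence below, name the finding and its progression "
    "as one of: improved, worsened, stable, new, resolved.\nSentence: {sent}"
)

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic annotation
    )
    return resp.choices[0].message.content.strip()

def annotate(sentences: list[str]) -> list[dict]:
    """Stage 1: detect longitudinal sentences; Stage 2: extract progression."""
    out = []
    for sent in sentences:
        if ask(DETECT_PROMPT.format(sent=sent)).lower().startswith("yes"):
            out.append({"sentence": sent,
                        "progression": ask(TRACK_PROMPT.format(sent=sent))})
    return out

print(annotate(["The left pleural effusion has decreased in size.",
                "Heart size is normal."]))
```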
Related papers
- Ontology-Based Concept Distillation for Radiology Report Retrieval and Labeling [10.504309161945065]
Most existing methods rely on comparing high-dimensional text embeddings from models like CLIP or CXR-BERT. We propose a novel, ontology-driven alternative for comparing radiology report texts based on clinically grounded concepts from the Unified Medical Language System. Our method extracts standardised medical entities from free-text reports using an enhanced pipeline built on RadGraph-XL and SapBERT.
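As a hedged illustration of why concept-level comparison is attractive: once each report is mapped to a set of UMLS concept identifiers (the paper uses a RadGraph-XL + SapBERT pipeline for that step), reports can be compared by set overlap rather than embedding distance. The CUIs in the example are placeholders:

```python
def concept_f1(pred_cuis: set[str], ref_cuis: set[str]) -> float:
    """F1 overlap between two sets of extracted clinical concepts (CUIs)."""
    if not pred_cuis and not ref_cuis:
        return 1.0
    tp = len(pred_cuis & ref_cuis)
    precision = tp / len(pred_cuis) if pred_cuis else 0.0
    recall = tp / len(ref_cuis) if ref_cuis else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Placeholder CUIs standing in for extracted concepts.
print(concept_f1({"C0032227", "C0024109"}, {"C0032227"}))
```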
arXiv Detail & Related papers (2025-08-27T14:20:50Z)
- Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation [10.440241401950745]
We introduce LUNGUAGE, a benchmark dataset for structured radiology report generation. It contains 1,473 annotated chest X-ray reports, each reviewed by experts; 80 of them contain longitudinal annotations. Using this benchmark, we develop a two-stage framework that transforms generated reports into fine-grained, schema-aligned structured representations. We also propose LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute level while modeling temporal consistency.
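A toy sketch of the entity-level part of such a structured comparison (the real LUNGUAGESCORE also scores relations, attributes, and temporal consistency; the schema below is an invented simplification):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    name: str      # e.g. "pleural effusion"
    location: str  # e.g. "left"
    change: str    # e.g. "decreased" / "stable" / "new"

def entity_f1(pred: set[Entity], ref: set[Entity]) -> float:
    """Exact-match F1 over (name, location, change) triples."""
    if not pred or not ref:
        return 1.0 if pred == ref else 0.0
    tp = len(pred & ref)
    p, r = tp / len(pred), tp / len(ref)
    return 2 * p * r / (p + r) if p + r else 0.0

gold = {Entity("pleural effusion", "left", "decreased")}
pred = {Entity("pleural effusion", "left", "decreased"),
        Entity("cardiomegaly", "", "stable")}
print(entity_f1(pred, gold))  # 2/3: one match, one spurious entity
```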
arXiv Detail & Related papers (2025-05-27T13:40:00Z)
- High-Fidelity Pseudo-label Generation by Large Language Models for Training Robust Radiology Report Classifiers [0.2158126716116375]
DeBERTa-RAD is a novel framework that combines state-of-the-art LLM pseudo-labeling with efficient DeBERTa-based knowledge distillation for accurate and fast chest X-ray report labeling. Evaluated on the expert-annotated MIMIC-500 benchmark, DeBERTa-RAD achieves a state-of-the-art Macro F1 score of 0.9120.
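A minimal sketch of the distillation step, assuming soft multi-label pseudo-labels in [0, 1] produced by the LLM; the loss choice is an assumption for illustration, not necessarily the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      llm_soft_labels: torch.Tensor) -> torch.Tensor:
    """Match the student's per-finding probabilities to the LLM's soft
    pseudo-labels (one column per finding, values in [0, 1])."""
    return F.binary_cross_entropy_with_logits(student_logits, llm_soft_labels)

logits = torch.randn(4, 14)  # batch of 4 reports, 14 CheXpert-style findings
soft = torch.rand(4, 14)     # LLM-derived pseudo-label confidences
print(distillation_loss(logits, soft))
```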
arXiv Detail & Related papers (2025-05-03T04:50:55Z)
- HC-LLM: Historical-Constrained Large Language Models for Radiology Report Generation [89.3260120072177]
We propose a novel Historical-Constrained Large Language Models (HC-LLM) framework for radiology report generation. Our approach extracts both time-shared and time-specific features from longitudinal chest X-rays and diagnostic reports to capture disease progression. Notably, our approach performs well even without historical data during testing and can be easily adapted to other multimodal large models.
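One way to picture a historical constraint of this kind is a consistency penalty that pulls the "time-shared" components of consecutive exams together; the two-head split and cosine penalty below are assumptions for illustration, not HC-LLM's actual objective:

```python
import torch
import torch.nn.functional as F

def shared_consistency_loss(shared_prev: torch.Tensor,
                            shared_curr: torch.Tensor) -> torch.Tensor:
    """Cosine penalty: 'time-shared' features of consecutive exams should agree."""
    return 1 - F.cosine_similarity(shared_prev, shared_curr, dim=-1).mean()

prev = torch.randn(2, 256)  # shared features from the prior exam
curr = torch.randn(2, 256)  # shared features from the current exam
print(shared_consistency_loss(prev, curr))
```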
arXiv Detail & Related papers (2024-12-15T06:04:16Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Guidance in Radiology Report Summarization: An Empirical Evaluation and Error Analysis [3.0204520109309847]
We propose a domain-agnostic guidance signal for summarizing radiology reports.
We run an expert evaluation of four systems according to a taxonomy of 11 fine-grained errors.
We find that the most pressing differences between automatic summaries and those of radiologists relate to content selection.
arXiv Detail & Related papers (2023-07-24T13:54:37Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs)
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
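A minimal sketch of the self-supervised idea: apply a controlled transformation to your own corpus and measure how much the model's output shifts, with no human labels required. The transform and scoring callables are placeholders:

```python
from typing import Callable

def sensitivity(texts: list[str],
                transform: Callable[[str], str],
                model_score: Callable[[str], float]) -> float:
    """Mean absolute change in a model's score under a text transformation."""
    deltas = [abs(model_score(t) - model_score(transform(t))) for t in texts]
    return sum(deltas) / len(deltas)

# Trivial stand-ins: a real run would plug in an LLM scoring function.
print(sensitivity(["No acute cardiopulmonary disease."],
                  transform=str.upper,
                  model_score=lambda t: float(sum(map(ord, t)))))
```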
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- Automated Labeling of German Chest X-Ray Radiology Reports using Deep Learning [50.591267188664666]
We propose a deep learning-based CheXpert label prediction model, pre-trained on reports labeled by a rule-based German CheXpert model.
Our results demonstrate the effectiveness of our approach, which significantly outperformed the rule-based model on all three tasks.
arXiv Detail & Related papers (2023-06-09T16:08:35Z)
- Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
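A short sketch of the extract-then-train pattern, assuming the LLM is prompted to emit one JSON row per report; the field names and `ask_llm` callable are illustrative, not the paper's schema:

```python
import json
from sklearn.tree import DecisionTreeClassifier

FIELDS = ["age", "tumor_size_mm", "lymph_nodes_positive"]  # invented schema

def extract_row(report: str, ask_llm) -> dict:
    """Prompt an LLM to emit one JSON row of tabular features."""
    prompt = (f"Extract the fields {FIELDS} from the report below. "
              f"Reply with JSON only; use null for absent values.\n{report}")
    return json.loads(ask_llm(prompt))

# Once rows are collected into a matrix X with labels y, an interpretable
# model can be fit and inspected (dummy values shown):
X = [[54, 12, 0], [67, 30, 2]]
y = [0, 1]
clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
```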
arXiv Detail & Related papers (2023-06-08T09:12:28Z)
- An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians.
Recent studies have achieved promising results in automatic impression generation using large-scale medical text data.
However, these models often require substantial amounts of medical text data and have poor generalization performance.
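As a rough illustration of an iterative optimizing loop of this kind (the generator, scorer, and round budget below are placeholders, not ImpressionGPT's actual components):

```python
def iterative_impression(findings: str, generate, score, rounds: int = 3) -> str:
    """Generate, score, and fold feedback into the next prompt; keep the best."""
    prompt = f"Summarize these findings as an Impression:\n{findings}"
    best, best_score = "", float("-inf")
    for _ in range(rounds):
        draft = generate(prompt)
        s = score(draft)
        if s > best_score:
            best, best_score = draft, s
        prompt += f"\nPrevious draft (score {s:.2f}):\n{draft}\nImprove it."
    return best
```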
arXiv Detail & Related papers (2023-04-17T17:13:42Z)