Related papers: S-RRG-Bench: Structured Radiology Report Generation with Fine-Grained Evaluation Framework

URL: http://arxiv.org/abs/2508.02082v1
Date: Mon, 04 Aug 2025 05:49:41 GMT
Title: S-RRG-Bench: Structured Radiology Report Generation with Fine-Grained Evaluation Framework
Authors: Yingshu Li, Yunyi Liu, Zhanyu Wang, Xinyu Liang, Lingqiao Liu, Lei Wang, Luping Zhou
Abstract summary: Radiology report generation (RRG) for diagnostic images, such as chest X-rays, plays a pivotal role in both clinical practice and AI. Traditional free-text reports suffer from redundancy and inconsistent language, complicating the extraction of critical clinical details. We present a novel approach to S-RRG that includes dataset construction, model training, and the introduction of a new evaluation framework.
Score: 39.542375803362965
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Radiology report generation (RRG) for diagnostic images, such as chest X-rays, plays a pivotal role in both clinical practice and AI. Traditional free-text reports suffer from redundancy and inconsistent language, complicating the extraction of critical clinical details. Structured radiology report generation (S-RRG) offers a promising solution by organizing information into standardized, concise formats. However, existing approaches often rely on classification or visual question answering (VQA) pipelines that require predefined label sets and produce only fragmented outputs. Template-based approaches, which generate reports by replacing keywords within fixed sentence patterns, further compromise expressiveness and often omit clinically important details. In this work, we present a novel approach to S-RRG that includes dataset construction, model training, and the introduction of a new evaluation framework. We first create a robust chest X-ray dataset (MIMIC-STRUC) that includes disease names, severity levels, probabilities, and anatomical locations, ensuring that the dataset is both clinically relevant and well-structured. We train an LLM-based model to generate standardized, high-quality reports. To assess the generated reports, we propose a specialized evaluation metric (S-Score) that not only measures disease prediction accuracy but also evaluates the precision of disease-specific details, thus offering a clinically meaningful metric for report quality that focuses on elements critical to clinical decision-making and demonstrates a stronger alignment with human assessments. Our approach highlights the effectiveness of structured reports and the importance of a tailored evaluation metric for S-RRG, providing a more clinically relevant measure of report quality.

Related papers

Clinically Grounded Agent-based Report Evaluation: An Interpretable Metric for Radiology Report Generation [32.410641778559544]
ICARE (Interpretable and Clinically-grounded Agent-based Report Evaluation) is an interpretable evaluation framework. Two agents, each with either the ground-truth or generated report, generate clinically meaningful questions and quiz each other. By linking scores to question-answer pairs, ICARE enables transparent and interpretable assessment.
arXiv Detail & Related papers (2025-08-04T18:28:03Z)
Automated Structured Radiology Report Generation [11.965406008391371]
We introduce Structured Radiology Report Generation (SRRG), a new task that reformulates free-text radiology reports into a standardized format. We create a novel dataset by restructuring reports using large language models (LLMs) following strict structured reporting desiderata. We also introduce SRR-BERT, a fine-grained disease classification model trained on 55 labels, enabling more precise and clinically informed evaluation of structured reports.
arXiv Detail & Related papers (2025-05-30T05:23:01Z)
CLEAR: A Clinically-Grounded Tabular Framework for Radiology Report Evaluation [19.416198842242856]
We introduce a Clinically-grounded framework with Expert-curated labels and Attribute-level comparison for Radiology report evaluation (CLEAR). CLEAR examines whether a report can accurately identify the presence or absence of medical conditions. To measure the clinical alignment of CLEAR, we collaborate with five board-certified radiologists to develop CLEAR-Bench, a dataset of 100 chest X-ray reports from MIMIC-CXR.
arXiv Detail & Related papers (2025-05-22T07:32:12Z)
GEMA-Score: Granular Explainable Multi-Agent Scoring Framework for Radiology Report Evaluation [7.838068874909676]
Granular Explainable Multi-Agent Score (GEMA-Score) conducts both objective and subjective evaluation through a large language model-based multi-agent workflow. GEMA-Score achieves the highest correlation with human expert evaluations on a public dataset.
arXiv Detail & Related papers (2025-03-07T11:42:22Z)
Improving Radiology Report Conciseness and Structure via Local Large Language Models [0.0]
Radiology reports are often lengthy and unstructured, posing challenges for referring physicians. This retrospective study aimed to enhance radiology reports by making them concise and well-structured.
arXiv Detail & Related papers (2024-11-06T19:00:57Z)
RaTEScore: A Metric for Radiology Report Generation [59.37561810438641]
This paper introduces a novel, entity-aware metric, Radiological Report (Text) Evaluation (RaTEScore). RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions. Our evaluations demonstrate that RaTEScore aligns more closely with human preference than existing metrics, validated both on established public benchmarks and our newly proposed RaTE-Eval benchmark.
arXiv Detail & Related papers (2024-06-24T17:49:28Z)
Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z)
Radiology Report Generation Using Transformers Conditioned with Non-imaging Data [55.17268696112258]
This paper proposes a novel multi-modal transformer network that integrates chest x-ray (CXR) images and associated patient demographic information. The proposed network uses a convolutional neural network to extract visual features from CXRs and a transformer-based encoder-decoder network that combines the visual features with semantic text embeddings of patient demographic information.
arXiv Detail & Related papers (2023-11-18T14:52:26Z)
ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations. The clinical dataset utilized in this study encompasses a remarkable total of 332,673 observations. ChatRadio-Valuer consistently outperforms state-of-the-art models, especially ChatGPT (GPT-3.5-Turbo) and GPT-4.
arXiv Detail & Related papers (2023-10-08T17:23:17Z)
Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites: A Federated Learning Approach with Noise-Resilient Training [75.40980802817349]
Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area. We introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions. We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites.
arXiv Detail & Related papers (2023-08-31T00:36:10Z)
FlexR: Few-shot Classification with Language Embeddings for Structured Reporting of Chest X-rays [37.15474283789249]
We propose a method to predict clinical findings defined by sentences in structured reporting templates. The approach involves training a contrastive language-image model using chest X-rays and related free-text radiological reports. Results show that even with limited image-level annotations for training, the method can accomplish the structured reporting tasks of severity assessment of cardiomegaly and localizing pathologies in chest X-rays.
arXiv Detail & Related papers (2022-03-29T16:31:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.