RadReason: Radiology Report Evaluation Metric with Reasons and Sub-Scores
- URL: http://arxiv.org/abs/2508.15464v1
- Date: Thu, 21 Aug 2025 11:34:30 GMT
- Title: RadReason: Radiology Report Evaluation Metric with Reasons and Sub-Scores
- Authors: Yingshu Li, Yunyi Liu, Lingqiao Liu, Lei Wang, Luping Zhou
- Abstract summary: We introduce RadReason, a novel evaluation framework for radiology reports. It outputs fine-grained sub-scores across six clinically defined error types and produces human-readable justifications that explain the rationale behind each score.
- Score: 37.16761198532088
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating automatically generated radiology reports remains a fundamental challenge due to the lack of clinically grounded, interpretable, and fine-grained metrics. Existing methods either produce coarse overall scores or rely on opaque black-box models, limiting their usefulness in real-world clinical workflows. We introduce RadReason, a novel evaluation framework for radiology reports that not only outputs fine-grained sub-scores across six clinically defined error types, but also produces human-readable justifications that explain the rationale behind each score. Our method builds on Group Relative Policy Optimization and incorporates two key innovations: (1) Sub-score Dynamic Weighting, which adaptively prioritizes clinically challenging error types based on live F1 statistics; and (2) Majority-Guided Advantage Scaling, which adjusts policy gradient updates based on prompt difficulty derived from sub-score agreement. Together, these components enable more stable optimization and better alignment with expert clinical judgment. Experiments on the ReXVal benchmark show that RadReason surpasses all prior offline metrics and achieves parity with GPT-4-based evaluations, while remaining explainable, cost-efficient, and suitable for clinical deployment. Code will be released upon publication.
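As a rough illustration of the two innovations described in the abstract, the sketch below gives one plausible reading of Sub-score Dynamic Weighting and Majority-Guided Advantage Scaling. The error-type names, the softmax weighting, and the majority-agreement proxy are all assumptions made for illustration; the paper's code has not been released.

```python
import numpy as np

# Hypothetical labels; the paper defines six clinically grounded error types.
ERROR_TYPES = ["false_finding", "omission", "severity",
               "location", "comparison", "other"]

def dynamic_weights(f1_per_type, temperature=1.0):
    """Sub-score Dynamic Weighting (illustrative): error types with a low
    running F1 receive proportionally more weight in the composite reward."""
    difficulty = 1.0 - np.asarray(f1_per_type)   # harder type -> larger value
    logits = difficulty / temperature
    return np.exp(logits) / np.exp(logits).sum()

def scaled_advantages(rewards, sub_scores, alpha=0.5):
    """Majority-Guided Advantage Scaling (illustrative): GRPO-style
    group-relative advantages, shrunk on easy prompts where the sampled
    sub-scores largely agree."""
    rewards = np.asarray(rewards, dtype=float)
    adv = rewards - rewards.mean()               # group-relative baseline
    votes = np.asarray(sub_scores)               # (n_samples, n_types), binarized
    majority = (votes.mean(axis=0) >= 0.5).astype(int)
    agreement = (votes == majority).mean()       # high agreement -> easy prompt
    return adv * (1.0 - alpha * agreement)
```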
Related papers
- Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering [94.37535002230504]
We develop a training-free, inference-time control framework termed Semantically Decoupled Latent Steering. Our approach constructs a semantic-free intervention vector via large language model (LLM)-driven semantic decomposition. We show that this approach significantly reduces the probability of historical hallucinations.
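Training-free latent steering of this kind is often implemented by adding a precomputed vector to one layer's hidden states at inference time. The sketch below shows that mechanism with a PyTorch forward hook; the layer choice, the `strength` knob, and the origin of `steer_vec` are assumptions, since the paper's own decomposition procedure is only summarized above.

```python
import torch

def add_steering_hook(layer, steer_vec, strength=1.0):
    """Shift a transformer layer's hidden states by a fixed intervention
    vector at inference time; no weights are updated."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * steer_vec.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Hypothetical usage: steer_vec would come from the LLM-driven semantic
# decomposition; here it is just a placeholder.
# handle = add_steering_hook(model.transformer.h[10], steer_vec)
# ... run report generation ...
# handle.remove()
```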
arXiv Detail & Related papers (2026-02-27T04:49:01Z) - From Generative Modeling to Clinical Classification: A GPT-Based Architecture for EHR Notes [0.0]
This study presents a GPT-based architecture for clinical text classification. Rather than updating all model parameters, the majority of the GPT-2 backbone is frozen. The proposed method is evaluated on radiology reports from the MIMIC-IV-Note dataset.
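A minimal sketch of that setup, assuming the backbone is frozen entirely and only a linear head is trained (the paper freezes "the majority" of GPT-2, so exactly which blocks stay trainable may differ). The class name and mean-pooling choice are illustrative.

```python
import torch
from torch import nn
from transformers import GPT2Model

class FrozenGPT2Classifier(nn.Module):
    """GPT-2 backbone with frozen weights plus a small trainable head."""
    def __init__(self, num_labels, model_name="gpt2"):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained(model_name)
        for p in self.backbone.parameters():      # freeze the backbone
            p.requires_grad = False
        self.head = nn.Linear(self.backbone.config.n_embd, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1)       # mean-pool non-padding tokens
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
        return self.head(pooled)
```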
arXiv Detail & Related papers (2026-01-29T16:33:47Z) - AgentsEval: Clinically Faithful Evaluation of Medical Imaging Reports via Multi-Agent Reasoning [73.50200033931148]
We introduce AgentsEval, a multi-agent stream reasoning framework that emulates the collaborative diagnostic workflow of radiologists. By dividing the evaluation process into interpretable steps, including criteria definition, evidence extraction, alignment, and consistency scoring, AgentsEval provides explicit reasoning traces and structured clinical feedback. Experimental results demonstrate that AgentsEval delivers clinically aligned, semantically faithful, and interpretable evaluations that remain robust under paraphrastic, semantic, and stylistic perturbations.
arXiv Detail & Related papers (2026-01-23T11:59:13Z) - CTest-Metric: A Unified Framework to Assess Clinical Validity of Metrics for CT Report Generation [8.08950963137043]
We present CTest-Metric, the first unified metric assessment framework with three modules determining the clinical feasibility of metrics for CT RRG. The modules test: (i) Writing Style Generalizability (WSG) via LLM-based rephrasing; (ii) Synthetic Error Injection (SEI) at graded severities; and (iii) Metrics-vs-Expert correlation (MvE) using clinician ratings on 175 "disagreement" cases. Eight widely used metrics (BLEU, ROUGE, METEOR, BERTScore-F1, F1-RadGraph, Ra
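Of the three modules, Synthetic Error Injection is the most mechanical; the toy version below corrupts a report with a controllable number of clinically meaningful word swaps so a metric's score can be tracked as severity grows. The swap table and severity scale are invented for illustration and are far cruder than the paper's graded taxonomy.

```python
import random

# Invented swap table; CTest-Metric's actual error taxonomy is richer.
SWAPS = {"left": "right", "right": "left", "mild": "severe", "no": "new"}

def inject_errors(report, severity, seed=0):
    """Return `report` with up to `severity` clinically significant word swaps."""
    rng = random.Random(seed)
    words = report.split()
    hits = [i for i, w in enumerate(words) if w.lower() in SWAPS]
    for i in rng.sample(hits, min(severity, len(hits))):
        words[i] = SWAPS[words[i].lower()]
    return " ".join(words)

print(inject_errors("Mild pleural effusion on the left side", severity=2))
# -> "severe pleural effusion on the right side"
```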
arXiv Detail & Related papers (2026-01-16T18:09:19Z) - S-RRG-Bench: Structured Radiology Report Generation with Fine-Grained Evaluation Framework [39.542375803362965]
Radiology report generation (RRG) for diagnostic images, such as chest X-rays, plays a pivotal role in both clinical practice and AI. Traditional free-text reports suffer from redundancy and inconsistent language, complicating the extraction of critical clinical details. We present a novel approach to S-RRG that includes dataset construction, model training, and the introduction of a new evaluation framework.
arXiv Detail & Related papers (2025-08-04T05:49:41Z) - CLARIFID: Improving Radiology Report Generation by Reinforcing Clinically Accurate Impressions and Enforcing Detailed Findings [1.515687944002438]
We propose CLARIFID, a novel framework that directly optimizes diagnostic correctness by mirroring the two-step workflow of experts. CLARIFID learns the logical flow from Findings to Impression through section-aware pretraining. We show that our method achieves superior clinical efficacy and outperforms existing baselines on both standard NLG metrics and clinically aware scores.
arXiv Detail & Related papers (2025-07-23T05:57:59Z) - RaTEScore: A Metric for Radiology Report Generation [59.37561810438641]
This paper introduces a novel entity-aware metric, Radiological Report (Text) Evaluation (RaTEScore).
RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions.
Our evaluations demonstrate that RaTEScore aligns more closely with human preference than existing metrics, validated both on established public benchmarks and our newly proposed RaTE-Eval benchmark.
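The core idea of an entity-aware metric can be shown in a few lines: score overlap at the level of (entity, negation) pairs rather than raw tokens, so a finding and its negation never match. The sketch below is a toy stand-in; RaTEScore itself additionally handles synonyms through learned entity representations, which this omits.

```python
def entity_f1(pred_entities, ref_entities):
    """Toy entity-level F1 over (entity_text, is_negated) pairs."""
    pred, ref = set(pred_entities), set(ref_entities)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# ("pleural effusion", True) encodes "no pleural effusion".
ref  = [("cardiomegaly", False), ("pleural effusion", True)]
pred = [("cardiomegaly", False), ("pleural effusion", False)]  # missed negation
print(entity_f1(pred, ref))   # 0.5: the hallucinated effusion is penalized
```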
arXiv Detail & Related papers (2024-06-24T17:49:28Z) - Leveraging Professional Radiologists' Expertise to Enhance LLMs' Evaluation for Radiology Reports [22.599250713630333]
Our proposed method synergizes the expertise of professional radiologists with Large Language Models (LLMs).
Our approach aligns LLM evaluations with radiologist standards, enabling detailed comparisons between human- and AI-generated reports.
Experimental results show that our "Detailed GPT-4 (5-shot)" model achieves a 0.48 score, outperforming the METEOR metric by 0.19.
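A few-shot LLM evaluator of this kind typically just concatenates graded exemplars ahead of the pair to be judged. The helper below shows one way a "Detailed GPT-4 (5-shot)" prompt might be assembled; the field names and instruction wording are guesses, not the paper's actual prompt.

```python
def build_eval_prompt(examples, reference, candidate):
    """Assemble a hypothetical 5-shot error-counting prompt for an LLM judge."""
    shots = "\n\n".join(
        f"Reference: {e['ref']}\nCandidate: {e['cand']}\nErrors: {e['errors']}"
        for e in examples[:5]                 # five radiologist-graded exemplars
    )
    return ("Count the clinically significant errors in the candidate report "
            "relative to the reference, following the graded examples.\n\n"
            f"{shots}\n\nReference: {reference}\nCandidate: {candidate}\nErrors:")
```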
arXiv Detail & Related papers (2024-01-29T21:24:43Z) - ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations.
The clinical dataset utilized in this study encompasses a remarkable total of 332,673 observations.
ChatRadio-Valuer consistently outperforms state-of-the-art models, especially ChatGPT (GPT-3.5-Turbo) and GPT-4.
arXiv Detail & Related papers (2023-10-08T17:23:17Z) - Learning to diagnose cirrhosis from radiological and histological labels with joint self and weakly-supervised pretraining strategies [62.840338941861134]
We propose to leverage transfer learning from large datasets annotated by radiologists, to predict the histological score available on a small annex dataset.
We compare different pretraining methods, namely weakly-supervised and self-supervised ones, to improve the prediction of cirrhosis.
This method outperforms the baseline classification of the METAVIR score, reaching an AUC of 0.84 and a balanced accuracy of 0.75.
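The transfer step itself reduces to reusing the pretrained encoder and refitting a small head on the histology-labelled set. A minimal sketch, assuming a torchvision-style ResNet encoder and the five METAVIR fibrosis stages (F0-F4) as classes; both are assumptions, as the paper's exact architecture is not restated here.

```python
from torch import nn

def transfer_to_histology(backbone, num_classes=5, freeze=True):
    """Reuse an encoder pretrained on radiologist labels; train a new head
    on the small histology-annotated dataset."""
    if freeze:
        for p in backbone.parameters():
            p.requires_grad = False
    backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)  # ResNet-style head
    return backbone
```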
arXiv Detail & Related papers (2023-02-16T17:06:23Z)