CTest-Metric: A Unified Framework to Assess Clinical Validity of Metrics for CT Report Generation
- URL: http://arxiv.org/abs/2601.11488v1
- Date: Fri, 16 Jan 2026 18:09:19 GMT
- Title: CTest-Metric: A Unified Framework to Assess Clinical Validity of Metrics for CT Report Generation
- Authors: Vanshali Sharma, Andrea Mia Bejar, Gorkem Durak, Ulas Bagci
- Abstract summary: We present CTest-Metric, the first unified metric assessment framework with three modules determining the clinical feasibility of metrics for CT RRG. The modules test: (i) Writing Style Generalizability (WSG) via LLM-based rephrasing; (ii) Synthetic Error Injection (SEI) at graded severities; and (iii) Metrics-vs-Expert correlation (MvE) using clinician ratings on 175 "disagreement" cases. Eight widely used metrics (BLEU, ROUGE, METEOR, BERTScore-F1, F1-RadGraph, RaTEScore, GREEN Score, CRG) are studied across seven LLMs built on a CT-CLIP encoder.
- Score: 8.08950963137043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the generative AI era, where even critical medical tasks are increasingly automated, radiology report generation (RRG) continues to rely on suboptimal metrics for quality assessment. Developing domain-specific metrics has therefore been an active area of research, yet it remains challenging due to the lack of a unified, well-defined framework for assessing their robustness and applicability in clinical contexts. To address this, we present CTest-Metric, the first unified metric assessment framework with three modules that determine the clinical feasibility of metrics for CT RRG. The modules test: (i) Writing Style Generalizability (WSG) via LLM-based rephrasing; (ii) Synthetic Error Injection (SEI) at graded severities; and (iii) Metrics-vs-Expert correlation (MvE) using clinician ratings on 175 "disagreement" cases. Eight widely used metrics (BLEU, ROUGE, METEOR, BERTScore-F1, F1-RadGraph, RaTEScore, GREEN Score, CRG) are studied across seven LLMs built on a CT-CLIP encoder. Using our novel framework, we found that lexical NLG metrics are highly sensitive to stylistic variations; GREEN Score aligns best with expert judgments (Spearman ρ ≈ 0.70), while CRG shows a negative correlation; and BERTScore-F1 is the least sensitive to factual error injection. We will release the framework, code, and the allowable portion of the anonymized evaluation data (rephrased/error-injected CT reports) to facilitate reproducible benchmarking and future metric development.
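At its core, the MvE module is a rank-correlation check between automatic metric scores and clinician ratings on the same cases. Below is a minimal Python sketch of that comparison, assuming per-case score lists are already in hand; the function and variable names are illustrative, not the released CTest-Metric API.

```python
# Hedged sketch of the Metrics-vs-Expert (MvE) idea: rank-correlate a
# candidate metric's scores with clinician ratings on the same cases.
# All names are illustrative; this is NOT the released CTest-Metric code.
from scipy.stats import spearmanr

def mve_correlation(metric_scores, expert_ratings):
    """Spearman rank correlation between metric scores and expert ratings.

    metric_scores: one automatic score per generated report.
    expert_ratings: one clinician quality rating per generated report.
    """
    rho, p_value = spearmanr(metric_scores, expert_ratings)
    return rho, p_value

# Toy usage on hypothetical "disagreement" cases: a metric that tracks
# expert judgment (as GREEN Score does in the paper, rho ~ 0.70) yields a
# high positive rho; an anti-correlated metric (as reported for CRG)
# yields a negative rho.
metric = [0.91, 0.42, 0.77, 0.30, 0.65]  # hypothetical metric scores
expert = [4.5, 2.0, 4.0, 1.5, 3.0]       # hypothetical 1-5 ratings
rho, p = mve_correlation(metric, expert)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```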
Related papers
- Measuring What VLMs Don't Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation [10.15221228043609]
This paper investigates the use of decoding strategies that lead to high aggregate token-overlap scores despite template collapse. We introduce Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports. We show that deterministic decoding produces high levels of semantic erasure, while sampling generates diverse outputs but risks introducing new bias.
arXiv Detail & Related papers (2026-03-02T08:59:39Z)
- Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering [94.37535002230504]
We develop a training-free, inference-time control framework termed Semantically Decoupled Latent Steering. Our approach constructs a semantic-free intervention vector via large language model (LLM)-driven semantic decomposition. We show that our approach significantly reduces the probability of historical hallucinations.
arXiv Detail & Related papers (2026-02-27T04:49:01Z)
- ReEvalMed: Rethinking Medical Report Evaluation by Aligning Metrics with Real-World Clinical Judgment [10.958326795130112]
We propose a clinically grounded Meta-Evaluation framework. We define clinically grounded criteria spanning clinical alignment and key metric capabilities. Our framework offers guidance for building more clinically reliable evaluation methods.
arXiv Detail & Related papers (2025-09-30T21:00:47Z)
- RadReason: Radiology Report Evaluation Metric with Reasons and Sub-Scores [37.16761198532088]
We introduce RadReason, a novel evaluation framework for radiology reports. It outputs fine-grained sub-scores across six clinically defined error types. It also produces human-readable justifications that explain the rationale behind each score.
arXiv Detail & Related papers (2025-08-21T11:34:30Z)
- CRG Score: A Distribution-Aware Clinical Metric for Radiology Report Generation [6.930435788495898]
We propose the CRG Score, a metric that evaluates only clinically relevant abnormalities explicitly described in reference reports. By balancing penalties based on label distribution, it enables fairer, more robust evaluation and serves as a clinically aligned reward function.
arXiv Detail & Related papers (2025-05-22T17:02:28Z)
- GEMA-Score: Granular Explainable Multi-Agent Scoring Framework for Radiology Report Evaluation [7.838068874909676]
Granular Explainable Multi-Agent Score (GEMA-Score) conducts both objective and subjective evaluation through a large language model-based multi-agent workflow. GEMA-Score achieves the highest correlation with human expert evaluations on a public dataset.
arXiv Detail & Related papers (2025-03-07T11:42:22Z)
- Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
- RaTEScore: A Metric for Radiology Report Generation [59.37561810438641]
This paper introduces RaTEScore (Radiological Report (Text) Evaluation), a novel entity-aware metric.
RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions.
Our evaluations demonstrate that RaTEScore aligns more closely with human preference than existing metrics, validated both on established public benchmarks and our newly proposed RaTE-Eval benchmark (a toy sketch of entity-aware scoring follows this entry).
arXiv Detail & Related papers (2024-06-24T17:49:28Z)
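To give a rough feel for what "entity-aware" scoring means above, here is a toy entity-level F1 that treats a negated finding as a different entity from its affirmed form. This is an intuition-only sketch under the assumption that entities and their polarities have already been extracted; it is not the actual RaTEScore algorithm, which relies on learned medical NER and synonym-aware matching.

```python
# Toy entity-level F1 with polarity tagging -- an intuition-level
# stand-in, NOT the actual RaTEScore algorithm.

def tag_entities(entities, negated):
    # Pair each (assumed pre-extracted) entity with its polarity so that
    # "no pleural effusion" never matches "pleural effusion".
    return {(e.lower(), neg) for e, neg in zip(entities, negated)}

def entity_f1(pred, ref):
    """F1 overlap between predicted and reference (entity, polarity) sets."""
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: the prediction misses the negation on "effusion".
ref = tag_entities(["pleural effusion", "nodule"], [True, False])
pred = tag_entities(["pleural effusion", "nodule"], [False, False])
print(entity_f1(pred, ref))  # 0.5: only "nodule" matches
```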
- Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Collaborative Boundary-aware Context Encoding Networks for Error Map Prediction [65.44752447868626]
We propose collaborative boundary-aware context encoding networks, termed AEP-Net, for the error map prediction task.
Specifically, we propose a collaborative feature transformation branch for better feature fusion between images and masks, and precise localization of error regions.
AEP-Net achieves average DSCs of 0.8358 and 0.8164 on the error prediction task and shows a high Pearson correlation coefficient of 0.9873 (standard definitions of both measures are sketched after this entry).
arXiv Detail & Related papers (2020-06-25T12:42:01Z)
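For reference, the two quantities quoted for AEP-Net have standard definitions: the Dice similarity coefficient, DSC = 2|A ∩ B| / (|A| + |B|), for binary masks, and the Pearson correlation coefficient. The NumPy sketch below implements these textbook definitions; the array names are hypothetical and this is not the paper's evaluation code.

```python
# Textbook definitions of the two measures quoted for AEP-Net; a generic
# NumPy sketch, not the paper's evaluation code.
import numpy as np

def dice_coefficient(pred_mask, ref_mask):
    """DSC = 2|A & B| / (|A| + |B|) for binary masks."""
    pred, ref = pred_mask.astype(bool), ref_mask.astype(bool)
    denom = pred.sum() + ref.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(pred, ref).sum() / denom

def pearson_r(x, y):
    """Pearson correlation coefficient between two flattened arrays."""
    return float(np.corrcoef(np.ravel(x), np.ravel(y))[0, 1])

# Hypothetical 2x2 error maps for illustration.
pred = np.array([[1, 0], [1, 1]])
ref = np.array([[1, 0], [0, 1]])
print(dice_coefficient(pred, ref))                       # 0.8
print(pearson_r(pred.astype(float), ref.astype(float)))  # ~0.58
```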
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.