CRG Score: A Distribution-Aware Clinical Metric for Radiology Report Generation
- URL: http://arxiv.org/abs/2505.17167v1
- Date: Thu, 22 May 2025 17:02:28 GMT
- Title: CRG Score: A Distribution-Aware Clinical Metric for Radiology Report Generation
- Authors: Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Hadrien Reynaud, Bernhard Kainz, Bjoern Menze,
- Abstract summary: We propose the CRG Score, a metric that evaluates only clinically relevant abnormalities explicitly described in reference reports.<n>By balancing penalties based on label distribution, it enables fairer, more robust evaluation and serves as a clinically aligned reward function.
- Score: 6.930435788495898
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating long-context radiology report generation is challenging. NLG metrics fail to capture clinical correctness, while LLM-based metrics often lack generalizability. Clinical accuracy metrics are more relevant but are sensitive to class imbalance, frequently favoring trivial predictions. We propose the CRG Score, a distribution-aware and adaptable metric that evaluates only clinically relevant abnormalities explicitly described in reference reports. CRG supports both binary and structured labels (e.g., type, location) and can be paired with any LLM for feature extraction. By balancing penalties based on label distribution, it enables fairer, more robust evaluation and serves as a clinically aligned reward function.
Related papers
- S-RRG-Bench: Structured Radiology Report Generation with Fine-Grained Evaluation Framework [39.542375803362965]
Radiology report generation (RRG) for diagnostic images, such as chest X-rays, plays a pivotal role in both clinical practice and AI.<n>Traditional free-text reports suffer from redundancy and inconsistent language, complicating the extraction of critical clinical details.<n>We present a novel approach to S-RRG that includes dataset construction, model training, and the introduction of a new evaluation framework.
arXiv Detail & Related papers (2025-08-04T05:49:41Z) - Aligning Evaluation with Clinical Priorities: Calibration, Label Shift, and Error Costs [3.299877799532224]
We propose a principled yet practical evaluation framework for selecting calibrated thresholded classifiers.<n>We derive an adjusted variant of cross-entropy (log score) that averages cost-weighted performance over clinically relevant ranges of class balance.<n>The resulting evaluation is simple to apply, sensitive to clinical deployment conditions, and designed to prioritize models that are both calibrated and robust to real-world variations.
arXiv Detail & Related papers (2025-06-17T14:01:39Z) - CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports [4.477840500181267]
We introduce CaseReportBench, an expert-annotated dataset for dense information extraction of case reports, focusing on IEMs.<n>We assess various models and prompting strategies, introducing novel approaches such as category-specific prompting and subheading-filtered data integration.<n>Our clinician evaluations show that LLMs can extract clinically relevant details from case reports, supporting rare disease diagnosis and management.
arXiv Detail & Related papers (2025-05-22T20:21:32Z) - CLEAR: A Clinically-Grounded Tabular Framework for Radiology Report Evaluation [19.416198842242856]
We introduce a Clinically-grounded framework with Expert-curated labels and Attribute-level comparison for Radiology report evaluation (CLEAR)<n>CLEAR examines whether a report can accurately identify the presence or absence of medical conditions.<n>To measure the clinical alignment of CLEAR, we collaborate with five board-certified radiologists to develop CLEAR-Bench, a dataset of 100 chest X-ray reports from MIMIC-CXR.
arXiv Detail & Related papers (2025-05-22T07:32:12Z) - GEMA-Score: Granular Explainable Multi-Agent Score for Radiology Report Evaluation [8.071354543390274]
We propose a Granular Explainable Multi-Agent Score (GEMA-Score) in this paper.<n>GEMA-Score conducts both objective quantification and subjective evaluation through a large language model-based multi-agent workflow.<n>Experiments validate that GEMA-Score achieves the highest correlation with human expert evaluations on a public dataset.
arXiv Detail & Related papers (2025-03-07T11:42:22Z) - Quality assurance of organs-at-risk delineation in radiotherapy [7.698565355235687]
The delineation of tumor target and organs-at-risk is critical in the radiotherapy treatment planning.
The quality assurance of the automatic segmentation is still an unmet need in clinical practice.
Our proposed model, which introduces residual network and attention mechanism in the one-class classification framework, was able to detect the various types of OAR contour errors with high accuracy.
arXiv Detail & Related papers (2024-05-20T02:32:46Z) - Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites:
A Federated Learning Approach with Noise-Resilient Training [75.40980802817349]
Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area.
We introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions.
We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites.
arXiv Detail & Related papers (2023-08-31T00:36:10Z) - Rapid Adaptation in Online Continual Learning: Are We Evaluating It
Right? [135.71855998537347]
We revisit the common practice of evaluating adaptation of Online Continual Learning (OCL) algorithms through the metric of online accuracy.
We show that this metric is unreliable, as even vacuous blind classifiers can achieve unrealistically high online accuracy.
Existing OCL algorithms can also achieve high online accuracy, but perform poorly in retaining useful information.
arXiv Detail & Related papers (2023-05-16T08:29:33Z) - Learning to diagnose cirrhosis from radiological and histological labels
with joint self and weakly-supervised pretraining strategies [62.840338941861134]
We propose to leverage transfer learning from large datasets annotated by radiologists, to predict the histological score available on a small annex dataset.
We compare different pretraining methods, namely weakly-supervised and self-supervised ones, to improve the prediction of the cirrhosis.
This method outperforms the baseline classification of the METAVIR score, reaching an AUC of 0.84 and a balanced accuracy of 0.75.
arXiv Detail & Related papers (2023-02-16T17:06:23Z) - Towards Reliable Medical Image Segmentation by utilizing Evidential Calibrated Uncertainty [52.03490691733464]
We introduce DEviS, an easily implementable foundational model that seamlessly integrates into various medical image segmentation networks.
By leveraging subjective logic theory, we explicitly model probability and uncertainty for the problem of medical image segmentation.
DeviS incorporates an uncertainty-aware filtering module, which utilizes the metric of uncertainty-calibrated error to filter reliable data.
arXiv Detail & Related papers (2023-01-01T05:02:46Z) - A Benchmark for Weakly Semi-Supervised Abnormality Localization in Chest
X-Rays [42.1336336144291]
We propose to train the CXR abnormality localization framework via a weakly semi-supervised strategy, termed Point Beyond Class.
The core idea behind our PBC is to learn a robust and accurate mapping from the point annotations to the bounding boxes.
Experimental results on RSNA and VinDr-CXR datasets justify the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-09-05T14:36:07Z) - Label Cleaning Multiple Instance Learning: Refining Coarse Annotations
on Single Whole-Slide Images [83.7047542725469]
Annotating cancerous regions in whole-slide images (WSIs) of pathology samples plays a critical role in clinical diagnosis, biomedical research, and machine learning algorithms development.
We present a method, named Label Cleaning Multiple Instance Learning (LC-MIL), to refine coarse annotations on a single WSI without the need of external training data.
Our experiments on a heterogeneous WSI set with breast cancer lymph node metastasis, liver cancer, and colorectal cancer samples show that LC-MIL significantly refines the coarse annotations, outperforming the state-of-the-art alternatives, even while learning from a single slide.
arXiv Detail & Related papers (2021-09-22T15:06:06Z) - Inheritance-guided Hierarchical Assignment for Clinical Automatic
Diagnosis [50.15205065710629]
Clinical diagnosis, which aims to assign diagnosis codes for a patient based on the clinical note, plays an essential role in clinical decision-making.
We propose a novel framework to combine the inheritance-guided hierarchical assignment and co-occurrence graph propagation for clinical automatic diagnosis.
arXiv Detail & Related papers (2021-01-27T13:16:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.