Related papers: CRG Score: A Distribution-Aware Clinical Metric for Radiology Report Generation

Related papers

Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification [60.18369393468405]
Existing verifiers usually underperform owing to a lack of domain knowledge and limited calibration. GLEAN compiles expert-curated protocols into trajectory-informed, well-calibrated correctness signals. We empirically validate GLEAN with agentic clinical diagnosis across three diseases from the MIMIC-IV dataset.
arXiv Detail & Related papers (2026-03-03T09:36:43Z)
AgentScore: Autoformulation of Deployable Clinical Scoring Systems [45.88028371034407]
We introduce AgentScore, which performs semantically guided optimization in unit-weighted clinical checklists. AgentScore outperforms existing score-generation methods and achieves AUC comparable to more flexible interpretable models. On two additional externally validated tasks, AgentScore achieves higher discrimination than established guideline-based scores.
arXiv Detail & Related papers (2026-01-29T21:11:06Z)
CTest-Metric: A Unified Framework to Assess Clinical Validity of Metrics for CT Report Generation [8.08950963137043]
We present CTest-Metric, a first unified metric assessment framework with three modules determining the clinical feasibility of metrics for CT RRG. The modules test: (i) Writing Style Generalizability (WSG) via LLM-based rephrasing; (ii) Synthetic Error Injection (SEI) at graded severities; and (iii) Metrics-vs-Expert correlation (MvE) using clinician ratings on 175 "disagreement" cases. Eight widely used metrics (BLEU, ROUGE, METEOR, BERTScore-F1, F1-RadGraph, Ra
arXiv Detail & Related papers (2026-01-16T18:09:19Z)
Calibratable Disambiguation Loss for Multi-Instance Partial-Label Learning [53.9713678229744]
Multi-instance partial-label learning (MIPL) is a weakly supervised framework that addresses the challenges of inexact supervision in both instance and label spaces. Existing MIPL approaches often suffer from poor calibration, undermining reliability. We propose a plug-and-play calibratable disambiguation loss (CDL) that simultaneously improves classification accuracy and calibration performance.
arXiv Detail & Related papers (2025-12-19T16:58:31Z)
MIMIC-SR-ICD11: A Dataset for Narrative-Based Diagnosis [14.505360834752866]
We introduce MIMIC-SR-ICD11, a large English diagnostic dataset built from EHR discharge notes and aligned to WHO ICD-11 terminology. We present LL-Rank, a likelihood-based re-ranking framework that computes a length-normalized joint likelihood of each label given the clinical report context.
arXiv Detail & Related papers (2025-11-07T18:55:22Z)
Conformal Lesion Segmentation for 3D Medical Images [82.92159832699583]
We propose a risk-constrained framework that calibrates data-driven thresholds via conformalization to ensure the test-time FNR remains below a target tolerance. We validate the statistical soundness and predictive performance of CLS on six 3D-LS datasets across five backbone models, and conclude with actionable insights for deploying risk-aware segmentation in clinical practice.
arXiv Detail & Related papers (2025-10-19T08:21:00Z)
Toward Reliable Clinical Coding with Language Models: Verification and Lightweight Adaptation [3.952186976672079]
We show that lightweight interventions, including prompt engineering and small-scale fine-tuning, can improve accuracy without the computational overhead of search-based methods. To address hierarchically near-miss errors, we introduce clinical code verification as both a standalone task and a pipeline component.
arXiv Detail & Related papers (2025-10-08T23:50:58Z)
S-RRG-Bench: Structured Radiology Report Generation with Fine-Grained Evaluation Framework [39.542375803362965]
Radiology report generation (RRG) for diagnostic images, such as chest X-rays, plays a pivotal role in both clinical practice and AI. Traditional free-text reports suffer from redundancy and inconsistent language, complicating the extraction of critical clinical details. We present a novel approach to S-RRG that includes dataset construction, model training, and the introduction of a new evaluation framework.
arXiv Detail & Related papers (2025-08-04T05:49:41Z)
Aligning Evaluation with Clinical Priorities: Calibration, Label Shift, and Error Costs [3.299877799532224]
We propose a principled yet practical evaluation framework for selecting calibrated thresholded classifiers. We derive an adjusted variant of cross-entropy (log score) that averages cost-weighted performance over clinically relevant ranges of class balance. The resulting evaluation is simple to apply, sensitive to clinical deployment conditions, and designed to prioritize models that are both calibrated and robust to real-world variations.
arXiv Detail & Related papers (2025-06-17T14:01:39Z)
CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports [4.477840500181267]
We introduce CaseReportBench, an expert-annotated dataset for dense information extraction of case reports, focusing on IEMs. We assess various models and prompting strategies, introducing novel approaches such as category-specific prompting and subheading-filtered data integration. Our clinician evaluations show that LLMs can extract clinically relevant details from case reports, supporting rare disease diagnosis and management.
arXiv Detail & Related papers (2025-05-22T20:21:32Z)
CLEAR: A Clinically-Grounded Tabular Framework for Radiology Report Evaluation [19.416198842242856]
We introduce a Clinically-grounded framework with Expert-curated labels and Attribute-level comparison for Radiology report evaluation (CLEAR). CLEAR examines whether a report can accurately identify the presence or absence of medical conditions. To measure the clinical alignment of CLEAR, we collaborate with five board-certified radiologists to develop CLEAR-Bench, a dataset of 100 chest X-ray reports from MIMIC-CXR.
arXiv Detail & Related papers (2025-05-22T07:32:12Z)
GEMA-Score: Granular Explainable Multi-Agent Score for Radiology Report Evaluation [8.071354543390274]
We propose a Granular Explainable Multi-Agent Score (GEMA-Score) in this paper. GEMA-Score conducts both objective quantification and subjective evaluation through a large language model-based multi-agent workflow. Experiments validate that GEMA-Score achieves the highest correlation with human expert evaluations on a public dataset.
arXiv Detail & Related papers (2025-03-07T11:42:22Z)
Quality assurance of organs-at-risk delineation in radiotherapy [7.698565355235687]
The delineation of the tumor target and organs-at-risk (OARs) is critical in radiotherapy treatment planning. Quality assurance of automatic segmentation is still an unmet need in clinical practice. Our proposed model, which introduces a residual network and an attention mechanism into the one-class classification framework, was able to detect various types of OAR contour errors with high accuracy.
arXiv Detail & Related papers (2024-05-20T02:32:46Z)
Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites: A Federated Learning Approach with Noise-Resilient Training [75.40980802817349]
Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area. We introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions. We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites.
arXiv Detail & Related papers (2023-08-31T00:36:10Z)
Rapid Adaptation in Online Continual Learning: Are We Evaluating It Right? [135.71855998537347]
We revisit the common practice of evaluating adaptation of Online Continual Learning (OCL) algorithms through the metric of online accuracy. We show that this metric is unreliable, as even vacuous blind classifiers can achieve unrealistically high online accuracy. Existing OCL algorithms can also achieve high online accuracy, but perform poorly in retaining useful information.
arXiv Detail & Related papers (2023-05-16T08:29:33Z)
Learning to diagnose cirrhosis from radiological and histological labels with joint self and weakly-supervised pretraining strategies [62.840338941861134]
We propose to leverage transfer learning from large datasets annotated by radiologists to predict the histological score available on a small annex dataset. We compare different pretraining methods, namely weakly-supervised and self-supervised ones, to improve the prediction of cirrhosis. This method outperforms the baseline classification of the METAVIR score, reaching an AUC of 0.84 and a balanced accuracy of 0.75.
arXiv Detail & Related papers (2023-02-16T17:06:23Z)
Towards Reliable Medical Image Segmentation by utilizing Evidential Calibrated Uncertainty [52.03490691733464]
We introduce DEviS, an easily implementable foundational model that seamlessly integrates into various medical image segmentation networks. By leveraging subjective logic theory, we explicitly model probability and uncertainty for the problem of medical image segmentation. DEviS incorporates an uncertainty-aware filtering module, which utilizes the metric of uncertainty-calibrated error to filter reliable data.
arXiv Detail & Related papers (2023-01-01T05:02:46Z)
A Benchmark for Weakly Semi-Supervised Abnormality Localization in Chest X-Rays [42.1336336144291]
We propose to train the CXR abnormality localization framework via a weakly semi-supervised strategy, termed Point Beyond Class (PBC). The core idea behind our PBC is to learn a robust and accurate mapping from the point annotations to the bounding boxes. Experimental results on RSNA and VinDr-CXR datasets justify the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-09-05T14:36:07Z)
Label Cleaning Multiple Instance Learning: Refining Coarse Annotations on Single Whole-Slide Images [83.7047542725469]
Annotating cancerous regions in whole-slide images (WSIs) of pathology samples plays a critical role in clinical diagnosis, biomedical research, and machine learning algorithm development. We present a method, named Label Cleaning Multiple Instance Learning (LC-MIL), to refine coarse annotations on a single WSI without the need for external training data. Our experiments on a heterogeneous WSI set with breast cancer lymph node metastasis, liver cancer, and colorectal cancer samples show that LC-MIL significantly refines the coarse annotations, outperforming the state-of-the-art alternatives, even while learning from a single slide.
arXiv Detail & Related papers (2021-09-22T15:06:06Z)
Inheritance-guided Hierarchical Assignment for Clinical Automatic Diagnosis [50.15205065710629]
Clinical diagnosis, which aims to assign diagnosis codes for a patient based on the clinical note, plays an essential role in clinical decision-making. We propose a novel framework to combine the inheritance-guided hierarchical assignment and co-occurrence graph propagation for clinical automatic diagnosis.
arXiv Detail & Related papers (2021-01-27T13:16:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.