USE-Evaluator: Performance Metrics for Medical Image Segmentation Models
with Uncertain, Small or Empty Reference Annotations
- URL: http://arxiv.org/abs/2209.13008v4
- Date: Thu, 7 Sep 2023 16:34:17 GMT
- Title: USE-Evaluator: Performance Metrics for Medical Image Segmentation Models
with Uncertain, Small or Empty Reference Annotations
- Authors: Sophie Ostmeier, Brian Axelrod, Jeroen Bertels, Fabian Isensee,
Maarten G. Lansberg, Soren Christensen, Gregory W. Albers, Li-Jia Li, Jeremy
J. Heit
- Abstract summary: There is a mismatch between the distributions of cases and the difficulty levels of segmentation tasks in public data sets compared to clinical practice.
Common metrics fail to measure the impact of this mismatch, especially for clinical data sets.
We study how uncertain, small, or empty reference annotations influence the value of metrics for medical image segmentation.
- Score: 5.672489398972326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Performance metrics for medical image segmentation models are used to measure
the agreement between the reference annotation and the predicted segmentation.
Usually, overlap metrics such as the Dice score are used to evaluate the
performance of these models so that results are comparable. However, there is
a mismatch between the distributions of cases and the difficulty levels of
segmentation tasks in public data sets compared to clinical practice. Common
metrics fail to measure the impact of this mismatch, especially for clinical
data sets that include low-signal pathologies, a difficult segmentation task,
and uncertain, small, or empty reference annotations. This limitation may lead
machine learning practitioners to design and optimize models ineffectively.
Dimensions of clinical value include accounting for the uncertainty of
reference annotations, independence from the reference annotation volume, and
evaluation of the classification of empty reference annotations. We study how
uncertain, small, and empty reference annotations influence the value of
metrics for medical image segmentation on an in-house data set, independently
of the model. We examine the behavior of metrics on the predictions of a
standard deep learning framework in order to identify metrics with clinical
value. We compare against a public benchmark data set (BraTS 2019) with a
high-signal pathology and reference annotations that are certain, larger, and
never empty. We show machine learning practitioners how uncertain, small, or
empty reference annotations require a rethinking of evaluation and
optimization procedures. The evaluation code was released to encourage further
analysis of this topic.
https://github.com/SophieOstmeier/UncertainSmallEmpty.git
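The points about empty and small reference annotations translate directly into code. The sketch below is a minimal illustration in Python/NumPy, not the released UncertainSmallEmpty implementation: the Dice score becomes a 0/0 expression for empty references, a single voxel decides its value for very small references, and empty cases can instead be scored as a binary classification.

```python
import numpy as np


def dice(reference: np.ndarray, prediction: np.ndarray) -> float:
    """Dice similarity coefficient for binary masks.

    Returns NaN when both masks are empty (the 0/0 case), which is
    exactly where overlap metrics stop being informative.
    """
    ref = reference.astype(bool)
    pred = prediction.astype(bool)
    denom = ref.sum() + pred.sum()
    if denom == 0:
        return float("nan")  # empty reference and empty prediction
    return 2.0 * np.logical_and(ref, pred).sum() / denom


def empty_reference_classification(reference: np.ndarray, prediction: np.ndarray) -> str:
    """Score the empty/non-empty decision as a binary classification,
    which is how the abstract frames empty reference annotations."""
    ref_empty = not reference.any()
    pred_empty = not prediction.any()
    if ref_empty and pred_empty:
        return "true negative"
    if ref_empty:
        return "false positive"
    if pred_empty:
        return "false negative"
    return "true positive"


# Toy example: with a 1-voxel reference, a single misplaced voxel swings
# the Dice from 1.0 to 0.0, illustrating the sensitivity of overlap
# metrics to very small annotations.
ref = np.zeros((4, 4), dtype=bool)
ref[1, 1] = True
pred_hit = ref.copy()
pred_miss = np.zeros((4, 4), dtype=bool)
pred_miss[2, 2] = True
print(dice(ref, pred_hit), dice(ref, pred_miss))  # 1.0 0.0
print(empty_reference_classification(np.zeros((4, 4), dtype=bool), pred_miss))  # false positive
```

One practical consequence, in line with the abstract, is to report empty-reference cases separately rather than averaging undefined or degenerate Dice values into a single summary score.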
Related papers
- Every Component Counts: Rethinking the Measure of Success for Medical Semantic Segmentation in Multi-Instance Segmentation Tasks [60.80828925396154]
We present Connected-Component (CC)-Metrics, a novel semantic segmentation evaluation protocol.
We motivate this setup in the common medical scenario of semantic segmentation in a full-body PET/CT.
We show how existing semantic segmentation metrics suffer from a bias towards larger connected components.
arXiv Detail & Related papers (2024-10-24T12:26:05Z)
- Segmentation Quality and Volumetric Accuracy in Medical Imaging [0.9426448361599084]
Current medical image segmentation relies on region-based (Dice, F1-score) and boundary-based (Hausdorff distance, surface distance) metrics as the de facto standard.
While these metrics are widely used, they lack a unified interpretation, particularly regarding volume agreement.
We utilize relative volume prediction error (vpe) to directly assess the accuracy of volume predictions derived from segmentation tasks (a minimal sketch of this volume-error idea is given after this list).
arXiv Detail & Related papers (2024-04-27T00:49:39Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Ontology-aware Learning and Evaluation for Audio Tagging [56.59107110017436]
The mean average precision (mAP) metric treats different kinds of sound as independent classes without considering their relations.
Ontology-aware mean average precision (OmAP) addresses the weaknesses of mAP by utilizing the AudioSet ontology information during the evaluation.
We conduct human evaluations and demonstrate that OmAP is more consistent with human perception than mAP.
arXiv Detail & Related papers (2022-11-22T11:35:14Z)
- Rethinking Generalization: The Impact of Annotation Style on Medical Image Segmentation [9.056814157662965]
We show that modeling annotation biases, rather than ignoring them, poses a promising way of accounting for differences in annotation style across datasets.
Next, we present an image-conditioning approach to model annotation styles that correlate with specific image features, potentially enabling detection biases to be more easily identified.
arXiv Detail & Related papers (2022-10-31T15:28:49Z)
- CEREAL: Few-Sample Clustering Evaluation [4.569028973407756]
We focus on the underexplored problem of estimating clustering quality with limited labels.
We introduce CEREAL, a comprehensive framework for few-sample clustering evaluation.
Our results show that CEREAL reduces the area under the absolute error curve by up to 57% compared to the best sampling baseline.
arXiv Detail & Related papers (2022-09-30T19:52:41Z)
- Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors [105.12462629663757]
In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model.
We compare performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models.
arXiv Detail & Related papers (2022-05-25T15:26:48Z)
- Label Cleaning Multiple Instance Learning: Refining Coarse Annotations on Single Whole-Slide Images [83.7047542725469]
Annotating cancerous regions in whole-slide images (WSIs) of pathology samples plays a critical role in clinical diagnosis, biomedical research, and the development of machine learning algorithms.
We present a method, named Label Cleaning Multiple Instance Learning (LC-MIL), to refine coarse annotations on a single WSI without the need of external training data.
Our experiments on a heterogeneous WSI set with breast cancer lymph node metastasis, liver cancer, and colorectal cancer samples show that LC-MIL significantly refines the coarse annotations, outperforming the state-of-the-art alternatives, even while learning from a single slide.
arXiv Detail & Related papers (2021-09-22T15:06:06Z)
- Disentangling Human Error from the Ground Truth in Segmentation of Medical Images [12.009437407687987]
We present a method for jointly learning, from purely noisy observations alone, the reliability of individual annotators and the true segmentation label distributions.
We demonstrate the utility of the method on three public medical imaging segmentation datasets with simulated (when necessary) and real diverse annotations.
arXiv Detail & Related papers (2020-07-31T11:03:12Z)
- Classifier uncertainty: evidence, potential impact, and probabilistic treatment [0.0]
We present an approach to quantify the uncertainty of classification performance metrics based on a probability model of the confusion matrix.
We show that uncertainties can be surprisingly large and limit performance evaluation.
arXiv Detail & Related papers (2020-06-19T12:49:19Z)
- Performance metrics for intervention-triggering prediction models do not reflect an expected reduction in outcomes from using the model [71.9860741092209]
Clinical researchers often select among and evaluate risk prediction models.
Standard metrics calculated from retrospective data are only related to model utility under certain assumptions.
When predictions are delivered repeatedly throughout time, the relationship between standard metrics and utility is further complicated.
arXiv Detail & Related papers (2020-06-02T16:26:49Z)
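The "Segmentation Quality and Volumetric Accuracy in Medical Imaging" entry above relies on a relative volume prediction error (vpe). Its summary does not spell out the formula, so the sketch below assumes the common definition (predicted volume - reference volume) / reference volume; the function and parameter names are illustrative and not taken from that paper.

```python
import numpy as np


def volume_prediction_error(reference: np.ndarray, prediction: np.ndarray,
                            voxel_volume_mm3: float = 1.0) -> float:
    """Relative volume prediction error (assumed definition):
    (predicted volume - reference volume) / reference volume.

    Returns NaN for an empty reference, where a relative error is undefined.
    """
    ref_vol = reference.astype(bool).sum() * voxel_volume_mm3
    pred_vol = prediction.astype(bool).sum() * voxel_volume_mm3
    if ref_vol == 0:
        return float("nan")
    return (pred_vol - ref_vol) / ref_vol


# Toy example: a 12-voxel prediction against a 10-voxel reference gives a
# +20% relative volume error, regardless of how the two masks overlap.
ref = np.zeros(100, dtype=bool); ref[:10] = True
pred = np.zeros(100, dtype=bool); pred[5:17] = True
print(volume_prediction_error(ref, pred))  # 0.2
```

Note that such a volume error is blind to where the segmented voxels lie, which is why it is usually reported alongside overlap or boundary metrics rather than instead of them.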
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.