USE-Evaluator: Performance Metrics for Medical Image Segmentation Models
with Uncertain, Small or Empty Reference Annotations
- URL: http://arxiv.org/abs/2209.13008v4
- Date: Thu, 7 Sep 2023 16:34:17 GMT
- Title: USE-Evaluator: Performance Metrics for Medical Image Segmentation Models
with Uncertain, Small or Empty Reference Annotations
- Authors: Sophie Ostmeier, Brian Axelrod, Jeroen Bertels, Fabian Isensee,
Maarten G. Lansberg, Soren Christensen, Gregory W. Albers, Li-Jia Li, Jeremy
J. Heit
- Abstract summary: There is a mismatch between the distributions of cases and the difficulty levels of segmentation tasks in public data sets compared to clinical practice.
Common metrics fail to measure the impact of this mismatch, especially for clinical data sets.
We study how uncertain, small, or empty reference annotations influence the value of metrics for medical image segmentation.
- Score: 5.672489398972326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Performance metrics for medical image segmentation models are used to measure
the agreement between the reference annotation and the predicted segmentation.
Usually, overlap metrics such as the Dice score are used to evaluate the
performance of these models so that results are comparable. However, there is
a mismatch between the distributions of cases and the difficulty levels of
segmentation tasks in public data sets compared to clinical practice. Common
metrics fail to measure the impact of this mismatch, especially for clinical
data sets that include low-signal pathologies, a difficult segmentation task,
and uncertain, small, or empty reference annotations. This limitation may lead
machine learning practitioners to design and optimize models ineffectively.
Dimensions of clinical value include accounting for the uncertainty of
reference annotations, independence from the reference annotation volume, and
evaluation of the classification of empty reference annotations. We study how
uncertain, small, and empty reference annotations influence the value of
metrics for medical image segmentation on an in-house data set, independently
of the model. We examine the behavior of metrics on the predictions of a
standard deep learning framework in order to identify metrics with clinical
value. We compare against a public benchmark data set (BraTS 2019) with a
high-signal pathology and reference annotations that are certain, larger, and
never empty. We show machine learning practitioners how uncertain, small, or
empty reference annotations require a rethinking of evaluation and
optimization procedures. The evaluation code was released to encourage further
analysis of this topic.
https://github.com/SophieOstmeier/UncertainSmallEmpty.git
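The points about empty and small reference annotations translate directly into code. The sketch below is a minimal illustration in Python/NumPy, not the released UncertainSmallEmpty implementation: the Dice score becomes a 0/0 expression for empty references, a single voxel decides its value for very small references, and empty cases can instead be scored as a binary classification.

```python
import numpy as np


def dice(reference: np.ndarray, prediction: np.ndarray) -> float:
    """Dice similarity coefficient for binary masks.

    Returns NaN when both masks are empty (the 0/0 case), which is
    exactly where overlap metrics stop being informative.
    """
    ref = reference.astype(bool)
    pred = prediction.astype(bool)
    denom = ref.sum() + pred.sum()
    if denom == 0:
        return float("nan")  # empty reference and empty prediction
    return 2.0 * np.logical_and(ref, pred).sum() / denom


def empty_reference_classification(reference: np.ndarray, prediction: np.ndarray) -> str:
    """Score the empty/non-empty decision as a binary classification,
    which is how the abstract frames empty reference annotations."""
    ref_empty = not reference.any()
    pred_empty = not prediction.any()
    if ref_empty and pred_empty:
        return "true negative"
    if ref_empty:
        return "false positive"
    if pred_empty:
        return "false negative"
    return "true positive"


# Toy example: with a 1-voxel reference, a single misplaced voxel swings
# the Dice from 1.0 to 0.0, illustrating the sensitivity of overlap
# metrics to very small annotations.
ref = np.zeros((4, 4), dtype=bool)
ref[1, 1] = True
pred_hit = ref.copy()
pred_miss = np.zeros((4, 4), dtype=bool)
pred_miss[2, 2] = True
print(dice(ref, pred_hit), dice(ref, pred_miss))  # 1.0 0.0
print(empty_reference_classification(np.zeros((4, 4), dtype=bool), pred_miss))  # false positive
```

One practical consequence, in line with the abstract, is to report empty-reference cases separately rather than averaging undefined or degenerate Dice values into a single summary score.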
Related papers
- Every Component Counts: Rethinking the Measure of Success for Medical Semantic Segmentation in Multi-Instance Segmentation Tasks [60.80828925396154]
We present Connected-Component (CC)-Metrics, a novel semantic segmentation evaluation protocol.
We motivate this setup in the common medical scenario of semantic segmentation in a full-body PET/CT.
We show how existing semantic segmentation metrics suffer from a bias towards larger connected components.
arXiv Detail & Related papers (2024-10-24T12:26:05Z)
- Segmentation Quality and Volumetric Accuracy in Medical Imaging [0.9426448361599084]
Current medical image segmentation relies on region-based (Dice, F1-score) and boundary-based (Hausdorff distance, surface distance) metrics as the de facto standard.
While these metrics are widely used, they lack a unified interpretation, particularly regarding volume agreement.
We utilize relative volume prediction error (vpe) to directly assess the accuracy of volume predictions derived from segmentation tasks (a minimal sketch of this volume-error idea is given after this list).
arXiv Detail & Related papers (2024-04-27T00:49:39Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Ontology-aware Learning and Evaluation for Audio Tagging [56.59107110017436]
The mean average precision (mAP) metric treats different kinds of sound as independent classes without considering their relations.
Ontology-aware mean average precision (OmAP) addresses the weaknesses of mAP by utilizing the AudioSet ontology information during the evaluation.
We conduct human evaluations and demonstrate that OmAP is more consistent with human perception than mAP.
arXiv Detail & Related papers (2022-11-22T11:35:14Z)
- Rethinking Generalization: The Impact of Annotation Style on Medical Image Segmentation [9.056814157662965]
We show that modeling annotation biases, rather than ignoring them, poses a promising way of accounting for differences in annotation style across datasets.
Next, we present an image-conditioning approach to model annotation styles that correlate with specific image features, potentially enabling detection biases to be more easily identified.
arXiv Detail & Related papers (2022-10-31T15:28:49Z)
- CEREAL: Few-Sample Clustering Evaluation [4.569028973407756]
We focus on the underexplored problem of estimating clustering quality with limited labels.
We introduce CEREAL, a comprehensive framework for few-sample clustering evaluation.
Our results show that CEREAL reduces the area under the absolute error curve by up to 57% compared to the best sampling baseline.
arXiv Detail & Related papers (2022-09-30T19:52:41Z)
- Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors [105.12462629663757]
In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model.
We compare performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models.
arXiv Detail & Related papers (2022-05-25T15:26:48Z)
- Label Cleaning Multiple Instance Learning: Refining Coarse Annotations on Single Whole-Slide Images [83.7047542725469]
Annotating cancerous regions in whole-slide images (WSIs) of pathology samples plays a critical role in clinical diagnosis, biomedical research, and the development of machine learning algorithms.
We present a method, named Label Cleaning Multiple Instance Learning (LC-MIL), to refine coarse annotations on a single WSI without the need of external training data.
Our experiments on a heterogeneous WSI set with breast cancer lymph node metastasis, liver cancer, and colorectal cancer samples show that LC-MIL significantly refines the coarse annotations, outperforming the state-of-the-art alternatives, even while learning from a single slide.
arXiv Detail & Related papers (2021-09-22T15:06:06Z)
- Disentangling Human Error from the Ground Truth in Segmentation of Medical Images [12.009437407687987]
We present a method for jointly learning, from purely noisy observations alone, the reliability of individual annotators and the true segmentation label distributions.
We demonstrate the utility of the method on three public medical imaging segmentation datasets with simulated (when necessary) and real diverse annotations.
arXiv Detail & Related papers (2020-07-31T11:03:12Z)
- Classifier uncertainty: evidence, potential impact, and probabilistic treatment [0.0]
We present an approach to quantify the uncertainty of classification performance metrics based on a probability model of the confusion matrix.
We show that uncertainties can be surprisingly large and limit performance evaluation.
arXiv Detail & Related papers (2020-06-19T12:49:19Z)
- Performance metrics for intervention-triggering prediction models do not reflect an expected reduction in outcomes from using the model [71.9860741092209]
Clinical researchers often select among and evaluate risk prediction models.
Standard metrics calculated from retrospective data are only related to model utility under certain assumptions.
When predictions are delivered repeatedly throughout time, the relationship between standard metrics and utility is further complicated.
arXiv Detail & Related papers (2020-06-02T16:26:49Z)
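The "Segmentation Quality and Volumetric Accuracy in Medical Imaging" entry above relies on a relative volume prediction error (vpe). Its summary does not spell out the formula, so the sketch below assumes the common definition (predicted volume - reference volume) / reference volume; the function and parameter names are illustrative and not taken from that paper.

```python
import numpy as np


def volume_prediction_error(reference: np.ndarray, prediction: np.ndarray,
                            voxel_volume_mm3: float = 1.0) -> float:
    """Relative volume prediction error (assumed definition):
    (predicted volume - reference volume) / reference volume.

    Returns NaN for an empty reference, where a relative error is undefined.
    """
    ref_vol = reference.astype(bool).sum() * voxel_volume_mm3
    pred_vol = prediction.astype(bool).sum() * voxel_volume_mm3
    if ref_vol == 0:
        return float("nan")
    return (pred_vol - ref_vol) / ref_vol


# Toy example: a 12-voxel prediction against a 10-voxel reference gives a
# +20% relative volume error, regardless of how the two masks overlap.
ref = np.zeros(100, dtype=bool); ref[:10] = True
pred = np.zeros(100, dtype=bool); pred[5:17] = True
print(volume_prediction_error(ref, pred))  # 0.2
```

Note that such a volume error is blind to where the segmented voxels lie, which is why it is usually reported alongside overlap or boundary metrics rather than instead of them.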
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.