How precise are performance estimates for typical medical image
segmentation tasks?
- URL: http://arxiv.org/abs/2210.14677v3
- Date: Wed, 24 May 2023 12:32:40 GMT
- Title: How precise are performance estimates for typical medical image
segmentation tasks?
- Authors: Rosana El Jurdi and Olivier Colliot
- Abstract summary: In this paper, we aim to estimate the typical confidence that can be expected in medical image segmentation studies.
We extensively study precision estimation using both Gaussian assumption and bootstrapping.
Overall, our work shows that small test sets lead to wide confidence intervals.
- Score: 3.606795745041439
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An important issue in medical image processing is to be able to estimate
not only the performance of algorithms but also the precision of these performance
estimates. Reporting precision typically amounts to reporting the standard error of
the mean (SEM) or, equivalently, confidence intervals. However,
this is rarely done in medical image segmentation studies. In this paper, we
aim to estimate the typical confidence that can be expected in such
studies. To that end, we first perform experiments for Dice metric estimation
using a standard deep learning model (U-net) and a classical task from the
Medical Segmentation Decathlon. We extensively study precision estimation using
both Gaussian assumption and bootstrapping (which does not require any
assumption on the distribution). We then perform simulations for other test set
sizes and performance spreads. Overall, our work shows that small test sets
lead to wide confidence intervals (e.g. $\sim$8 points of Dice for 20 samples
with $\sigma \simeq 10$).
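As a rough illustration of the two approaches compared in the paper, the sketch below computes a Gaussian (SEM-based) and a bootstrap percentile 95% confidence interval for the mean Dice of a small test set. The per-case Dice values are simulated and the code is not the authors' implementation.

```python
# Minimal sketch (not the authors' code): Gaussian (SEM-based) and bootstrap
# 95% confidence intervals for the mean Dice score of a small test set.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-case Dice scores for a small test set (n = 20).
dice = rng.normal(loc=0.85, scale=0.10, size=20).clip(0, 1)
n = dice.size

# Gaussian assumption: CI = mean +/- 1.96 * SEM, with SEM = std / sqrt(n).
sem = dice.std(ddof=1) / np.sqrt(n)
gauss_ci = (dice.mean() - 1.96 * sem, dice.mean() + 1.96 * sem)

# Bootstrap percentile CI: resample cases with replacement, no distributional assumption.
boot_means = np.array([rng.choice(dice, size=n, replace=True).mean()
                       for _ in range(10_000)])
boot_ci = np.percentile(boot_means, [2.5, 97.5])

print(f"mean Dice = {dice.mean():.3f}")
print(f"95% CI (Gaussian):  [{gauss_ci[0]:.3f}, {gauss_ci[1]:.3f}]")
print(f"95% CI (bootstrap): [{boot_ci[0]:.3f}, {boot_ci[1]:.3f}]")
```

With n = 20 and a spread of about 10 Dice points, the Gaussian half-width is roughly 1.96 * 10 / sqrt(20), about 4.4 points, i.e. an interval some 8 to 9 points wide, in line with the order of magnitude quoted in the abstract.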
Related papers
- Confidence Intervals for Evaluation of Data Mining [3.8485822412233452]
We consider statistical inference about general performance measures used in data mining.
We study the finite sample coverage probabilities for confidence intervals.
We also propose a 'blurring correction' on the variance to improve the finite-sample performance.
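For intuition on what a finite-sample coverage probability means here, a small Monte Carlo sketch (not taken from the paper) checks how often a nominal 95% Gaussian interval actually contains the true mean at a given sample size.

```python
# Illustrative sketch (not the paper's procedure): Monte Carlo estimate of the
# finite-sample coverage probability of a nominal 95% Gaussian confidence interval.
import numpy as np

rng = np.random.default_rng(1)
true_mean, sigma, n, trials = 0.80, 0.15, 20, 20_000

covered = 0
for _ in range(trials):
    sample = rng.normal(true_mean, sigma, size=n)
    sem = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - 1.96 * sem, sample.mean() + 1.96 * sem
    covered += (lo <= true_mean <= hi)

print(f"empirical coverage: {covered / trials:.3f}  (nominal 0.95)")
```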
arXiv Detail & Related papers (2025-02-10T20:22:02Z)
- How Much is Unseen Depends Chiefly on Information About the Seen [2.169081345816618]
We find that the proportion of data points in an unknown population that belong to classes not appearing in the training data is almost entirely determined by the numbers $f_k$ of classes that appear in the training data exactly $k$ times.
We develop a genetic algorithm that, given only the sample, searches for an estimator with minimal mean-squared error (MSE).
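The paper's estimator is learned; as a simple reference point, the classical Good-Turing estimate of the unseen-class probability mass likewise depends only on the counts $f_k$, in this case on the number of singleton classes.

```python
# Hedged illustration (not the paper's learned estimator): the classical
# Good-Turing estimate of the unseen-class probability mass, which likewise
# depends only on how many classes were observed a given number of times.
from collections import Counter

def good_turing_missing_mass(sample):
    """Estimate the probability that a new draw belongs to a class never seen in `sample`."""
    counts = Counter(sample)                        # class -> number of occurrences
    f1 = sum(1 for c in counts.values() if c == 1)  # f_1: classes seen exactly once
    return f1 / len(sample)

labels = ["a", "a", "b", "c", "c", "c", "d", "e"]   # hypothetical training labels
print(good_turing_missing_mass(labels))             # 3 singletons / 8 samples = 0.375
```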
arXiv Detail & Related papers (2024-02-08T17:12:49Z)
- Confidence intervals for performance estimates in 3D medical image segmentation [0.0]
We study the typical confidence intervals in medical image segmentation.
We show that the test set size needed to achieve a given precision is often much lower than for classification tasks.
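A back-of-the-envelope way to relate precision to test-set size, using a standard Gaussian approximation rather than anything specific to this paper, is sketched below.

```python
# Back-of-the-envelope sketch (standard Gaussian approximation, not taken from
# the paper): test-set size needed so that a 95% CI on the mean metric has a
# given half-width, n ~ (1.96 * sigma / half_width) ** 2.
import math

def required_test_size(sigma, half_width, z=1.96):
    return math.ceil((z * sigma / half_width) ** 2)

# e.g. spread of 10 Dice points, target half-width of 2 points -> about 97 cases.
print(required_test_size(sigma=0.10, half_width=0.02))
```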
arXiv Detail & Related papers (2023-07-20T14:52:45Z)
- Usable Region Estimate for Assessing Practical Usability of Medical Image Segmentation Models [32.56957759180135]
We aim to quantitatively measure the practical usability of medical image segmentation models.
We first propose a measure, Correctness-Confidence Rank Correlation (CCRC), to capture how predictions' confidence estimates correlate with their correctness scores in rank.
We then propose Usable Region Estimate (URE), which simultaneously quantifies predictions' correctness and reliability of confidence assessments in one estimate.
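CCRC is defined in the paper; the toy sketch below only illustrates the underlying idea of rank-correlating per-case confidence with per-case correctness, here via a plain Spearman-style correlation on made-up values.

```python
# Toy sketch of rank-correlating confidence with correctness (a stand-in for
# CCRC; the paper's exact definition may differ).
import numpy as np

def rank(x):
    """Rank values from 1..n (no tie handling; enough for this toy example)."""
    order = np.argsort(x)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(x) + 1)
    return ranks

confidence = np.array([0.95, 0.80, 0.60, 0.90, 0.70])   # hypothetical per-case confidences
correctness = np.array([0.92, 0.85, 0.55, 0.88, 0.75])  # hypothetical per-case Dice scores

# Spearman rank correlation = Pearson correlation of the ranks.
rho = np.corrcoef(rank(confidence), rank(correctness))[0, 1]
print(f"rank correlation between confidence and correctness: {rho:.2f}")
```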
arXiv Detail & Related papers (2022-07-01T02:33:44Z)
- Impact of Pretraining Term Frequencies on Few-Shot Reasoning [51.990349528930125]
We investigate how well pretrained language models reason with terms that are less frequent in the pretraining data.
We measure the strength of this correlation for a number of GPT-based language models on various numerical deduction tasks.
Although LMs exhibit strong performance at few-shot numerical reasoning tasks, our results raise the question of how much models actually generalize beyond pretraining data.
arXiv Detail & Related papers (2022-02-15T05:43:54Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
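A rough sketch of the ATC recipe as summarized above (the paper uses specific confidence scores and calibration details that are omitted here): choose the threshold on labeled source data so that the fraction of source examples above it equals source accuracy, then report the fraction of unlabeled target examples above that threshold as the predicted target accuracy.

```python
# Rough sketch of the ATC idea as summarized above (details differ from the paper).
import numpy as np

def fit_atc_threshold(source_conf, source_correct):
    """Pick t so that the share of source examples with confidence > t equals source accuracy."""
    source_acc = source_correct.mean()
    return np.quantile(source_conf, 1.0 - source_acc)

def predict_target_accuracy(threshold, target_conf):
    """Predicted accuracy = share of unlabeled target examples above the threshold."""
    return (target_conf > threshold).mean()

rng = np.random.default_rng(2)
source_conf = rng.uniform(0.5, 1.0, size=1000)          # confidences on labeled source data
source_correct = rng.uniform(size=1000) < source_conf   # toy correctness: well calibrated
target_conf = rng.uniform(0.4, 1.0, size=1000)          # confidences on shifted, unlabeled target

t = fit_atc_threshold(source_conf, source_correct)
print(f"threshold = {t:.3f}, "
      f"predicted target accuracy = {predict_target_accuracy(t, target_conf):.3f}")
```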
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Newer is not always better: Rethinking transferability metrics, their peculiarities, stability and performance [5.650647159993238]
Fine-tuning of large pre-trained image and language models on small customized datasets has become increasingly popular.
We show that the statistical problems with covariance estimation drive the poor performance of H-score.
We propose a correction and recommend measuring correlation performance against relative accuracy in such settings.
arXiv Detail & Related papers (2021-10-13T17:24:12Z)
- Performance metrics for intervention-triggering prediction models do not reflect an expected reduction in outcomes from using the model [71.9860741092209]
Clinical researchers often select among and evaluate risk prediction models.
Standard metrics calculated from retrospective data are only related to model utility under certain assumptions.
When predictions are delivered repeatedly throughout time, the relationship between standard metrics and utility is further complicated.
arXiv Detail & Related papers (2020-06-02T16:26:49Z)
- Showing Your Work Doesn't Always Work [73.63200097493576]
"Show Your Work: Improved Reporting of Experimental Results" advocates for reporting the expected validation effectiveness of the best-tuned model.
We analytically show that their estimator is biased and uses error-prone assumptions.
We derive an unbiased alternative and bolster our claims with empirical evidence from statistical simulation.
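For context, the estimator under discussion reports the expected best validation score among k hyperparameter trials. A from-memory sketch of the commonly used with-replacement form (the one argued to be biased) is below; treat the details as an approximation rather than the paper's exact formulation.

```python
# From-memory sketch of the "expected best of k trials" estimator the paper
# critiques (with-replacement form); see the paper for the unbiased alternative.
import numpy as np

def expected_max_dev_score(scores, k):
    """Estimate E[max validation score over k random trials] from n observed trials."""
    v = np.sort(np.asarray(scores, dtype=float))
    n = len(v)
    i = np.arange(1, n + 1)
    # Probability that the max of k with-replacement draws equals the i-th order statistic.
    weights = (i / n) ** k - ((i - 1) / n) ** k
    return float((weights * v).sum())

dev_scores = [0.71, 0.74, 0.69, 0.80, 0.77, 0.73]   # hypothetical validation scores
print(expected_max_dev_score(dev_scores, k=3))
```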
arXiv Detail & Related papers (2020-04-28T17:59:01Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
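A toy, hypothetical sketch of the confidence-weighted transductive prototype update being described; the paper meta-learns the confidence weights, whereas here they are hand-set.

```python
# Toy, hypothetical sketch of a confidence-weighted transductive prototype update
# (the paper meta-learns the confidence weights; here they are hand-set).
import numpy as np

def update_prototype(prototype, query_embeddings, confidences):
    """Blend the support-set prototype with unlabeled queries, weighted by confidence."""
    w = np.asarray(confidences, dtype=float)
    weighted_queries = (w[:, None] * query_embeddings).sum(axis=0)
    return (prototype + weighted_queries) / (1.0 + w.sum())

proto = np.array([1.0, 0.0])                    # class prototype from the labeled support set
queries = np.array([[0.9, 0.1], [0.2, 0.8]])    # unlabeled query embeddings
conf = [0.9, 0.1]                               # hypothetical per-query confidences
print(update_prototype(proto, queries, conf))
```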
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
- Towards GAN Benchmarks Which Require Generalization [48.075521136623564]
We argue that, for a benchmark to require generalization, estimating the evaluation function must require a large sample from the model.
We turn to neural network divergences (NNDs) which are defined in terms of a neural network trained to distinguish between distributions.
The resulting benchmarks cannot be "won" by training set memorization, while still being perceptually correlated and computable only from samples.
arXiv Detail & Related papers (2020-01-10T20:18:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.