Performance Estimation for Supervised Medical Image Segmentation Models on Unlabeled Data Using UniverSeg
- URL: http://arxiv.org/abs/2504.15667v1
- Date: Tue, 22 Apr 2025 07:42:48 GMT
- Title: Performance Estimation for Supervised Medical Image Segmentation Models on Unlabeled Data Using UniverSeg
- Authors: Jingchen Zou, Jianqiang Li, Gabriel Jimenez, Qing Zhao, Daniel Racoceanu, Matias Cosarinsky, Enzo Ferrante, Guanghui Fu
- Abstract summary: We propose a framework for estimating segmentation models' performance on unlabeled data. The Segmentation Performance Evaluator (SPE) framework integrates seamlessly into any model training process.
- Score: 8.893478932454082
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The performance of medical image segmentation models is usually evaluated using metrics like the Dice score and Hausdorff distance, which compare predicted masks to ground truth annotations. However, when applying the model to unseen data, such as in clinical settings, it is often impractical to annotate all the data, making the model's performance uncertain. To address this challenge, we propose the Segmentation Performance Evaluator (SPE), a framework for estimating segmentation models' performance on unlabeled data. This framework is adaptable to various evaluation metrics and model architectures. Experiments on six publicly available datasets across six evaluation metrics, including pixel-based metrics such as the Dice score and distance-based metrics like HD95, demonstrated the versatility and effectiveness of our approach, achieving a high correlation (0.956$\pm$0.046) and low MAE (0.025$\pm$0.019) compared with the real Dice score on the independent test set. These results highlight its ability to reliably estimate model performance without requiring annotations. The SPE framework integrates seamlessly into any model training process without adding training overhead, enabling performance estimation and facilitating the real-world application of medical image segmentation algorithms. The source code is publicly available.
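The paper's full SPE procedure is not reproduced in this summary, so the following is only a minimal sketch of the underlying idea: score a supervised model on unlabeled images by measuring its agreement with a label-free reference segmenter (e.g. a frozen UniverSeg-style in-context model) using the same metrics that would normally be computed against ground truth. All names are hypothetical; `model` and `reference_segmenter` are assumed to be callables mapping an image to a probability mask.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt


def dice_score(pred, ref):
    """Dice overlap between two binary masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    denom = pred.sum() + ref.sum()
    return 2.0 * np.logical_and(pred, ref).sum() / denom if denom > 0 else 1.0


def hd95(pred, ref):
    """95th-percentile symmetric Hausdorff distance between two binary masks (in voxels)."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    if not pred.any() or not ref.any():
        return np.inf
    d_to_ref = distance_transform_edt(~ref)    # distance of every voxel to the nearest ref foreground voxel
    d_to_pred = distance_transform_edt(~pred)  # distance of every voxel to the nearest pred foreground voxel
    return float(np.percentile(np.concatenate([d_to_ref[pred], d_to_pred[ref]]), 95))


def estimate_performance(model, reference_segmenter, unlabeled_images, metric=dice_score):
    """Proxy estimate: agreement between the supervised model and a label-free reference segmenter."""
    scores = []
    for img in unlabeled_images:
        pred = model(img) > 0.5                # binarise the supervised model's probability map
        ref = reference_segmenter(img) > 0.5   # pseudo-label, e.g. from a frozen UniverSeg-style model
        scores.append(metric(pred, ref))
    return float(np.mean(scores)), float(np.std(scores))
```

Whether this simple agreement proxy reaches the correlation the paper reports for SPE is not claimed here; the sketch only illustrates the label-free evaluation setting.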
Related papers
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z) - Towards Ground-truth-free Evaluation of Any Segmentation in Medical Images [22.36128130052757]
We build a ground-truth-free evaluation model to assess the quality of segmentations generated by the Segment Anything Model (SAM) and its variants in medical imaging.
This evaluation model estimates segmentation quality scores by analyzing the coherence and consistency between the input images and their corresponding segmentation predictions.
arXiv Detail & Related papers (2024-09-23T10:12:08Z) - Estimating Model Performance Under Covariate Shift Without Labels [9.804680621164168]
We introduce Probabilistic Adaptive Performance Estimation (PAPE) for evaluating classification models on unlabeled data.
PAPE provides more accurate performance estimates than other evaluated methodologies.
arXiv Detail & Related papers (2024-01-16T13:29:30Z) - Learning Evaluation Models from Large Language Models for Sequence Generation [61.8421748792555]
We propose a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development.
Experimental results on the SummEval benchmark demonstrate that CSEM can effectively train an evaluation model without human-labeled data.
arXiv Detail & Related papers (2023-08-08T16:41:16Z) - A quality assurance framework for real-time monitoring of deep learning segmentation models in radiotherapy [3.5752677591512487]
This work uses cardiac substructure segmentation as an example task to establish a quality assurance framework.
A benchmark dataset consisting of Computed Tomography (CT) images along with manual cardiac delineations of 241 patients was collected.
An image domain shift detector was developed by utilizing a trained Denoising autoencoder (DAE) and two hand-engineered features.
A regression model was trained to predict per-patient segmentation accuracy, measured by the Dice similarity coefficient (DSC).
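As a rough illustration of that pipeline (not the authors' exact implementation), the sketch below combines a denoising-autoencoder reconstruction error with two hand-engineered mask features and fits a regressor that predicts per-patient DSC; the feature choices, the `dae` callable, and the use of scikit-learn's GradientBoostingRegressor are all assumptions.

```python
import numpy as np
from scipy.ndimage import label
from sklearn.ensemble import GradientBoostingRegressor


def qa_features(image, segmentation, dae):
    """Per-patient features: DAE reconstruction error (domain-shift signal) plus simple mask statistics."""
    recon_error = float(np.mean((dae(image) - image) ** 2))
    volume = float(segmentation.sum())             # hand-engineered feature: predicted volume
    n_components = float(label(segmentation)[1])   # hand-engineered feature: mask fragmentation
    return [recon_error, volume, n_components]


def fit_dsc_regressor(calib_images, calib_segs, calib_dsc, dae):
    """Fit on a labelled calibration set where the true DSC of each segmentation is known."""
    X = np.array([qa_features(im, sg, dae) for im, sg in zip(calib_images, calib_segs)])
    return GradientBoostingRegressor().fit(X, np.asarray(calib_dsc))


def predict_dsc(regressor, image, segmentation, dae):
    """Estimate the per-patient DSC of a new, unlabelled case."""
    return float(regressor.predict([qa_features(image, segmentation, dae)])[0])
```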
arXiv Detail & Related papers (2023-05-19T14:51:05Z) - AI in the Loop -- Functionalizing Fold Performance Disagreement to Monitor Automated Medical Image Segmentation Pipelines [0.0]
Methods that automatically flag poor-performing predictions are essential for safely implementing machine learning in clinical practice.
We present a readily adoptable method using sub-models trained on different dataset folds, where their disagreement serves as a surrogate for model confidence.
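A minimal sketch of that idea, assuming each fold sub-model is a callable returning a probability mask: pairwise Dice between the sub-models' binarised predictions is averaged, and one minus that agreement serves as a disagreement score for flagging cases. Names and the 0.5 threshold are illustrative, not taken from the paper.

```python
import numpy as np
from itertools import combinations


def fold_disagreement(fold_models, image, threshold=0.5):
    """Disagreement between sub-models trained on different folds, as a confidence surrogate."""
    masks = [np.asarray(m(image)) > threshold for m in fold_models]
    pairwise_dice = []
    for a, b in combinations(masks, 2):
        denom = a.sum() + b.sum()
        pairwise_dice.append(2.0 * np.logical_and(a, b).sum() / denom if denom > 0 else 1.0)
    # High disagreement (low mean pairwise Dice) -> low confidence -> flag the case for review.
    return 1.0 - float(np.mean(pairwise_dice))
```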
arXiv Detail & Related papers (2023-05-15T21:35:23Z) - Effective Robustness against Natural Distribution Shifts for Models with Different Training Data [113.21868839569]
"Effective robustness" measures the extra out-of-distribution robustness beyond what can be predicted from the in-distribution (ID) performance.
We propose a new evaluation metric to evaluate and compare the effective robustness of models trained on different data.
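For context, the classical notion of effective robustness (which the cited paper refines to account for differing training data) can be sketched as a residual from a trend fit on a population of reference models; the logit-linear fit below is one common choice for that baseline, not the paper's new metric.

```python
import numpy as np


def effective_robustness(id_acc, ood_acc, baseline_id, baseline_ood):
    """OOD accuracy of a model above the trend predicted from its ID accuracy.

    The trend is fit on a population of reference models (accuracies strictly in (0, 1));
    a logit-linear fit is one common choice for this baseline.
    """
    logit = lambda p: np.log(np.asarray(p) / (1.0 - np.asarray(p)))
    slope, intercept = np.polyfit(logit(baseline_id), logit(baseline_ood), 1)
    predicted_ood = 1.0 / (1.0 + np.exp(-(slope * logit(id_acc) + intercept)))
    return float(ood_acc - predicted_ood)
```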
arXiv Detail & Related papers (2023-02-02T19:28:41Z) - Exploring validation metrics for offline model-based optimisation with diffusion models [50.404829846182764]
In model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black box function called the (ground truth) oracle.
While an approximation to the ground-truth oracle can be trained and used in its place during model validation to measure the mean reward over generated candidates, the evaluation is approximate and vulnerable to adversarial examples.
This is encapsulated under our proposed evaluation framework which is also designed to measure extrapolation.
arXiv Detail & Related papers (2022-11-19T16:57:37Z) - MSeg: A Composite Dataset for Multi-domain Semantic Segmentation [100.17755160696939]
We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains.
We reconcile the taxonomies and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images.
A model trained on MSeg ranks first on the WildDash-v1 leaderboard for robust semantic segmentation, with no exposure to WildDash data during training.
arXiv Detail & Related papers (2021-12-27T16:16:35Z) - A critical analysis of metrics used for measuring progress in artificial intelligence [9.387811897655016]
We analyse the current landscape of performance metrics based on data covering 3867 machine learning model performance results.
Results suggest that the large majority of metrics currently used have properties that may result in an inadequate reflection of a model's performance.
We describe ambiguities in reported metrics, which may lead to difficulties in interpreting and comparing model performances.
arXiv Detail & Related papers (2020-08-06T11:14:37Z) - Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
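To make the transductive step concrete, here is a small sketch of a prototype update weighted by per-query confidence; in this sketch the confidence is just a softmax over negative squared distances, whereas the cited paper meta-learns that weighting, which is not reproduced here. Shapes and the temperature parameter are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def transductive_prototype_update(prototypes, query_feats, temperature=1.0):
    """Refine class prototypes with confidence-weighted unlabeled query features.

    prototypes: (n_class, dim) array; query_feats: (n_query, dim) array.
    """
    dists = ((query_feats[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)  # (n_query, n_class)
    conf = softmax(-dists / temperature, axis=1)                               # soft assignments as confidence
    weighted_sum = conf.T @ query_feats                                        # (n_class, dim)
    # Treat each original prototype as one support example with unit weight.
    return (prototypes + weighted_sum) / (1.0 + conf.sum(axis=0)[:, None])
```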