SortedAP: Rethinking evaluation metrics for instance segmentation
- URL: http://arxiv.org/abs/2309.04887v1
- Date: Sat, 9 Sep 2023 22:50:35 GMT
- Title: SortedAP: Rethinking evaluation metrics for instance segmentation
- Authors: Long Chen, Yuli Wu, Johannes Stegmaier, Dorit Merhof
- Abstract summary: We show that most existing metrics have a limited resolution of segmentation quality.
We propose a new metric called sortedAP, which strictly decreases with both object- and pixel-level imperfections.
- Score: 8.079566596963632
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Designing metrics for evaluating instance segmentation revolves around
comprehensively considering object detection and segmentation accuracy.
However, other important properties, such as sensitivity, continuity, and
equality, are overlooked in current studies. In this paper, we reveal that
most existing metrics offer only a limited resolution of segmentation quality:
they are only conditionally sensitive to changes in masks or to false
predictions. For certain metrics, the score can change drastically within a
narrow range, which can give a misleading indication of the quality gap between results.
Therefore, we propose a new metric called sortedAP, which strictly decreases
with both object- and pixel-level imperfections and has an uninterrupted
penalization scale over the entire domain. We provide the evaluation toolkit
and experiment code at https://www.github.com/looooongChen/sortedAP.
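To make the sensitivity argument concrete, below is a minimal, self-contained sketch of a score with the properties described above: it fixes a one-to-one matching between predictions and ground truth, sorts the matched IoUs, and integrates an F1-style quantity over the entire IoU-threshold range, so any pixel-level mask degradation and any missing or spurious object lowers the score. This is an illustration only, not necessarily the paper's exact sortedAP definition; the function names and the F1-based weighting are choices made here, and the reference implementation lives in the toolkit linked above.

```python
# Illustration only: a score with the properties discussed above (strictly
# decreasing under both object- and pixel-level imperfections, no threshold
# plateaus). It is NOT the paper's exact sortedAP definition; see the toolkit
# at https://www.github.com/looooongChen/sortedAP for the reference code.
import numpy as np
from scipy.optimize import linear_sum_assignment


def pairwise_iou(gt_masks, pred_masks):
    """IoU matrix between lists of boolean ground-truth and predicted masks."""
    ious = np.zeros((len(gt_masks), len(pred_masks)))
    for i, g in enumerate(gt_masks):
        for j, p in enumerate(pred_masks):
            inter = np.logical_and(g, p).sum()
            union = np.logical_or(g, p).sum()
            ious[i, j] = inter / union if union > 0 else 0.0
    return ious


def sorted_auc_score(gt_masks, pred_masks):
    """Area under the F1-vs-IoU-threshold curve with a fixed one-to-one matching."""
    if len(gt_masks) == 0 or len(pred_masks) == 0:
        return 0.0
    ious = pairwise_iou(gt_masks, pred_masks)
    rows, cols = linear_sum_assignment(-ious)      # matching that maximizes total IoU
    matched = np.sort(ious[rows, cols])[::-1]      # matched IoUs, descending
    n_gt, n_pred = len(gt_masks), len(pred_masks)
    score, prev_t = 0.0, 0.0
    # Sweep the IoU threshold upward; matches drop out one by one at their IoU
    # values, so every pixel error and every missing/extra object lowers the area.
    for k in range(len(matched), 0, -1):
        t = matched[k - 1]                         # below t, exactly k matches survive
        score += (2.0 * k / (n_gt + n_pred)) * (t - prev_t)
        prev_t = t
    return score                                   # equals 2 * sum(matched) / (n_gt + n_pred)
```

In contrast, a threshold-based score such as AP at IoU 0.5 stays constant while a mask degrades, until the match crosses the 0.5 boundary and the score drops abruptly.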
Related papers
- Every Component Counts: Rethinking the Measure of Success for Medical Semantic Segmentation in Multi-Instance Segmentation Tasks [60.80828925396154]
We present Connected-Component (CC)-Metrics, a novel semantic segmentation evaluation protocol.
We motivate this setup in the common medical scenario of semantic segmentation in a full-body PET/CT.
We show how existing semantic segmentation metrics suffer from a bias towards larger connected components.
arXiv Detail & Related papers (2024-10-24T12:26:05Z)
- Size-invariance Matters: Rethinking Metrics and Losses for Imbalanced Multi-object Salient Object Detection [133.66006666465447]
Current metrics are size-sensitive: larger objects dominate the score, while smaller ones tend to be ignored.
We argue that the evaluation should be size-invariant because bias based on size is unjustified without additional semantic information.
We develop an optimization framework tailored to this goal, achieving considerable improvements in detecting objects of different sizes.
arXiv Detail & Related papers (2024-05-16T03:01:06Z)
- Revisiting Evaluation Metrics for Semantic Segmentation: Optimization and Evaluation of Fine-grained Intersection over Union [113.20223082664681]
We propose the use of fine-grained mIoUs along with corresponding worst-case metrics.
These fine-grained metrics offer less bias towards large objects, richer statistical information, and valuable insights into model and dataset auditing.
Our benchmark study highlights the necessity of not basing evaluations on a single metric and confirms that fine-grained mIoUs reduce the bias towards large objects.
arXiv Detail & Related papers (2023-10-30T03:45:15Z)
- SMATCH++: Standardized and Extended Evaluation of Semantic Graphs [4.987581730476023]
The Smatch metric is a popular method for evaluating graph distances.
We show how to fully conform to annotation guidelines that allow structurally deviating but valid graphs.
For improved scoring, we propose standardized and extended metric calculation of fine-grained sub-graph meaning aspects.
arXiv Detail & Related papers (2023-05-11T17:29:47Z)
- Beyond mAP: Towards better evaluation of instance segmentation [23.562251593257674]
Average Precision does not penalize duplicate predictions in the high-recall range (a small numerical check of this behaviour appears after this list).
We propose two new measures that explicitly quantify the amount of spatial and categorical duplicate predictions.
Our Semantic Sorting and NMS can be added as a plug-and-play module to mitigate hedged predictions and preserve AP.
arXiv Detail & Related papers (2022-07-04T17:56:14Z)
- On Quantitative Evaluations of Counterfactuals [88.42660013773647]
This paper consolidates work on evaluating visual counterfactual examples through an analysis and experiments.
We find that while most metrics behave as intended for sufficiently simple datasets, some fail to tell the difference between good and bad counterfactuals when the complexity increases.
We propose two new metrics, the Label Variation Score and the Oracle score, which are both less vulnerable to such tiny changes.
arXiv Detail & Related papers (2021-10-30T05:00:36Z)
- Boundary IoU: Improving Object-Centric Image Segmentation Evaluation [125.20898025044804]
We present Boundary IoU, a new segmentation evaluation measure focused on boundary quality.
Boundary IoU is significantly more sensitive than the standard Mask IoU measure to boundary errors for large objects and does not over-penalize errors on smaller objects (a rough sketch of the boundary-band idea appears after this list).
arXiv Detail & Related papers (2021-03-30T17:59:20Z)
- Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details [107.2722027807328]
We find that the default implementation of AP is neither category independent, nor does it directly reward properly calibrated detectors.
We show that the default implementation produces a gameable metric, where a simple, nonsensical re-ranking policy can improve AP by a large margin.
We benchmark recent advances in large-vocabulary detection and find that many reported gains do not translate to improvements under our new per-class independent evaluation.
arXiv Detail & Related papers (2021-02-01T18:56:02Z)
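As referenced in the "Beyond mAP" entry above, the claim that Average Precision does not penalize duplicate predictions in the high-recall range can be checked with a generic all-point-interpolated AP computation. The sketch below is only a numerical illustration of that behaviour, not the measures proposed in that paper; the helper name and the toy inputs are invented for this example.

```python
import numpy as np


def average_precision(is_tp, n_gt):
    """All-point-interpolated AP for a confidence-ranked list of predictions.

    is_tp: per-prediction flags (ranked by decreasing confidence), True where the
    prediction matches a ground-truth object; n_gt: number of ground-truth objects.
    """
    is_tp = np.asarray(is_tp, dtype=bool)
    tp, fp = np.cumsum(is_tp), np.cumsum(~is_tp)
    recall = tp / n_gt
    precision = tp / (tp + fp)
    # Interpolation: precision at recall r is the best precision at any recall >= r.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall))
    return float(np.sum((recall[1:] - recall[:-1]) * precision))


# Two ground-truth objects, both detected: AP = 1.0.
print(average_precision([True, True], n_gt=2))
# Low-confidence duplicates appended after full recall leave AP at 1.0.
print(average_precision([True, True, False, False, False], n_gt=2))
# The same false predictions ranked above a true positive do lower AP (0.7).
print(average_precision([True, False, False, False, True], n_gt=2))
```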
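For the Boundary IoU entry above, here is a rough sketch of the boundary-band idea: compute IoU only over the pixels that lie within roughly d pixels of each mask's contour, with the band approximated here by binary erosion. The official measure sets d as a fraction of the image diagonal and differs in implementation details, so treat this as an approximation rather than the reference code; the parameter d=2 and the convention for empty bands are choices made for this sketch.

```python
import numpy as np
from scipy.ndimage import binary_erosion


def boundary_band(mask, d):
    """Pixels of a boolean mask within (approximately) d pixels of its contour."""
    # border_value=0 makes pixels touching the image border count as boundary.
    eroded = binary_erosion(mask, iterations=d, border_value=0)
    return mask & ~eroded


def boundary_iou(gt, pred, d=2):
    """IoU restricted to the boundary bands of the two masks."""
    gt, pred = np.asarray(gt, dtype=bool), np.asarray(pred, dtype=bool)
    gt_band, pred_band = boundary_band(gt, d), boundary_band(pred, d)
    inter = np.logical_and(gt_band, pred_band).sum()
    union = np.logical_or(gt_band, pred_band).sum()
    return inter / union if union > 0 else 1.0  # both masks empty: perfect by convention here
```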