Revisiting Evaluation Metrics for Semantic Segmentation: Optimization
and Evaluation of Fine-grained Intersection over Union
- URL: http://arxiv.org/abs/2310.19252v1
- Date: Mon, 30 Oct 2023 03:45:15 GMT
- Title: Revisiting Evaluation Metrics for Semantic Segmentation: Optimization
and Evaluation of Fine-grained Intersection over Union
- Authors: Zifu Wang and Maxim Berman and Amal Rannen-Triki and Philip H.S. Torr
and Devis Tuia and Tinne Tuytelaars and Luc Van Gool and Jiaqian Yu and
Matthew B. Blaschko
- Abstract summary: We propose the use of fine-grained mIoUs along with corresponding worst-case metrics.
These fine-grained metrics offer less bias towards large objects, richer statistical information, and valuable insights into model and dataset auditing.
Our benchmark study highlights the necessity of not basing evaluations on a single metric and confirms that fine-grained mIoUs reduce the bias towards large objects.
- Score: 113.20223082664681
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic segmentation datasets often exhibit two types of imbalance:
\textit{class imbalance}, where some classes appear more frequently than
others, and \textit{size imbalance}, where some objects occupy more pixels than others.
This causes traditional evaluation metrics to be biased towards
\textit{majority classes} (e.g. overall pixel-wise accuracy) and \textit{large
objects} (e.g. mean pixel-wise accuracy and per-dataset mean intersection over
union). To address these shortcomings, we propose the use of fine-grained mIoUs
along with corresponding worst-case metrics, thereby offering a more holistic
evaluation of segmentation techniques. These fine-grained metrics offer less
bias towards large objects, richer statistical information, and valuable
insights into model and dataset auditing. Furthermore, we undertake an
extensive benchmark study, where we train and evaluate 15 modern neural
networks with the proposed metrics on 12 diverse natural and aerial
segmentation datasets. Our benchmark study highlights the necessity of not
basing evaluations on a single metric and confirms that fine-grained mIoUs
reduce the bias towards large objects. Moreover, we identify the crucial role
played by architecture designs and loss functions, which lead to best practices
in optimizing fine-grained metrics. The code is available at
https://github.com/zifuwanggg/JDTLosses.
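The contrast between the classic per-dataset mIoU and a fine-grained, image-level mIoU (with a worst-case companion) can be sketched as follows. This is a minimal illustration of the general idea, not the paper's released implementation; all function names are illustrative:

```python
import numpy as np

def iou_per_class(pred, gt, num_classes):
    """Per-class intersection and union pixel counts for one image."""
    inter = np.zeros(num_classes)
    union = np.zeros(num_classes)
    for c in range(num_classes):
        p, g = pred == c, gt == c
        inter[c] = np.logical_and(p, g).sum()
        union[c] = np.logical_or(p, g).sum()
    return inter, union

def dataset_miou(preds, gts, num_classes):
    """Classic per-dataset mIoU: pool pixel counts over all images first,
    so large objects dominate each class's statistic."""
    inter = np.zeros(num_classes)
    union = np.zeros(num_classes)
    for pred, gt in zip(preds, gts):
        i, u = iou_per_class(pred, gt, num_classes)
        inter += i
        union += u
    valid = union > 0
    return (inter[valid] / union[valid]).mean()

def image_level_miou(preds, gts, num_classes):
    """Fine-grained variant: score each image separately, then average.
    Also returns the worst image score as a worst-case metric."""
    scores = []
    for pred, gt in zip(preds, gts):
        i, u = iou_per_class(pred, gt, num_classes)
        valid = u > 0
        scores.append((i[valid] / u[valid]).mean())
    return float(np.mean(scores)), float(np.min(scores))
```

Because pixel counts are never pooled across images, an image containing only small objects contributes to the fine-grained score with the same weight as one dominated by a large object.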
Related papers
- Fine-grained Metrics for Point Cloud Semantic Segmentation [6.713120348917712]
Two forms of imbalances are commonly observed in point cloud semantic segmentation datasets.
Existing evaluation metrics favor majority categories and large objects.
This paper suggests fine-grained mIoU and mAcc for a more thorough assessment of point cloud segmentation algorithms.
arXiv Detail & Related papers (2024-07-31T02:25:30Z)
- Size-invariance Matters: Rethinking Metrics and Losses for Imbalanced Multi-object Salient Object Detection [133.66006666465447]
Current metrics are size-sensitive: larger objects receive most of the attention, while smaller ones tend to be ignored.
We argue that the evaluation should be size-invariant because bias based on size is unjustified without additional semantic information.
We develop an optimization framework tailored to this goal, achieving considerable improvements in detecting objects of different sizes.
arXiv Detail & Related papers (2024-05-16T03:01:06Z)
- $F_\beta$-plot -- a visual tool for evaluating imbalanced data classifiers [0.0]
The paper proposes a simple approach to analyzing the popular parametric metric $F_\beta$.
It is possible to indicate for a given pool of analyzed classifiers when a given model should be preferred depending on user requirements.
arXiv Detail & Related papers (2024-04-11T18:07:57Z)
- Piecewise-Linear Manifolds for Deep Metric Learning [8.670873561640903]
Unsupervised deep metric learning focuses on learning a semantic representation space using only unlabeled data.
We propose to model the high-dimensional data manifold using a piecewise-linear approximation, with each low-dimensional linear piece approximating the data manifold in a small neighborhood of a point.
We empirically show that this similarity estimate correlates better with the ground truth than the similarity estimates of current state-of-the-art techniques.
arXiv Detail & Related papers (2024-03-22T06:22:20Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
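BLEU's standard mechanism for handling several references is clipped n-gram precision: each hypothesis n-gram is credited up to the maximum count it attains in any single reference. A minimal sketch of that mechanism (illustrative names, not the paper's code):

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(hypothesis, references, n=1):
    """BLEU-style modified n-gram precision against multiple references:
    each hypothesis n-gram is clipped to the maximum count observed
    in any single reference."""
    hyp = ngrams(hypothesis, n)
    if not hyp:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in hyp.items())
    return clipped / sum(hyp.values())
```

Adding references can only raise this precision, which is why richer reference sets tend to correlate better with human judgments of acceptable paraphrases.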
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
- Joint Metrics Matter: A Better Standard for Trajectory Forecasting [67.1375677218281]
Multi-modal trajectory forecasting methods are typically evaluated using single-agent metrics (marginal metrics).
Only focusing on marginal metrics can lead to unnatural predictions, such as colliding trajectories or diverging trajectories for people who are clearly walking together as a group.
We present the first comprehensive evaluation of state-of-the-art trajectory forecasting methods with respect to multi-agent metrics (joint metrics): JADE, JFDE, and collision rate.
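The marginal/joint distinction boils down to where the best-of-K minimum is taken: marginal ADE picks the best sample independently for each agent, while joint ADE (JADE) forces every agent to share the same sample, scoring the scene as a whole. A hedged sketch of both, assuming a `(K, A, T, 2)` prediction tensor (K samples, A agents, T timesteps):

```python
import numpy as np

def ade(pred, gt):
    """Marginal ADE: best sample chosen independently per agent.
    pred: (K, A, T, 2), gt: (A, T, 2)."""
    err = np.linalg.norm(pred - gt[None], axis=-1).mean(axis=-1)  # (K, A)
    return err.min(axis=0).mean()  # min over samples per agent, mean over agents

def jade(pred, gt):
    """Joint ADE (JADE): the same sample k is used for all agents,
    so inconsistent scene predictions are penalized."""
    err = np.linalg.norm(pred - gt[None], axis=-1).mean(axis=-1)  # (K, A)
    return err.mean(axis=1).min()  # mean over agents per sample, min over samples
```

JADE is never smaller than ADE: mixing the best pieces of different samples can hide predictions that are individually plausible but jointly inconsistent, e.g. colliding trajectories.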
arXiv Detail & Related papers (2023-05-10T16:27:55Z)
- Long-tail Detection with Effective Class-Margins [4.18804572788063]
We show how the commonly used mean average precision evaluation metric on an unknown test set is bound by a margin-based binary classification error.
We optimize margin-based binary classification error with a novel surrogate objective called the Effective Class-Margin Loss (ECM).
arXiv Detail & Related papers (2023-01-23T21:25:24Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
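The base metric being discounted here is the standard R@n,IoU@m: the fraction of queries whose ground-truth moment is matched, at temporal IoU at least m, by one of the top-n predictions. A minimal sketch of that base recall (the paper's dR variant additionally discounts each hit by how far the predicted boundaries drift from the annotation; see the paper for the exact discount):

```python
def temporal_iou(a, b):
    """IoU of two [start, end] moments, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def recall_at_n(predictions, gts, n=1, iou_thresh=0.5):
    """R@n,IoU@m over a list of queries: predictions[i] is a ranked list of
    candidate moments for query i, gts[i] its annotated moment."""
    hits = sum(
        1
        for preds, gt in zip(predictions, gts)
        if any(temporal_iou(p, gt) >= iou_thresh for p in preds[:n])
    )
    return hits / len(gts)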
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations [78.12377360145078]
Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection.
In this paper, we first study how biases in the dataset affect existing methods.
We show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets.
arXiv Detail & Related papers (2021-06-10T17:59:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.