How Trustworthy are Performance Evaluations for Basic Vision Tasks?
- URL: http://arxiv.org/abs/2008.03533v4
- Date: Fri, 22 Jul 2022 09:48:22 GMT
- Title: How Trustworthy are Performance Evaluations for Basic Vision Tasks?
- Authors: Tran Thien Dat Nguyen, Hamid Rezatofighi, Ba-Ngu Vo, Ba-Tuong Vo,
Silvio Savarese, Ian Reid
- Abstract summary: This paper examines performance evaluation criteria for basic vision tasks involving sets of objects, namely object detection, instance-level segmentation, and multi-object tracking.
The rankings of algorithms by an existing criterion can fluctuate with different choices of parameters, making their evaluations unreliable.
This work suggests a notion of trustworthiness for performance criteria, which requires (i) robustness to parameters for reliability, (ii) contextual meaningfulness in sanity tests, and (iii) consistency with mathematical requirements such as the metric properties.
- Score: 46.0590176230731
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper examines performance evaluation criteria for basic vision tasks
involving sets of objects, namely object detection, instance-level segmentation,
and multi-object tracking. The rankings of algorithms by an existing criterion
can fluctuate with different choices of parameters, e.g., the Intersection over
Union (IoU) threshold, making their evaluations unreliable. More importantly,
there is no means to verify whether we can trust the evaluations of a
criterion. This work suggests a notion of trustworthiness for performance
criteria, which requires (i) robustness to parameters for reliability, (ii)
contextual meaningfulness in sanity tests, and (iii) consistency with
mathematical requirements such as the metric properties. We observe that these
requirements were overlooked by many widely-used criteria, and explore
alternative criteria using metrics for sets of shapes. We also assess all these
criteria based on the suggested requirements for trustworthiness.
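The sensitivity to the IoU threshold mentioned in the abstract can be made concrete with a small example. The following is a minimal sketch, not the paper's own evaluation code: the boxes, the two hypothetical detectors, and the greedy one-to-one matching rule are invented purely to illustrate how the same pair of detectors can swap ranks as the threshold moves.

```python
# Minimal sketch (assumed setup, not the paper's code) of how the ranking of
# two detectors can flip with the IoU threshold used for matching.

def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two ground-truth objects and one detection set per hypothetical detector.
gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
det_A = [(0, 0, 10, 8), (21, 21, 31, 31)]    # two loose but reasonable boxes
det_B = [(0, 0, 10, 10), (40, 40, 50, 50)]   # one perfect box, one clear miss

def true_positives(dets, gts, thresh):
    """Greedy one-to-one matching: a detection is a true positive if it
    overlaps an unmatched ground-truth box with IoU at or above the threshold."""
    matched, tp = set(), 0
    for d in dets:
        for i, g in enumerate(gts):
            if i not in matched and iou(d, g) >= thresh:
                matched.add(i)
                tp += 1
                break
    return tp

for thresh in (0.5, 0.75, 0.9):
    print(f"IoU >= {thresh}: A has {true_positives(det_A, gt, thresh)} TPs, "
          f"B has {true_positives(det_B, gt, thresh)} TPs")
```

Running the sketch puts detector A ahead at IoU >= 0.5, ties the two at 0.75, and puts B ahead at 0.9; this is the kind of parameter-dependent ranking instability that the paper's trustworthiness requirements are meant to expose.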
Related papers
- CheckEval: Robust Evaluation Framework using Large Language Model via Checklist [6.713203569074019]
We introduce CheckEval, a novel evaluation framework using Large Language Models.
CheckEval addresses the challenges of ambiguity and inconsistency in current evaluation methods.
arXiv Detail & Related papers (2024-03-27T17:20:39Z)
- HAUSER: Towards Holistic and Automatic Evaluation of Simile Generation [18.049566239050762]
Proper evaluation metrics are like a beacon guiding the research of simile generation (SG).
To address the issues, we establish HAUSER, a holistic and automatic evaluation system for the SG task, which consists of five criteria from three perspectives and automatic metrics for each criterion.
Our metrics are significantly more correlated with human ratings from each perspective compared with prior automatic metrics.
arXiv Detail & Related papers (2023-06-13T06:06:01Z)
- APPLS: Evaluating Evaluation Metrics for Plain Language Summarization [18.379461020500525]
This study introduces a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for Plain Language Summarization (PLS).
We identify four PLS criteria from previous work and define a set of perturbations corresponding to these criteria that sensitive metrics should be able to detect.
Using APPLS, we assess performance of 14 metrics, including automated scores, lexical features, and LLM prompt-based evaluations.
arXiv Detail & Related papers (2023-05-23T17:59:19Z)
- Towards a multi-stakeholder value-based assessment framework for algorithmic systems [76.79703106646967]
We develop a value-based assessment framework that visualizes closeness and tensions between values.
We give guidelines on how to operationalize them, while opening up the evaluation and deliberation process to a wide range of stakeholders.
arXiv Detail & Related papers (2022-05-09T19:28:32Z)
- QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-16T00:38:35Z)
- Perturbation CheckLists for Evaluating NLG Evaluation Metrics [16.20764980129339]
Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria.
Across existing datasets for 6 NLG tasks, we observe that the human evaluation scores on these multiple criteria are often not correlated.
This suggests that the current recipe of proposing new automatic evaluation metrics for NLG is inadequate.
arXiv Detail & Related papers (2021-09-13T08:26:26Z)
- REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
arXiv Detail & Related papers (2021-05-30T10:04:13Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.