Common Metrics to Benchmark Human-Machine Teams (HMT): A Review
- URL: http://arxiv.org/abs/2008.04855v1
- Date: Tue, 11 Aug 2020 16:57:52 GMT
- Title: Common Metrics to Benchmark Human-Machine Teams (HMT): A Review
- Authors: Praveen Damacharla, Ahmad Y. Javaid, Jennie J. Gallimore, Vijay K. Devabhaktuni
- Abstract summary: Metrics are the enabling tools to devise a benchmark in any system.
There is no agreed-upon set of benchmark metrics for developing HMT systems.
The key focus of this review is to conduct a detailed survey aimed at identification of metrics employed in different segments of HMT.
- Score: 1.0323063834827415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A significant amount of work is invested in human-machine teaming (HMT)
across multiple fields. Accurately and effectively measuring system performance
of an HMT is crucial for moving the design of these systems forward. Metrics
are the enabling tools for devising a benchmark in any system and serve as an
evaluation platform for assessing a system's performance, verification, and
validation. Currently, there is no agreed-upon set of
benchmark metrics for developing HMT systems. Therefore, identification and
classification of common metrics are imperative to create a benchmark in the
HMT field. The key focus of this review is to conduct a detailed survey aimed
at identification of metrics employed in different segments of HMT and to
determine the common metrics that can be used in the future to benchmark HMTs.
We have organized this review as follows: identification of the metrics used in
HMTs to date, and their classification based on functionality and measurement
technique. We also analyze all the identified metrics in detail, classifying
them as theoretical, applied, real-time, non-real-time, measurable, and
observable metrics. We conclude this review with
a detailed analysis of the identified common metrics along with their usage to
benchmark HMTs.
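To make the classification scheme described in the abstract concrete, here is a minimal sketch of how such a metric taxonomy could be represented in code. The three axes (theoretical/applied, real-time/non-real-time, measurable/observable) come from the abstract; the class names, example metrics, and their category assignments are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum

class Basis(Enum):
    THEORETICAL = "theoretical"
    APPLIED = "applied"

class Timing(Enum):
    REAL_TIME = "real-time"
    NON_REAL_TIME = "non-real-time"

class Observation(Enum):
    MEASURABLE = "measurable"   # quantified directly by instrumentation
    OBSERVABLE = "observable"   # inferred from observation of the team

@dataclass
class HMTMetric:
    name: str
    basis: Basis
    timing: Timing
    observation: Observation

# Hypothetical example entries; the category assignments are illustrative,
# not drawn from the review itself.
catalog = [
    HMTMetric("task completion time", Basis.APPLIED, Timing.REAL_TIME, Observation.MEASURABLE),
    HMTMetric("trust in automation", Basis.THEORETICAL, Timing.NON_REAL_TIME, Observation.OBSERVABLE),
]

for m in catalog:
    print(f"{m.name}: {m.basis.value}, {m.timing.value}, {m.observation.value}")
```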
Related papers
- From Jack of All Trades to Master of One: Specializing LLM-based Autoraters to a Test Set [17.60104729231524]
We design a method which specializes a prompted Autorater to a given test set, by leveraging historical ratings on the test set to construct in-context learning examples.
We evaluate our Specialist method on the task of fine-grained machine translation evaluation, and show that it dramatically outperforms the state-of-the-art XCOMET metric by 54% and 119% on the WMT'23 and WMT'24 test sets.
arXiv Detail & Related papers (2024-11-23T00:02:21Z)
- MetaMetrics-MT: Tuning Meta-Metrics for Machine Translation via Human Preference Calibration [14.636927775315783]
We present MetaMetrics-MT, an innovative metric designed to evaluate machine translation (MT) tasks by aligning closely with human preferences.
Our experiments on the WMT24 metric shared task dataset demonstrate that MetaMetrics-MT outperforms all existing baselines.
arXiv Detail & Related papers (2024-11-01T06:34:30Z)
- Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics [46.71836180414362]
We introduce an interpretable evaluation framework for Machine Translation (MT) metrics.
Within this framework, we evaluate metrics in two scenarios that serve as proxies for the data filtering and translation re-ranking use cases.
We also raise concerns regarding the reliability of manually curated data following the Direct Assessments+Scalar Quality Metrics (DA+SQM) guidelines.
arXiv Detail & Related papers (2024-10-07T16:42:10Z)
- Guardians of the Machine Translation Meta-Evaluation: Sentinel Metrics Fall In! [80.3129093617928]
Annually, at the Conference on Machine Translation (WMT), the Metrics Shared Task organizers conduct the meta-evaluation of Machine Translation (MT) metrics.
This work highlights two issues with the meta-evaluation framework currently employed in WMT, and assesses their impact on the metrics rankings.
We introduce the concept of sentinel metrics, which are designed explicitly to scrutinize the meta-evaluation process's accuracy, robustness, and fairness.
arXiv Detail & Related papers (2024-08-25T13:29:34Z)
- ECBD: Evidence-Centered Benchmark Design for NLP [95.50252564938417]
We propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules.
Each module requires benchmark designers to describe, justify, and support benchmark design choices.
Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.
arXiv Detail & Related papers (2024-06-13T00:59:55Z)
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z)
- Extrinsic Evaluation of Machine Translation Metrics [78.75776477562087]
It is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level.
We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks.
Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes.
arXiv Detail & Related papers (2022-12-20T14:39:58Z)
- Exploring and Analyzing Machine Commonsense Benchmarks [0.13999481573773073]
We argue that the lack of a common vocabulary for aligning these approaches' metadata limits researchers in their efforts to understand systems' deficiencies.
We describe our initial MCS Benchmark Ontology, a common vocabulary that formalizes benchmark metadata.
arXiv Detail & Related papers (2020-12-21T19:01:55Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
- HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking [48.497889944886516]
Multi-Object Tracking (MOT) has been notoriously difficult to evaluate.
Previous metrics overemphasize the importance of either detection or association.
We present a novel MOT evaluation metric, HOTA, which balances the effect of performing accurate detection, association and localization.
arXiv Detail & Related papers (2020-09-16T15:11:30Z)
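As added context for the HOTA entry above, here is a condensed sketch of the published definition as recalled from the HOTA paper; it is not taken from the abstract excerpt, and the original paper should be treated as authoritative. At a fixed localization threshold α, HOTA combines a detection accuracy and an association accuracy by a geometric mean, and the final score averages over thresholds.

```latex
\mathrm{DetA}_\alpha = \frac{|\mathrm{TP}|}{|\mathrm{TP}| + |\mathrm{FN}| + |\mathrm{FP}|},
\qquad
\mathrm{AssA}_\alpha = \frac{1}{|\mathrm{TP}|} \sum_{c \in \mathrm{TP}}
  \frac{|\mathrm{TPA}(c)|}{|\mathrm{TPA}(c)| + |\mathrm{FNA}(c)| + |\mathrm{FPA}(c)|}

\mathrm{HOTA}_\alpha = \sqrt{\mathrm{DetA}_\alpha \cdot \mathrm{AssA}_\alpha},
\qquad
\mathrm{HOTA} = \int_0^1 \mathrm{HOTA}_\alpha \, d\alpha
  \approx \frac{1}{19} \sum_{\alpha \in \{0.05,\, 0.10,\, \ldots,\, 0.95\}} \mathrm{HOTA}_\alpha
```

Here TP, FN, and FP are matched, missed, and spurious detections at threshold α, while TPA, FNA, and FPA are their association-level counterparts for a given true positive c; this is how the metric balances detection, association, and localization as described in the summary above.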
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all summaries) and is not responsible for any consequences arising from its use.