Exploring and Analyzing Machine Commonsense Benchmarks
- URL: http://arxiv.org/abs/2012.11634v1
- Date: Mon, 21 Dec 2020 19:01:55 GMT
- Title: Exploring and Analyzing Machine Commonsense Benchmarks
- Authors: Henrique Santos, Minor Gordon, Zhicheng Liang, Gretchen Forbush,
Deborah L. McGuinness
- Abstract summary: We argue that the lack of a common vocabulary for aligning these approaches' metadata limits researchers in their efforts to understand systems' deficiencies.
We describe our initial MCS Benchmark Ontology, an extensible common vocabulary that formalizes benchmark metadata.
- Score: 0.13999481573773073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Commonsense question-answering (QA) tasks, in the form of benchmarks, are
constantly being introduced for challenging and comparing commonsense QA
systems. The benchmarks provide question sets that systems' developers can use
to train and test new models before submitting their implementations to
official leaderboards. Although these tasks are created to evaluate systems in
identified dimensions (e.g. topic, reasoning type), this metadata is limited
and largely presented in an unstructured format, or is entirely absent.
Because machine common sense is a fast-paced field, the problem of fully
assessing current benchmarks and systems with regard to these evaluation
dimensions is aggravated. We argue that the lack of a common vocabulary for
aligning these approaches' metadata limits researchers in their efforts to
understand systems' deficiencies and in making effective choices for future
tasks. In this paper, we first discuss this MCS ecosystem in terms of its
elements and their metadata. Then, we present how we are supporting the
assessment of approaches by initially focusing on commonsense benchmarks. We
describe our initial MCS Benchmark Ontology, an extensible common vocabulary
that formalizes benchmark metadata, and showcase how it is supporting the
development of a Benchmark tool that enables benchmark exploration and
analysis.
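To make the idea of formalized benchmark metadata concrete, here is a minimal sketch of how such metadata could be expressed as RDF with rdflib. The namespace, class, and property names (e.g. mcsb:Benchmark, mcsb:reasoningType) are illustrative assumptions, not the actual terms of the authors' MCS Benchmark Ontology.

```python
# Minimal sketch: benchmark metadata as RDF triples with rdflib.
# All namespace, class, and property names are hypothetical placeholders,
# not the actual terms of the MCS Benchmark Ontology.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

MCSB = Namespace("http://example.org/mcs-benchmark#")  # hypothetical namespace

g = Graph()
g.bind("mcsb", MCSB)

bench = MCSB.CommonsenseQA  # an example benchmark individual
g.add((bench, RDF.type, MCSB.Benchmark))
g.add((bench, RDFS.label, Literal("CommonsenseQA")))
g.add((bench, MCSB.taskFormat, MCSB.MultipleChoiceQA))        # question format
g.add((bench, MCSB.reasoningType, MCSB.ConceptualReasoning))  # evaluation dimension
g.add((bench, MCSB.hasLeaderboard, Literal("https://example.org/leaderboard")))

# Once the metadata is structured, exploration questions such as "which
# benchmarks target a given reasoning type?" reduce to simple SPARQL queries.
query = """
SELECT ?b WHERE { ?b a mcsb:Benchmark ; mcsb:reasoningType mcsb:ConceptualReasoning . }
"""
for row in g.query(query, initNs={"mcsb": MCSB}):
    print(row.b)
```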
Related papers
- "Is This It?": Towards Ecologically Valid Benchmarks for Situated Collaboration [16.25921668308458]
We develop benchmarks to assess the capabilities of large multimodal models for engaging in situated collaboration.
In contrast to existing benchmarks, in which question-answer pairs are generated post hoc over preexisting or synthetic datasets via templates, human annotators, or large language models, we propose and investigate an interactive system-driven approach.
We illustrate how the questions that arise are different in form and content from questions typically found in existing embodied question answering (EQA) benchmarks and discuss new real-world challenge problems brought to the fore.
arXiv Detail & Related papers (2024-08-30T12:41:23Z) - Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z) - ECBD: Evidence-Centered Benchmark Design for NLP [95.50252564938417]
We propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules.
Each module requires benchmark designers to describe, justify, and support benchmark design choices.
Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.
arXiv Detail & Related papers (2024-06-13T00:59:55Z) - A Theoretically Grounded Benchmark for Evaluating Machine Commonsense [6.725087407394836]
Theoretically-Grounded Commonsense Reasoning (TG-CSR) is based on discriminative question answering, but with questions designed to evaluate diverse aspects of commonsense.
TG-CSR is based on a subset of commonsense categories first proposed as a viable theory of commonsense by Gordon and Hobbs.
Preliminary results suggest that the benchmark is challenging even for advanced language representation models designed for discriminative CSR question answering tasks.
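As a rough illustration of the discriminative setup described above, the sketch below scores each candidate answer and selects the highest-scoring one; the item format and the scoring function are placeholders, not TG-CSR's actual protocol or data.

```python
# Sketch of discriminative commonsense QA evaluation: a model assigns a score to
# each candidate answer and the highest-scoring candidate is the prediction.
# The item format and scorer are placeholders, not TG-CSR's own.
from typing import Callable, Sequence

def discriminative_accuracy(
    items: Sequence[dict],
    score: Callable[[str, str], float],  # (question, candidate) -> score
) -> float:
    """Each item: {"question": str, "candidates": list[str], "answer_idx": int}."""
    correct = 0
    for item in items:
        scores = [score(item["question"], c) for c in item["candidates"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == item["answer_idx"])
    return correct / len(items)

# Toy usage with an arbitrary placeholder scorer (prefers shorter candidates).
toy = [{"question": "Why would someone carry an umbrella on a cloudy day?",
        "candidates": ["to stay dry", "to signal rescue aircraft overhead"],
        "answer_idx": 0}]
print(discriminative_accuracy(toy, lambda q, c: -len(c)))  # 1.0
```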
arXiv Detail & Related papers (2022-03-23T04:06:01Z) - QAFactEval: Improved QA-Based Factual Consistency Evaluation for
Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-16T00:38:35Z) - The Benchmark Lottery [114.43978017484893]
"A benchmark lottery" describes the overall fragility of the machine learning benchmarking process.
We show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks.
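The core observation lends itself to a small illustration: the same systems, scored on two different task selections, can come out in different orders. The numbers below are invented, and Kendall's tau is just one way such ranking instability might be quantified.

```python
# Illustration of the "benchmark lottery": identical systems, different task
# subsets, different rankings. All scores are invented for the example.
from scipy.stats import kendalltau

systems = ["sys_A", "sys_B", "sys_C", "sys_D"]
subset_1 = [71.2, 69.8, 68.4, 65.0]  # hypothetical accuracy on one task selection
subset_2 = [66.1, 70.3, 71.9, 64.2]  # hypothetical accuracy on another selection

def ranking(scores):
    return [name for _, name in sorted(zip(scores, systems), reverse=True)]

print("ranking under subset 1:", ranking(subset_1))  # ['sys_A', 'sys_B', 'sys_C', 'sys_D']
print("ranking under subset 2:", ranking(subset_2))  # ['sys_C', 'sys_B', 'sys_A', 'sys_D']

tau, p_value = kendalltau(subset_1, subset_2)
print(f"Kendall's tau between the two score lists: {tau:.2f}")
```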
arXiv Detail & Related papers (2021-07-14T21:08:30Z) - What Will it Take to Fix Benchmarking in Natural Language Understanding? [30.888416756627155]
We lay out four criteria that we argue NLU benchmarks should meet.
Restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets.
arXiv Detail & Related papers (2021-04-05T20:36:11Z) - CBench: Towards Better Evaluation of Question Answering Over Knowledge
Graphs [3.631024220680066]
We introduce CBench, an informative benchmarking suite for analyzing benchmarks and evaluating question answering systems.
CBench can be used to analyze existing benchmarks with respect to several fine-grained linguistic, syntactic, and structural properties of the questions and queries.
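For a sense of what fine-grained linguistic and structural properties can mean in practice, here is a deliberately simplified sketch of per-question analysis; the chosen properties and the string-based handling of SPARQL are assumptions for illustration, not CBench's implementation.

```python
# Simplified sketch of per-question benchmark analysis: a few linguistic and
# structural properties of a natural-language question and its SPARQL query.
# The property set and the crude string handling are illustrative only.
import re

WH_WORDS = ("what", "who", "where", "when", "which", "how", "why")

def question_properties(question: str, sparql: str) -> dict:
    first_token = question.strip().lower().split()[0]
    return {
        "num_tokens": len(question.split()),
        "wh_word": first_token if first_token in WH_WORDS else None,
        # crude structural proxy: count triple-pattern separators in the query
        "num_triple_patterns": len(re.findall(r"\.\s", sparql + " ")),
    }

print(question_properties(
    "Who wrote the novel that inspired Blade Runner?",
    "SELECT ?a WHERE { ?f rdfs:label 'Blade Runner'@en . ?f dbo:basedOn ?b . "
    "?b dbo:author ?a . }",
))
```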
arXiv Detail & Related papers (2021-04-05T15:41:14Z) - GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
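At its simplest, meta-evaluating a factuality metric means checking how well its scores track human factuality judgments; the sketch below does only that, with invented numbers, and a full framework such as GO FIGURE layers additional diagnostics on top of this basic check.

```python
# Bare-bones meta-evaluation: correlate each metric's scores with human
# factuality judgments over the same set of summaries. Numbers are invented.
from scipy.stats import spearmanr

human_judgments = [0.9, 0.4, 0.7, 0.2, 0.8]  # hypothetical human factuality ratings
metric_scores = {
    "qa_based_metric": [0.85, 0.50, 0.65, 0.30, 0.90],  # hypothetical metric outputs
    "ngram_overlap":   [0.60, 0.55, 0.50, 0.45, 0.65],
}

for name, scores in metric_scores.items():
    rho, p_value = spearmanr(human_judgments, scores)
    print(f"{name}: Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```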
arXiv Detail & Related papers (2020-10-24T08:30:20Z) - Towards Question-Answering as an Automatic Metric for Evaluating the
Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
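The general shape of a QA-based content metric can be sketched as follows; the question-generation and question-answering components are placeholders, and the exact-match scoring is not QAEval's actual scorer.

```python
# Sketch of a QA-based summary content metric: generate question-answer pairs
# from the reference, answer the questions against the candidate summary, and
# report the fraction answered correctly. The QG/QA components and the exact
# match scoring are placeholders, not QAEval's actual models or scorer.
from typing import Callable, List, Tuple

def qa_content_score(
    reference: str,
    candidate: str,
    generate_qa: Callable[[str], List[Tuple[str, str]]],  # text -> (question, gold answer)
    answer: Callable[[str, str], str],                    # (question, context) -> answer
) -> float:
    qa_pairs = generate_qa(reference)
    if not qa_pairs:
        return 0.0
    correct = sum(
        answer(q, candidate).strip().lower() == gold.strip().lower()
        for q, gold in qa_pairs
    )
    return correct / len(qa_pairs)

# Toy usage with hand-written stand-ins for the learned components.
ref = "Marie Curie won two Nobel Prizes."
cand = "Curie received two Nobel Prizes for her research."
fake_qg = lambda text: [("How many Nobel Prizes did Curie win?", "two")]
fake_qa = lambda q, ctx: "two" if "two" in ctx else "unknown"
print(qa_content_score(ref, cand, fake_qg, fake_qa))  # 1.0
```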
arXiv Detail & Related papers (2020-10-01T15:33:09Z) - A Framework for Evaluation of Machine Reading Comprehension Gold
Standards [7.6250852763032375]
This paper proposes a unifying framework to investigate the linguistic features present in gold standards, the reasoning and background knowledge they require, and their factual correctness.
Its application reveals an absence of features that contribute towards lexical ambiguity, varying factual correctness of the expected answers, and the presence of lexical cues, all of which potentially lower the reading comprehension complexity and the quality of the evaluation data.
arXiv Detail & Related papers (2020-03-10T11:30:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.