Benchmarks for Automated Commonsense Reasoning: A Survey
- URL: http://arxiv.org/abs/2302.04752v1
- Date: Thu, 9 Feb 2023 16:34:30 GMT
- Title: Benchmarks for Automated Commonsense Reasoning: A Survey
- Authors: Ernest Davis
- Abstract summary: More than one hundred benchmarks have been developed to test the commonsense knowledge and commonsense reasoning abilities of AI systems.
This paper surveys the development and uses of AI commonsense benchmarks.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: More than one hundred benchmarks have been developed to test the commonsense
knowledge and commonsense reasoning abilities of artificial intelligence (AI)
systems. However, these benchmarks are often flawed and many aspects of common
sense remain untested. Consequently, we do not currently have any reliable way
of measuring to what extent existing AI systems have achieved these abilities.
This paper surveys the development and uses of AI commonsense benchmarks. We
discuss the nature of common sense; the role of common sense in AI; the goals
served by constructing commonsense benchmarks; and desirable features of
commonsense benchmarks. We analyze the common flaws in benchmarks, and we argue
that it is worthwhile to invest the work needed to ensure that benchmark examples
are consistently high quality. We survey the various methods of constructing
commonsense benchmarks. We enumerate 139 commonsense benchmarks that have been
developed: 102 text-based, 18 image-based, 12 video-based, and 7 simulated
physical environments. We discuss the gaps in the existing benchmarks and
aspects of commonsense reasoning that are not addressed in any existing
benchmark. We conclude with a number of recommendations for future development
of commonsense AI benchmarks.
Related papers
- BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices [28.70453947993952]
We develop an assessment framework considering 46 best practices across an AI benchmark's lifecycle and evaluate 24 AI benchmarks against it.
We find that there exist large quality differences and that commonly used benchmarks suffer from significant issues.
arXiv Detail & Related papers (2024-11-20T02:38:24Z)
- Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation? [90.30635552818875]
We present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs.
This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals.
We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets.
arXiv Detail & Related papers (2024-11-06T05:09:34Z)
- ECBD: Evidence-Centered Benchmark Design for NLP [95.50252564938417]
We propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules.
Each module requires benchmark designers to describe, justify, and support benchmark design choices.
Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks' measurements.
arXiv Detail & Related papers (2024-06-13T00:59:55Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- Introducing v0.5 of the AI Safety Benchmark from MLCommons [101.98401637778638]
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group.
The benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models.
arXiv Detail & Related papers (2024-04-18T15:01:00Z)
- A Theoretically Grounded Benchmark for Evaluating Machine Commonsense [6.725087407394836]
Theoretically-Grounded Commonsense Reasoning (TG-CSR) is based on discriminative question answering, but with questions designed to evaluate diverse aspects of commonsense.
TG-CSR is based on a subset of commonsense categories first proposed as a viable theory of commonsense by Gordon and Hobbs.
Preliminary results suggest that the benchmark is challenging even for advanced language representation models designed for discriminative CSR question answering tasks.
arXiv Detail & Related papers (2022-03-23T04:06:01Z)
- The Benchmark Lottery [114.43978017484893]
"A benchmark lottery" describes the overall fragility of the machine learning benchmarking process.
We show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks.
arXiv Detail & Related papers (2021-07-14T21:08:30Z)
- What Will it Take to Fix Benchmarking in Natural Language Understanding? [30.888416756627155]
We lay out four criteria that we argue NLU benchmarks should meet.
Restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets.
arXiv Detail & Related papers (2021-04-05T20:36:11Z)
- Do Question Answering Modeling Improvements Hold Across Benchmarks? [84.48867898593052]
We measure concurrence between 32 QA benchmarks on a set of 20 diverse modeling approaches.
Despite years of intense community focus on a small number of benchmarks, the modeling improvements studied hold broadly.
arXiv Detail & Related papers (2021-02-01T18:55:38Z)
- Exploring and Analyzing Machine Commonsense Benchmarks [0.13999481573773073]
We argue that the lack of a common vocabulary for aligning these benchmarks' metadata limits researchers in their efforts to understand systems' deficiencies.
We describe our initial MCS Benchmark Ontology, a common vocabulary that formalizes benchmark metadata.
arXiv Detail & Related papers (2020-12-21T19:01:55Z)
- Do Fine-tuned Commonsense Language Models Really Generalize? [8.591839265985412]
We study the generalization issue in detail by designing and conducting a rigorous scientific study.
We find clear evidence that fine-tuned commonsense language models still do not generalize well, even with moderate changes to the experimental setup.
arXiv Detail & Related papers (2020-11-18T08:52:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.