A Theoretically Grounded Benchmark for Evaluating Machine Commonsense
- URL: http://arxiv.org/abs/2203.12184v1
- Date: Wed, 23 Mar 2022 04:06:01 GMT
- Title: A Theoretically Grounded Benchmark for Evaluating Machine Commonsense
- Authors: Henrique Santos, Ke Shen, Alice M. Mulvehill, Yasaman Razeghi, Deborah
L. McGuinness, Mayank Kejriwal
- Abstract summary: Theoretically-Grounded Commonsense Reasoning (TG-CSR) is based on discriminative question answering, but with questions designed to evaluate diverse aspects of commonsense.
TG-CSR is based on a subset of commonsense categories first proposed as a viable theory of commonsense by Gordon and Hobbs.
Preliminary results suggest that the benchmark is challenging even for advanced language representation models designed for discriminative CSR question answering tasks.
- Score: 6.725087407394836
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Programming machines with commonsense reasoning (CSR) abilities is a
longstanding challenge in the Artificial Intelligence community. Current CSR
benchmarks use multiple-choice (and in relatively fewer cases, generative)
question-answering instances to evaluate machine commonsense. Recent progress
in transformer-based language representation models suggests that considerable
progress has been made on existing benchmarks. However, although tens of CSR
benchmarks currently exist and their number is growing, it is not evident that
the full suite of commonsense capabilities has been systematically evaluated.
Furthermore, there are doubts about whether language models are 'fitting' to a
benchmark dataset's training partition by picking up on subtle, but normatively
irrelevant (at least for CSR), statistical features to achieve good performance
on the testing partition. To address these challenges, we propose a benchmark
called Theoretically-Grounded Commonsense Reasoning (TG-CSR) that is also based
on discriminative question answering, but with questions designed to evaluate
diverse aspects of commonsense, such as space, time, and world states. TG-CSR
is based on a subset of commonsense categories first proposed as a viable
theory of commonsense by Gordon and Hobbs. The benchmark is also designed to be
few-shot (and in the future, zero-shot), with only a few training and
validation examples provided. This report discusses the structure and
construction of the benchmark. Preliminary results suggest that the benchmark
is challenging even for advanced language representation models designed for
discriminative CSR question answering tasks.
Benchmark access and leaderboard: https://codalab.lisn.upsaclay.fr/competitions/3080
Benchmark website: https://usc-isi-i2.github.io/TGCSR/
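The abstract describes TG-CSR as a discriminative (multiple-choice style) question-answering benchmark evaluated few-shot, with only a handful of training and validation examples. The sketch below illustrates what such an evaluation loop might look like. The instance format, the toy questions, and the `score_fn`/`evaluate` helpers are hypothetical stand-ins, not the official TG-CSR schema or scoring code; the CodaLab competition page above defines the actual submission format.

```python
# Minimal sketch of a discriminative (multiple-choice) commonsense QA
# evaluation loop. Field names and questions are illustrative only.
from typing import Callable, Dict, List

# Hypothetical few-shot instances: a question, candidate answers, and the
# index of the gold answer. Real TG-CSR items target Gordon and Hobbs
# categories such as space, time, and world states, and follow their own schema.
EXAMPLES: List[Dict] = [
    {
        "question": "After camping overnight, where is the tent most likely to be?",
        "candidates": ["at the campsite", "inside a mailbox", "on the ceiling"],
        "gold": 0,
    },
    {
        "question": "If a hike starts at dawn, what is the sky most likely doing?",
        "candidates": ["getting lighter", "staying pitch black all morning"],
        "gold": 0,
    },
]

def evaluate(score_fn: Callable[[str, str], float], examples: List[Dict]) -> float:
    """Pick the highest-scoring candidate for each question and return accuracy."""
    correct = 0
    for ex in examples:
        scores = [score_fn(ex["question"], cand) for cand in ex["candidates"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == ex["gold"])
    return correct / len(examples)

if __name__ == "__main__":
    # Stand-in scorer (prefers shorter answers) so the loop runs end to end;
    # replace it with a language model's plausibility score for the pair.
    dummy_score = lambda question, candidate: -len(candidate)
    print(f"accuracy: {evaluate(dummy_score, EXAMPLES):.2f}")
```

In a few-shot setting, the small training and validation split would typically be used only for prompt construction or light calibration of `score_fn`, with the held-out questions scored exactly as above.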
Related papers
- Can we hop in general? A discussion of benchmark selection and design using the Hopper environment [12.18012293738896]
We argue that benchmarking in reinforcement learning needs to be treated as a scientific discipline itself.
A case study shows that the selection of standard benchmarking suites can drastically change how we judge the performance of algorithms.
arXiv Detail & Related papers (2024-10-11T14:47:22Z)
- IRSC: A Zero-shot Evaluation Benchmark for Information Retrieval through Semantic Comprehension in Retrieval-Augmented Generation Scenarios [14.336896748878921]
This paper introduces the IRSC benchmark for evaluating the performance of embedding models in multilingual RAG tasks.
The benchmark encompasses five retrieval tasks: query retrieval, title retrieval, part-of-paragraph retrieval, keyword retrieval, and summary retrieval.
Our contributions include: 1) the IRSC benchmark, 2) the SSCI and RCCI metrics, and 3) insights into the cross-lingual limitations of embedding models. (A generic sketch of scoring embedding-based retrieval appears after this list.)
arXiv Detail & Related papers (2024-09-24T05:39:53Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z)
- ARB: Advanced Reasoning Benchmark for Large Language Models [94.37521840642141]
We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields.
As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge.
We evaluate recent models such as GPT-4 and Claude on ARB and demonstrate that current models score well below 50% on more demanding tasks.
arXiv Detail & Related papers (2023-07-25T17:55:19Z)
- Benchmarks for Automated Commonsense Reasoning: A Survey [0.0]
More than one hundred benchmarks have been developed to test the commonsense knowledge and commonsense reasoning abilities of AI systems.
This paper surveys the development and uses of AI commonsense benchmarks.
arXiv Detail & Related papers (2023-02-09T16:34:30Z)
- Open-Set Recognition: A Good Closed-Set Classifier is All You Need [146.6814176602689]
We show that the ability of a classifier to make the 'none-of-the-above' decision is highly correlated with its accuracy on the closed-set classes.
We use this correlation to boost the performance of the cross-entropy OSR 'baseline' by improving its closed-set accuracy.
We also construct new benchmarks which better respect the task of detecting semantic novelty.
arXiv Detail & Related papers (2021-10-12T17:58:59Z)
- The Benchmark Lottery [114.43978017484893]
"A benchmark lottery" describes the overall fragility of the machine learning benchmarking process.
We show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks.
arXiv Detail & Related papers (2021-07-14T21:08:30Z)
- Do Question Answering Modeling Improvements Hold Across Benchmarks? [84.48867898593052]
We measure concurrence between 32 QA benchmarks on a set of 20 diverse modeling approaches.
Despite years of intense community focus on a small number of benchmarks, the modeling improvements studied hold broadly.
arXiv Detail & Related papers (2021-02-01T18:55:38Z)
- Exploring and Analyzing Machine Commonsense Benchmarks [0.13999481573773073]
We argue that the lack of a common vocabulary for aligning these approaches' metadata limits researchers in their efforts to understand systems' deficiencies.
We describe our initial MCS Benchmark Ontology, a common vocabulary that formalizes benchmark metadata.
arXiv Detail & Related papers (2020-12-21T19:01:55Z)
- Do Fine-tuned Commonsense Language Models Really Generalize? [8.591839265985412]
We study the generalization issue in detail by designing and conducting a rigorous scientific study.
We find clear evidence that fine-tuned commonsense language models still do not generalize well, even with moderate changes to the experimental setup.
arXiv Detail & Related papers (2020-11-18T08:52:49Z)
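The IRSC entry above evaluates embedding models on five retrieval tasks (query, title, part-of-paragraph, keyword, and summary retrieval). As a generic illustration of how embedding-based retrieval can be scored, the sketch below embeds a small corpus, retrieves the nearest document for each query by cosine similarity, and reports accuracy@1. The toy character-frequency embedding, the corpus, and the metric are stand-ins; they are not the IRSC datasets or its SSCI/RCCI metrics.

```python
# Generic sketch of scoring an embedding model on a retrieval task
# (e.g., query -> passage retrieval). Everything here is illustrative.
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def accuracy_at_1(
    embed: Callable[[str], Vector],
    queries: Dict[str, str],   # query text -> id of the relevant document
    corpus: Dict[str, str],    # document id -> document text
) -> float:
    """Embed the corpus once, retrieve the nearest document per query, count hits."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in corpus.items()}
    hits = 0
    for query, gold_id in queries.items():
        q_vec = embed(query)
        best_id = max(doc_vecs, key=lambda doc_id: cosine(q_vec, doc_vecs[doc_id]))
        hits += int(best_id == gold_id)
    return hits / len(queries)

if __name__ == "__main__":
    # Toy character-frequency "embedding" so the example runs without a model.
    def toy_embed(text: str) -> Vector:
        text = text.lower()
        return [text.count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

    corpus = {"d1": "the cat sat on the mat", "d2": "stock markets rallied today"}
    queries = {"where did the cat sit": "d1", "how did markets perform": "d2"}
    print(f"accuracy@1: {accuracy_at_1(toy_embed, queries, corpus):.2f}")
```

Swapping `toy_embed` for a real multilingual embedding model and adding recall@k over larger corpora would move this toward the kind of cross-lingual evaluation the IRSC paper describes.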
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.