Benchmarks for Physical Reasoning AI
- URL: http://arxiv.org/abs/2312.10728v1
- Date: Sun, 17 Dec 2023 14:24:03 GMT
- Title: Benchmarks for Physical Reasoning AI
- Authors: Andrew Melnik, Robin Schiewer, Moritz Lange, Andrei Muresanu, Mozhgan Saeidi, Animesh Garg, Helge Ritter
- Abstract summary: We offer an overview of existing benchmarks and their solution approaches for measuring the physical reasoning capacity of AI systems.
We select benchmarks that are designed to test algorithmic performance in physical reasoning tasks.
We group the presented set of physical reasoning benchmarks into subcategories so that narrower generalist AI agents can first be tested on these groups.
- Score: 28.02418565463541
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Physical reasoning is a crucial aspect in the development of general AI
systems, given that human learning starts with interacting with the physical
world before progressing to more complex concepts. Although researchers have
studied and assessed the physical reasoning of AI approaches through various
specific benchmarks, there is no comprehensive approach to evaluating and
measuring progress. Therefore, we aim to offer an overview of existing
benchmarks and their solution approaches and propose a unified perspective for
measuring the physical reasoning capacity of AI systems. We select benchmarks
that are designed to test algorithmic performance in physical reasoning tasks.
While each of the selected benchmarks poses a unique challenge, their ensemble
provides a comprehensive proving ground for a generalist AI agent, yielding a
measurable skill level for each of the various physical reasoning concepts. This
gives such an ensemble of benchmarks an advantage over holistic benchmarks that
aim to simulate the real world and therefore intertwine its complexity with many
concepts at once. We group the presented set of physical reasoning benchmarks
into subcategories so that narrower generalist AI agents can first be tested on
these groups.
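As a rough sketch of how such grouped evaluation could be organized, per-benchmark scores can be aggregated into per-subcategory skill levels; the subcategory names, benchmark names, and scores below are illustrative placeholders, not the paper's actual grouping.

```python
from statistics import mean

# Hypothetical grouping of physical reasoning benchmarks into subcategories.
# Category and benchmark names are placeholders for illustration only.
BENCHMARK_GROUPS = {
    "intuitive_physics_prediction": ["prediction_benchmark_a", "prediction_benchmark_b"],
    "tool_use_and_manipulation": ["manipulation_benchmark_a", "manipulation_benchmark_b"],
    "agent_environment_interaction": ["interaction_benchmark_a"],
}

def skill_profile(scores: dict[str, float]) -> dict[str, float]:
    """Aggregate per-benchmark scores (in [0, 1]) into per-subcategory skill levels."""
    profile = {}
    for category, benchmarks in BENCHMARK_GROUPS.items():
        covered = [scores[b] for b in benchmarks if b in scores]
        # A subcategory only gets a skill level if the agent was evaluated on it.
        profile[category] = mean(covered) if covered else float("nan")
    return profile

# Example: a narrow agent evaluated on two of the three subcategories.
agent_scores = {
    "prediction_benchmark_a": 0.82,
    "prediction_benchmark_b": 0.74,
    "manipulation_benchmark_a": 0.41,
}
print(skill_profile(agent_scores))
```

Under this kind of aggregation, a narrow agent receives a skill profile only for the groups it was evaluated on, while a generalist agent accumulates measurable scores across all subcategories.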
Related papers
- Evaluating Generative AI Systems is a Social Science Measurement Challenge [78.35388859345056]
We present a framework for measuring concepts related to the capabilities, impacts, opportunities, and risks of GenAI systems.
The framework distinguishes between four levels: the background concept, the systematized concept, the measurement instrument(s), and the instance-level measurements themselves.
arXiv Detail & Related papers (2024-11-17T02:35:30Z)
- OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI [73.75520820608232]
We introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities.
These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage.
Our evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration.
arXiv Detail & Related papers (2024-06-18T16:20:53Z)
- Levels of AGI for Operationalizing Progress on the Path to AGI [64.59151650272477]
We propose a framework for classifying the capabilities and behavior of Artificial General Intelligence (AGI) models and their precursors.
This framework introduces levels of AGI performance, generality, and autonomy, providing a common language to compare models, assess risks, and measure progress along the path to AGI.
arXiv Detail & Related papers (2023-11-04T17:44:58Z)
- Evaluating General-Purpose AI with Psychometrics [43.85432514910491]
We discuss the need for a comprehensive and accurate evaluation of general-purpose AI systems such as large language models.
Current evaluation methodology, mostly based on benchmarks of specific tasks, falls short of adequately assessing these versatile AI systems.
To tackle these challenges, we suggest transitioning from task-oriented evaluation to construct-oriented evaluation.
arXiv Detail & Related papers (2023-10-25T05:38:38Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
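As a rough illustration of this adaptive-testing idea (not the above paper's actual method), the sketch below uses a simple one-parameter item response model: each test item has a difficulty, the test keeps a running ability estimate, and the next item is the one that is most informative at that estimate.

```python
import math

def p_correct(ability: float, difficulty: float) -> float:
    """Rasch (1PL) model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_information(ability: float, difficulty: float) -> float:
    """Fisher information of an item at the current ability estimate."""
    p = p_correct(ability, difficulty)
    return p * (1.0 - p)

def update_ability(responses: list[tuple[float, bool]]) -> float:
    """Crude maximum-likelihood update via grid search over ability values."""
    grid = [x / 10 for x in range(-40, 41)]  # abilities from -4.0 to 4.0
    def log_lik(theta: float) -> float:
        return sum(
            math.log(p_correct(theta, d) if correct else 1 - p_correct(theta, d))
            for d, correct in responses
        )
    return max(grid, key=log_lik)

def adaptive_test(item_difficulties: list[float], answer, n_items: int = 5) -> float:
    """Pick the most informative remaining item each round, then re-estimate ability."""
    remaining = list(item_difficulties)
    ability, responses = 0.0, []
    for _ in range(min(n_items, len(remaining))):
        item = max(remaining, key=lambda d: item_information(ability, d))
        remaining.remove(item)
        responses.append((item, answer(item)))
        ability = update_ability(responses)
    return ability

# Example with a toy "agent" that solves items easier than difficulty 0.5.
print(adaptive_test([-2.0, -1.0, 0.0, 1.0, 2.0], answer=lambda d: d < 0.5))
```

The grid-search ability update is deliberately crude; a real psychometric pipeline would fit item parameters from response data and use a proper estimator.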
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- A System's Approach Taxonomy for User-Centred XAI: A Survey [0.6882042556551609]
We propose a unified, inclusive and user-centred taxonomy for XAI based on the principles of General System's Theory.
This provides a basis for evaluating the appropriateness of XAI approaches for all user types, including both developers and end users.
arXiv Detail & Related papers (2023-03-06T00:50:23Z)
- Mapping global dynamics of benchmark creation and saturation in artificial intelligence [5.233652342195164]
We create maps of the global dynamics of benchmark creation and saturation.
We curated data for 1688 benchmarks covering the entire domains of computer vision and natural language processing.
arXiv Detail & Related papers (2022-03-09T09:16:49Z)
- Phy-Q: A Benchmark for Physical Reasoning [5.45672244836119]
We propose a new benchmark that requires an agent to reason about physical scenarios and take an action accordingly.
Inspired by the physical knowledge acquired in infancy and the capabilities required for robots to operate in real-world environments, we identify 15 essential physical scenarios.
For each scenario, we create a wide variety of distinct task templates, and we ensure all the task templates within the same scenario can be solved by using one specific physical rule.
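A minimal sketch of how such a scenario-and-template structure could be scored follows; the scenario names, the seen/held-out split, and the scoring are illustrative assumptions rather than Phy-Q's published protocol.

```python
from statistics import mean

# Hypothetical scenarios, each mapping to (seen_templates, held_out_templates).
# All templates in one scenario are intended to obey the same physical rule.
SCENARIOS = {
    "rolling": (["roll_t1", "roll_t2"], ["roll_t3"]),
    "falling": (["fall_t1"], ["fall_t2", "fall_t3"]),
}

def scenario_scores(solved: set[str]) -> dict[str, tuple[float, float]]:
    """Per scenario: (pass rate on seen templates, pass rate on held-out templates)."""
    scores = {}
    for name, (seen, held_out) in SCENARIOS.items():
        seen_rate = mean(t in solved for t in seen)
        unseen_rate = mean(t in solved for t in held_out)
        scores[name] = (seen_rate, unseen_rate)
    return scores

# Example: an agent that does well on seen templates but transfers unevenly.
print(scenario_scores(solved={"roll_t1", "roll_t2", "roll_t3", "fall_t1"}))
```

Separating seen from held-out templates makes it possible to distinguish an agent that has merely memorized specific tasks from one that has picked up the scenario's underlying physical rule.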
arXiv Detail & Related papers (2021-08-31T09:11:27Z)
- An Extensible Benchmark Suite for Learning to Simulate Physical Systems [60.249111272844374]
We introduce a set of benchmark problems to take a step towards unified benchmarks and evaluation protocols.
We propose four representative physical systems, as well as a collection of both widely used classical time-based and representative data-driven methods.
arXiv Detail & Related papers (2021-08-09T17:39:09Z)
- An interdisciplinary conceptual study of Artificial Intelligence (AI) for helping benefit-risk assessment practices: Towards a comprehensive qualification matrix of AI programs and devices (pre-print 2020) [55.41644538483948]
This paper presents a comprehensive analysis of existing concepts from different disciplines that tackle the notion of intelligence.
The aim is to identify shared notions or discrepancies to consider when qualifying AI systems.
arXiv Detail & Related papers (2021-05-07T12:01:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.