OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
- URL: http://arxiv.org/abs/2406.12753v1
- Date: Tue, 18 Jun 2024 16:20:53 GMT
- Title: OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
- Authors: Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, Yikai Zhang, Yuqing Yang, Ting Wu, Binjie Wang, Shichao Sun, Yang Xiao, Yiyuan Li, Fan Zhou, Steffi Chern, Yiwei Qin, Yan Ma, Jiadi Su, Yixiu Liu, Yuxiang Zheng, Shaoting Zhang, Dahua Lin, Yu Qiao, Pengfei Liu
- Abstract summary: We introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities.
These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage.
Our evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration.
- Score: 73.75520820608232
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The evolution of Artificial Intelligence (AI) has been significantly accelerated by advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning abilities in problem-solving and scientific discovery (i.e., AI4Science) once exclusive to human intellect. To comprehensively evaluate current models' performance in cognitive reasoning abilities, we introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities. These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage. We argue that the challenges in Olympic competition problems are ideal for evaluating AI's cognitive reasoning due to their complexity and interdisciplinary nature, which are essential for tackling complex scientific challenges and facilitating discoveries. Beyond evaluating performance across various disciplines using answer-only criteria, we conduct detailed experiments and analyses from multiple perspectives. We delve into the models' cognitive reasoning abilities, their performance across different modalities, and their outcomes in process-level evaluations, which are vital for tasks requiring complex reasoning with lengthy solutions. Our extensive evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration. Through the OlympicArena, we aim to advance AI towards superintelligence, equipping it to address more complex challenges in science and beyond. We also provide a comprehensive set of resources to support AI research, including a benchmark dataset, an open-source annotation platform, a detailed evaluation tool, and a leaderboard with automatic submission features.
Related papers
- Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI [95.96983812740683]
Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General Intelligence (AGI).
Multi-modal Large Models (MLMs) and World Models (WMs) have attracted significant attention due to their remarkable perception, interaction, and reasoning capabilities.
In this survey, we give a comprehensive exploration of the latest advancements in Embodied AI.
arXiv Detail & Related papers (2024-07-09T14:14:47Z)
- Applications of Explainable artificial intelligence in Earth system science [12.454478986296152]
This review aims to provide a foundational understanding of explainable AI (XAI).
XAI offers a set of powerful tools that make the models more transparent.
We identify four significant challenges that XAI faces within Earth system science (ESS).
A visionary outlook for ESS envisions a harmonious blend where process-based models govern the known, AI models explore the unknown, and XAI bridges the gap by providing explanations.
arXiv Detail & Related papers (2024-06-12T15:05:29Z)
- DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents [49.74065769505137]
We introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent's ability to perform complete cycles of novel scientific discovery.
It includes 120 different challenge tasks spanning eight topics each with three levels of difficulty and several parametric variations.
We find that strong baseline agents that perform well in prior published environments struggle on most DISCOVERYWORLD tasks.
arXiv Detail & Related papers (2024-06-10T20:08:44Z)
- Cognition is All You Need -- The Next Layer of AI Above Large Language Models [0.0]
We present Cognitive AI, a framework for neurosymbolic cognition outside of large language models.
We propose that Cognitive AI is a necessary precursor for the evolution of the forms of AI, such as AGI, and specifically claim that AGI cannot be achieved by probabilistic approaches on their own.
We conclude with a discussion of the implications for large language models, adoption cycles in AI, and commercial Cognitive AI development.
arXiv Detail & Related papers (2024-03-04T16:11:57Z)
- OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems [62.06169250463104]
We present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions.
The best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics.
Our analysis of GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies.
arXiv Detail & Related papers (2024-02-21T18:49:26Z)
- Benchmarks for Physical Reasoning AI [28.02418565463541]
We offer an overview of existing benchmarks and their solution approaches for measuring the physical reasoning capacity of AI systems.
We select benchmarks that are designed to test algorithmic performance in physical reasoning tasks.
We group the presented set of physical reasoning benchmarks into subcategories so that more narrow generalist AI agents can be tested first on these groups.
arXiv Detail & Related papers (2023-12-17T14:24:03Z)
- General Purpose Artificial Intelligence Systems (GPAIS): Properties, Definition, Taxonomy, Societal Implications and Responsible Governance [16.030931070783637]
General-Purpose Artificial Intelligence Systems (GPAIS) has been defined to refer to these AI systems.
To date, the possibility of an Artificial General Intelligence, powerful enough to perform any intellectual task as if it were human, or even improve on it, has remained an aspiration or a fiction, and has been considered a risk for our society.
This work discusses existing definitions for GPAIS and proposes a new definition that allows for a gradual differentiation among types of GPAIS according to their properties and limitations.
arXiv Detail & Related papers (2023-07-26T16:35:48Z)
- Artificial Intelligence for Science in Quantum, Atomistic, and Continuum Systems [245.1050780515017]
AI is giving rise to a new area of research known as AI for science (AI4Science).
These areas aim at understanding the physical world from subatomic (wavefunctions and electron density), atomic (molecules, proteins, materials, and interactions), to macro (fluids, climate, and subsurface) scales.
A key common challenge is how to capture physics first principles, especially symmetries, in natural systems by deep learning methods.
arXiv Detail & Related papers (2023-07-17T12:14:14Z)
- Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models [83.63242931107638]
We propose four characteristics of generally intelligent agents.
We argue that active engagement with objects in the real world delivers more robust signals for forming conceptual representations.
We conclude by outlining promising future research directions in the field of artificial general intelligence.
arXiv Detail & Related papers (2023-07-07T13:58:16Z)
- Explainable Artificial Intelligence Approaches: A Survey [0.22940141855172028]
Lack of explainability of a decision from an Artificial Intelligence based "black box" system/model is a key stumbling block for adopting AI in high stakes applications.
We demonstrate popular Explainable Artificial Intelligence (XAI) methods with a mutual case study/task.
We analyze for competitive advantages from multiple perspectives.
We recommend paths towards responsible or human-centered AI using XAI as a medium.
arXiv Detail & Related papers (2021-01-23T06:15:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.