Related papers: OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

URL: http://arxiv.org/abs/2406.12753v1
Date: Tue, 18 Jun 2024 16:20:53 GMT
Title: OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
Authors: Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, Yikai Zhang, Yuqing Yang, Ting Wu, Binjie Wang, Shichao Sun, Yang Xiao, Yiyuan Li, Fan Zhou, Steffi Chern, Yiwei Qin, Yan Ma, Jiadi Su, Yixiu Liu, Yuxiang Zheng, Shaoting Zhang, Dahua Lin, Yu Qiao, Pengfei Liu
Abstract summary: We introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities. These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage. Our evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration.
Score: 73.75520820608232
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The evolution of Artificial Intelligence (AI) has been significantly accelerated by advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning abilities in problem-solving and scientific discovery (i.e., AI4Science) once exclusive to human intellect. To comprehensively evaluate current models' performance in cognitive reasoning abilities, we introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities. These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage. We argue that the challenges in Olympic competition problems are ideal for evaluating AI's cognitive reasoning due to their complexity and interdisciplinary nature, which are essential for tackling complex scientific challenges and facilitating discoveries. Beyond evaluating performance across various disciplines using answer-only criteria, we conduct detailed experiments and analyses from multiple perspectives. We delve into the models' cognitive reasoning abilities, their performance across different modalities, and their outcomes in process-level evaluations, which are vital for tasks requiring complex reasoning with lengthy solutions. Our extensive evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration. Through the OlympicArena, we aim to advance AI towards superintelligence, equipping it to address more complex challenges in science and beyond. We also provide a comprehensive set of resources to support AI research, including a benchmark dataset, an open-source annotation platform, a detailed evaluation tool, and a leaderboard with automatic submission features.

Related papers

Towards Autonomous Mathematics Research [48.29504087871558]
We introduce Aletheia, a math research agent that iteratively generates, verifies, and revises solutions end-to-end in natural language. Specifically, Aletheia is powered by an advanced version of Gemini Deep Think for challenging reasoning problems. We demonstrate Aletheia from Olympiad problems to PhD-level exercises and most notably, through several distinct milestones in AI-assisted mathematics research.
arXiv Detail & Related papers (2026-02-10T18:50:15Z)
ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning [118.46980291324148]
ATLAS is a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Its key features include: High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage. Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating their advanced scientific reasoning capabilities.
arXiv Detail & Related papers (2025-11-18T11:13:06Z)
MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams [50.293164501645975]
Multimodal large language models (MLLMs) integrate language and visual cues for problem-solving. Current benchmarks for measuring the intelligence of MLLMs suffer from limited scale, narrow coverage, and unstructured knowledge. We introduce MDK12-Bench, a large-scale multidisciplinary benchmark built from real-world K-12 exams spanning six disciplines.
arXiv Detail & Related papers (2025-08-09T06:21:10Z)
AI4Research: A Survey of Artificial Intelligence for Scientific Research [55.5452803680643]
We present a comprehensive survey on AI for Research (AI4Research). We first introduce a systematic taxonomy to classify five mainstream tasks in AI4Research. We identify key research gaps and highlight promising future directions.
arXiv Detail & Related papers (2025-07-02T17:19:20Z)
Challenges for AI in Multimodal STEM Assessments: a Human-AI Comparison [15.814479753448412]
Generative AI systems have rapidly advanced, with multimodal input capabilities enabling reasoning beyond text-based tasks. In education, these advancements could influence assessment design and question answering, presenting both opportunities and challenges. Our study analyzes how these features affect generative AI performance compared to students.
arXiv Detail & Related papers (2025-07-02T12:06:46Z)
AI Education in a Mirror: Challenges Faced by Academic and Industry Experts [15.332866859177747]
This study provides preliminary insights into challenges AI professionals encounter in both academia and industry. We identify key challenges related to data quality and availability, model scalability, practical constraints, user behavior, and explainability. These exploratory findings suggest that AI curricula could better integrate real-world complexities, software engineering principles, and interdisciplinary learning.
arXiv Detail & Related papers (2025-05-02T16:52:49Z)
Understanding and Benchmarking Artificial Intelligence: OpenAI's o3 Is Not AGI [0.0]
OpenAI's o3 achieves a high score of 87.5% on ARC-AGI, a benchmark proposed to measure intelligence. This raises the question of whether systems based on Large Language Models (LLMs), particularly o3, demonstrate intelligence and progress towards artificial general intelligence (AGI).
arXiv Detail & Related papers (2025-01-13T16:28:01Z)
Artificial Intelligence for Collective Intelligence: A National-Scale Research Strategy [7.644091133650435]
Pressing challenges in healthcare, finance, infrastructure and sustainability might all be productively addressed by leveraging AI for national-scale collective intelligence. The development and deployment of this kind of AI faces distinctive challenges, both technical and socio-technical. Here, a research strategy for mobilising inter-disciplinary research to address these challenges is detailed and some of the key issues that must be faced are outlined.
arXiv Detail & Related papers (2024-11-09T15:25:43Z)
Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA [43.116608441891096]
Humans outperform AI systems in knowledge-grounded abductive and conceptual reasoning. State-of-the-art LLMs like GPT-4 and LLaMA show superior performance on targeted information retrieval.
arXiv Detail & Related papers (2024-10-09T03:53:26Z)
Evaluation of OpenAI o1: Opportunities and Challenges of AGI [112.0812059747033]
o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance. The model excelled in tasks requiring intricate reasoning and knowledge integration across various fields. Overall results indicate significant progress towards artificial general intelligence.
arXiv Detail & Related papers (2024-09-27T06:57:00Z)
Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI [129.08019405056262]
Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General Intelligence (AGI). MLMs and WMs have attracted significant attention due to their remarkable perception, interaction, and reasoning capabilities. In this survey, we give a comprehensive exploration of the latest advancements in Embodied AI.
arXiv Detail & Related papers (2024-07-09T14:14:47Z)
Applications of Explainable artificial intelligence in Earth system science [12.454478986296152]
This review aims to provide a foundational understanding of explainable AI (XAI). XAI offers a set of powerful tools that make the models more transparent. We identify four significant challenges that XAI faces within the Earth system science (ESS). A visionary outlook for ESS envisions a harmonious blend where process-based models govern the known, AI models explore the unknown, and XAI bridges the gap by providing explanations.
arXiv Detail & Related papers (2024-06-12T15:05:29Z)
DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents [49.74065769505137]
We introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent's ability to perform complete cycles of novel scientific discovery. It includes 120 different challenge tasks spanning eight topics each with three levels of difficulty and several parametric variations. We find that strong baseline agents, that perform well in prior published environments, struggle on most DISCOVERYWORLD tasks.
arXiv Detail & Related papers (2024-06-10T20:08:44Z)
Artificial Intelligence for Science in Quantum, Atomistic, and Continuum Systems [268.585904751315]
A new area of research is known as AI for science (AI4Science). Its areas aim at understanding the physical world from subatomic (wavefunctions and electron density), atomic (molecules, proteins, materials, and interactions), to macro (fluids, climate, and subsurface) scales. A key common challenge is how to capture physics first principles, especially symmetries, in natural systems by deep learning methods.
arXiv Detail & Related papers (2023-07-17T12:14:14Z)
Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models [83.63242931107638]
We propose four characteristics of generally intelligent agents. We argue that active engagement with objects in the real world delivers more robust signals for forming conceptual representations. We conclude by outlining promising future research directions in the field of artificial general intelligence.
arXiv Detail & Related papers (2023-07-07T13:58:16Z)
Explainable Artificial Intelligence Approaches: A Survey [0.22940141855172028]
Lack of explainability of a decision from an Artificial Intelligence based "black box" system/model is a key stumbling block for adopting AI in high stakes applications. We demonstrate popular Explainable Artificial Intelligence (XAI) methods with a mutual case study/task. We analyze for competitive advantages from multiple perspectives. We recommend paths towards responsible or human-centered AI using XAI as a medium.
arXiv Detail & Related papers (2021-01-23T06:15:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
