AGI-Elo: How Far Are We From Mastering A Task?
- URL: http://arxiv.org/abs/2505.12844v2
- Date: Sat, 24 May 2025 05:25:10 GMT
- Title: AGI-Elo: How Far Are We From Mastering A Task?
- Authors: Shuo Sun, Yimin Zhao, Christina Dao Wen Lee, Jiawei Sun, Chengran Yuan, Zefan Huang, Dongen Li, Justin KW Yeoh, Alok Prakash, Thomas W. Malone, Marcelo H. Ang Jr
- Abstract summary: This paper introduces a unified rating system that jointly models the difficulty of individual test cases and the competency of AI models (or humans) across vision, language, and action domains. We validate the generalizability and robustness of our system through extensive experiments on multiple established datasets and models across distinct AGI domains.
- Score: 8.378767006620294
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As the field progresses toward Artificial General Intelligence (AGI), there is a pressing need for more comprehensive and insightful evaluation frameworks that go beyond aggregate performance metrics. This paper introduces a unified rating system that jointly models the difficulty of individual test cases and the competency of AI models (or humans) across vision, language, and action domains. Unlike existing metrics that focus solely on models, our approach allows for fine-grained, difficulty-aware evaluations through competitive interactions between models and tasks, capturing both the long-tail distribution of real-world challenges and the competency gap between current models and full task mastery. We validate the generalizability and robustness of our system through extensive experiments on multiple established datasets and models across distinct AGI domains. The resulting rating distributions offer novel perspectives and interpretable insights into task difficulty, model progression, and the outstanding challenges that remain on the path to achieving full AGI task mastery.
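The abstract frames evaluation as competitive interactions between models and test cases, with both sides carrying a rating, in the spirit of Elo. The sketch below illustrates that idea using the standard Elo expected-score formula and a symmetric update; the function names, K-factor, and scaling are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of an Elo-style rating between a model and a test case.
# Assumption: standard Elo expected score and a zero-sum update; the paper's
# actual rating model (K-factor, scale, priors) may differ.

def expected_score(r_model: float, r_case: float, scale: float = 400.0) -> float:
    """Probability that the model 'solves' the test case given current ratings:
    1 / (1 + 10^((r_case - r_model) / scale))."""
    return 1.0 / (1.0 + 10.0 ** ((r_case - r_model) / scale))

def update_ratings(r_model: float, r_case: float, solved: bool, k: float = 32.0):
    """One competitive interaction: the model's rating rises and the test case's
    difficulty rating falls when the case is solved, and vice versa."""
    expected = expected_score(r_model, r_case)
    outcome = 1.0 if solved else 0.0
    r_model_new = r_model + k * (outcome - expected)
    r_case_new = r_case - k * (outcome - expected)  # zero-sum counterpart
    return r_model_new, r_case_new

# Example: a 1500-rated model fails a hard, 1700-rated test case,
# so the model's rating drops slightly and the case's difficulty rating rises.
print(update_ratings(1500.0, 1700.0, solved=False))
```

Running such updates over every (model, test case) pair in a benchmark yields the rating distributions the abstract refers to: a difficulty distribution over test cases (exposing the long tail of hard cases) and a competency distribution over models (exposing the gap to full task mastery).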
Related papers
- VISTA: A Visual Analytics Framework to Enhance Foundation Model-Generated Data Labels [30.699079182148054]
We introduce VISTA, a visual analytics framework that improves data quality to enhance the performance of multi-modal models. We show how VISTA integrates multi-phased data validation strategies with human expertise, enabling humans to identify, understand, and correct hidden issues within FM-generated labels.
arXiv Detail & Related papers (2025-07-11T20:17:23Z)
- Vision Generalist Model: A Survey [87.49797517847132]
We provide a comprehensive overview of vision generalist models, delving into their characteristics and capabilities within the field. We take a brief excursion into related domains, shedding light on their interconnections and potential synergies.
arXiv Detail & Related papers (2025-06-11T17:23:41Z)
- ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges [72.19809898215857]
We introduce ModelingBench, a novel benchmark featuring real-world-inspired, open-ended problems from math modeling competitions across diverse domains. These tasks require translating natural language into formal mathematical formulations, applying appropriate tools, and producing structured, defensible reports. We also present ModelingAgent, a multi-agent framework that coordinates tool use and supports structured problem solving to generate well-grounded, creative solutions.
arXiv Detail & Related papers (2025-05-21T03:33:23Z)
- Automated Capability Discovery via Model Self-Exploration [5.404186221463082]
We introduce Automated Capability Discovery (ACD), a framework that designates one foundation model as a scientist to propose open-ended tasks. ACD automatically uncovers both surprising capabilities and failures in the subject model. We demonstrate ACD across a range of foundation models, showing that it automatically reveals thousands of capabilities that would be challenging for any single team to uncover.
arXiv Detail & Related papers (2025-02-11T14:23:13Z)
- BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games [44.16513620589459]
We introduce BALROG, a novel benchmark to assess the agentic capabilities of Large Language Models (LLMs) and Vision Language Models (VLMs). Our benchmark incorporates a range of existing reinforcement learning environments with varying levels of difficulty, ranging from tasks that non-expert humans can solve in seconds to extremely challenging ones that may take years to master. Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks.
arXiv Detail & Related papers (2024-11-20T18:54:32Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Fusing Models with Complementary Expertise [42.099743709292866]
We consider the Fusion of Experts (FoE) problem of fusing outputs of expert models with complementary knowledge of the data distribution.
Our method is applicable to both discriminative and generative tasks.
We extend our method to the "frugal" setting where it is desired to reduce the number of expert model evaluations at test time.
arXiv Detail & Related papers (2023-10-02T18:31:35Z)
- OpenAGI: When LLM Meets Domain Experts [51.86179657467822]
Human Intelligence (HI) excels at combining basic skills to solve complex tasks.
This capability is vital for Artificial Intelligence (AI) and should be embedded in comprehensive AI Agents.
We introduce OpenAGI, an open-source platform designed for solving multi-step, real-world tasks.
arXiv Detail & Related papers (2023-04-10T03:55:35Z)
- Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models [80.23791222509644]
Inconsistent AI models are considered brittle and untrustworthy by human users.
We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks.
We propose a rank correlation-based auxiliary training objective, computed over large automatically created cross-task contrast sets.
arXiv Detail & Related papers (2023-03-28T16:57:12Z)
- GLUECons: A Generic Benchmark for Learning Under Constraints [102.78051169725455]
In this work, we create a benchmark that is a collection of nine tasks in the domains of natural language processing and computer vision.
We model external knowledge as constraints, specify the sources of the constraints for each task, and implement various models that use these constraints.
arXiv Detail & Related papers (2023-02-16T16:45:36Z)
- DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models.
Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z)
- Goal-Aware Prediction: Learning to Model What Matters [105.43098326577434]
One of the fundamental challenges in using a learned forward dynamics model is the mismatch between the objective of the learned model and that of the downstream planner or policy.
We propose to direct prediction towards task relevant information, enabling the model to be aware of the current task and encouraging it to only model relevant quantities of the state space.
We find that our method more effectively models the relevant parts of the scene conditioned on the goal, and as a result outperforms standard task-agnostic dynamics models and model-free reinforcement learning.
arXiv Detail & Related papers (2020-07-14T16:42:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality or accuracy of this information and is not responsible for any consequences of its use.