The AI Productivity Index (APEX)
- URL: http://arxiv.org/abs/2509.25721v2
- Date: Thu, 02 Oct 2025 05:47:47 GMT
- Title: The AI Productivity Index (APEX)
- Authors: Bertie Vidgen, Abby Fennelly, Evan Pinnix, Chirag Mahapatra, Zach Richards, Austin Bridges, Calix Huang, Ben Hunsberger, Fez Zafar, Brendan Foody, Dominic Barton, Cass R. Sunstein, Eric Topol, Osvald Nitski
- Abstract summary: We introduce the first version of the AI Productivity Index (APEX), a benchmark for assessing whether frontier AI models can perform knowledge work with high economic value. APEX-v1.0 contains 200 test cases and covers four domains: investment banking, management consulting, law, and primary medical care. We evaluate 23 frontier models on APEX-v1.0 using an LM judge. GPT 5 (Thinking = High) achieves the highest mean score (64.2%), followed by Grok 4 (61.3%) and Gemini 2.5 Flash (Thinking = On) (60.4%).
- Score: 4.122962658725304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the first version of the AI Productivity Index (APEX), a benchmark for assessing whether frontier AI models can perform knowledge work with high economic value. APEX addresses one of the largest inefficiencies in AI research: outside of coding, benchmarks often fail to test economically relevant capabilities. APEX-v1.0 contains 200 test cases and covers four domains: investment banking, management consulting, law, and primary medical care. It was built in three steps. First, we sourced experts with top-tier experience (e.g., investment bankers from Goldman Sachs). Second, experts created prompts that reflect high-value tasks in their day-to-day work. Third, experts created rubrics for evaluating model responses. We evaluate 23 frontier models on APEX-v1.0 using an LM judge. GPT 5 (Thinking = High) achieves the highest mean score (64.2%), followed by Grok 4 (61.3%) and Gemini 2.5 Flash (Thinking = On) (60.4%). Qwen 3 235B is the best-performing open-source model and seventh best overall. There is a large gap between the performance of even the best models and human experts, highlighting the need for better measurement of models' ability to produce economically valuable work.
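The abstract describes an evaluation protocol in which each expert-written prompt is paired with an expert-written rubric and model responses are scored by an LM judge. The sketch below illustrates how such rubric-based judging could be wired up; the `TestCase` structure, the YES/NO criterion prompt, and per-criterion averaging are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of rubric-based LM-judge scoring in the spirit of APEX-v1.0.
# The prompt format and scoring scheme are assumptions, not the authors' code.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    prompt: str        # expert-written task prompt (e.g., drafting a client memo)
    rubric: list[str]  # expert-written criteria the response should satisfy

def judge_response(judge_llm: Callable[[str], str], response: str, rubric: list[str]) -> float:
    """Ask the LM judge whether the response meets each rubric criterion;
    return the fraction of criteria met (hypothetical scoring scheme)."""
    met = 0
    for criterion in rubric:
        verdict = judge_llm(
            f"Criterion: {criterion}\nResponse: {response}\n"
            "Does the response satisfy the criterion? Answer YES or NO."
        )
        met += verdict.strip().upper().startswith("YES")
    return met / len(rubric)

def evaluate_model(candidate_llm: Callable[[str], str],
                   judge_llm: Callable[[str], str],
                   cases: list[TestCase]) -> float:
    """Mean rubric score over all test cases, analogous to the reported percentages."""
    scores = [judge_response(judge_llm, candidate_llm(c.prompt), c.rubric) for c in cases]
    return sum(scores) / len(scores)
```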
Related papers
- The Strategic Foresight of LLMs: Evidence from a Fully Prospective Venture Tournament [0.19116784879310025]
We benchmarked forecasts against 346 experienced managers recruited via Prolific and three MBA-trained investors working under monitored conditions. The results are striking: human evaluators achieved rank correlations with actual outcomes between 0.04 and 0.45, while several frontier LLMs exceeded 0.60, with the best (Gemini 2.5 Pro) reaching 0.74. Neither wisdom-of-the-crowd ensembles nor human-AI hybrid teams outperformed the best standalone model.
arXiv Detail & Related papers (2026-02-02T05:52:16Z) - APEX-Agents [4.209210727546437]
We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High).
arXiv Detail & Related papers (2026-01-20T18:53:44Z) - Can Deep Research Agents Find and Organize? Evaluating the Synthesis Gap with Expert Taxonomies [57.11324429385405]
We introduce TaxoBench, a diagnostic benchmark derived from 72 computer science surveys. We manually extract expert-authored taxonomy trees containing 3,815 precisely categorized citations as ground truth. The best agent recalls only 20.9% of expert-selected papers, and even with perfect input, the best model achieves only 0.31 ARI in organization.
arXiv Detail & Related papers (2026-01-18T11:57:09Z) - APEX-SWE [4.927317067589892]
We introduce the AI Productivity Index for Software Engineering (APEX-SWE), a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work. Gemini 3 Pro (Thinking = High) performs best, with a Pass@1 score of 25%.
arXiv Detail & Related papers (2026-01-13T18:44:08Z) - Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image [58.14192385042352]
We introduce Multimodal RewardBench 2 (MMRB2), the first benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning. It provides 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks.
arXiv Detail & Related papers (2025-12-18T18:56:04Z) - Predicting Empirical AI Research Outcomes with Language Models [27.148683265085012]
Many promising-looking ideas in AI research fail to deliver, but validating them takes substantial human labor and compute. We build the first benchmark for this task and compare LMs with human experts. We scrape ideas and experimental results from conference papers, yielding 1,585 human-verified idea pairs published after our base model's cut-off date for testing. We develop a system that combines a fine-tuned GPT-4.1 with a paper retrieval agent, and we recruit 25 human experts for comparison. In the NLP domain, our system beats human experts by a large margin (64.4% vs. 48.
arXiv Detail & Related papers (2025-06-01T02:46:31Z) - VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models [66.56298924208319]
Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems. Current assessment methods primarily rely on AI-annotated preference labels from traditional tasks. We introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks.
arXiv Detail & Related papers (2024-11-26T14:08:34Z) - Preference Optimization for Reasoning with Pseudo Feedback [100.62603571434167]
We introduce a novel approach to generating pseudo feedback for reasoning tasks by framing the labeling of solutions as an evaluation against associated test cases (see the sketch after this list). We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks.
arXiv Detail & Related papers (2024-11-25T12:44:02Z) - RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts [4.06186944042499]
We introduce RE-Bench, which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 human experts. We find that the best AI agents achieve a score 4x higher than human experts when both are given a total time budget of 2 hours per environment. Humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2x the score of the top AI agent when both are given 32 total hours (across different attempts).
arXiv Detail & Related papers (2024-11-22T18:30:46Z) - A Comparative Study on Reasoning Patterns of OpenAI's o1 Model [69.08287909042421]
We show that OpenAI's o1 model has achieved the best performance on most datasets.
We also provide a detailed analysis of several reasoning benchmarks.
arXiv Detail & Related papers (2024-10-17T15:09:03Z) - MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation [38.076276626337766]
MMEvalPro is a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. MMEvalPro comprises 2,138 question triplets, totaling 6,414 distinct questions. Our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more challenging than existing benchmarks.
arXiv Detail & Related papers (2024-06-29T15:28:45Z) - Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z) - Professional Certification Benchmark Dataset: The First 500 Jobs For Large Language Models [0.0]
The research creates a professional certification survey to test large language models and evaluate their employable skills.
It compares the performance of two AI models, GPT-3 and Turbo-GPT3.5, on a benchmark dataset of 1149 professional certifications.
arXiv Detail & Related papers (2023-05-07T00:56:58Z)
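The pseudo-feedback entry above (Preference Optimization for Reasoning with Pseudo Feedback) labels candidate solutions by evaluating them against associated test cases rather than with human annotations. A minimal sketch under that reading is below: run each candidate against the tests and turn pass-rate differences into preference pairs. The function names and the pass-rate ranking are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: turning test-case outcomes into pseudo preference pairs for
# coding tasks. Names and the pass-rate ranking are illustrative assumptions.
from itertools import combinations
from typing import Callable

def pass_rate(solution: Callable, test_cases: list[tuple]) -> float:
    """Fraction of (inputs, expected_output) cases the candidate solution passes."""
    passed = 0
    for inputs, expected in test_cases:
        try:
            passed += solution(*inputs) == expected
        except Exception:
            pass  # a crashing solution simply fails the case
    return passed / len(test_cases)

def pseudo_preference_pairs(solutions: list[Callable], test_cases: list[tuple]):
    """Yield (preferred, rejected) pairs whenever one candidate passes strictly
    more test cases than another -- pseudo feedback in place of human labels."""
    rates = [pass_rate(s, test_cases) for s in solutions]
    for (i, ri), (j, rj) in combinations(enumerate(rates), 2):
        if ri > rj:
            yield solutions[i], solutions[j]
        elif rj > ri:
            yield solutions[j], solutions[i]
```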