InnoGym: Benchmarking the Innovation Potential of AI Agents
- URL: http://arxiv.org/abs/2512.01822v1
- Date: Mon, 01 Dec 2025 16:03:04 GMT
- Title: InnoGym: Benchmarking the Innovation Potential of AI Agents
- Authors: Jintian Zhang, Kewei Xu, Jingsheng Zheng, Zhuoyun Yu, Yuqi Zhu, Yujie Luo, Lanning Wei, Shuofei Qiao, Lun Du, Da Zheng, Shumin Deng, Huajun Chen, Ningyu Zhang
- Abstract summary: InnoGym is the first benchmark designed to evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches.
- Score: 74.64144272881414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.
Related papers
- ProxyWar: Dynamic Assessment of LLM Code Generation in Game Arenas [11.101957427633614]
We present ProxyWar, a novel framework that systematically assesses code generation quality. Unlike existing approaches, ProxyWar evaluates not only functional correctness but also the operational characteristics of generated programs.
arXiv Detail & Related papers (2026-02-04T07:57:06Z) - Let the Barbarians In: How AI Can Accelerate Systems Performance Research [80.43506848683633]
We term this iterative cycle of generation, evaluation, and refinement AI-Driven Research for Systems. We demonstrate that ADRS-generated solutions can match or even outperform human state-of-the-art designs.
arXiv Detail & Related papers (2025-12-16T18:51:23Z) - AlphaResearch: Accelerating New Algorithm Discovery with Language Models [60.502137348923156]
Large language models have made significant progress in complex but easy-to-verify problems, yet they still struggle with discovering the unknown. We present AlphaResearch, an autonomous research agent designed to discover new algorithms on open-ended problems.
arXiv Detail & Related papers (2025-11-11T18:03:22Z) - Towards Outcome-Oriented, Task-Agnostic Evaluation of AI Agents [1.0305173936249623]
This white paper proposes a novel framework of eleven outcome-based, task-agnostic performance metrics for AI agents. We introduce metrics such as Goal Completion Rate (GCR), Autonomy Index (AIx), Multi-Step Task Resilience (MTR), and Business Impact Efficiency (BIE). Our results reveal significant performance trade-offs between different agent designs, highlighting the Hybrid Agent as the most consistently high-performing model.
arXiv Detail & Related papers (2025-11-11T13:40:46Z) - Barbarians at the Gate: How AI is Upending Systems Research [58.95406995634148]
We argue that systems research, long focused on designing and evaluating new performance-oriented algorithms, is particularly well-suited for AI-driven solution discovery. We term this approach AI-Driven Research for Systems (ADRS), which iteratively generates, evaluates, and refines solutions. Our results highlight both the disruptive potential and the urgent need to adapt systems research practices in the age of AI.
arXiv Detail & Related papers (2025-10-07T17:49:24Z) - HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization [31.908590128913094]
HeuriGym is an agentic framework designed for evaluating algorithms generated by Large Language Models (LLMs). We evaluate nine state-of-the-art models on nine problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use, planning, and adaptive reasoning. Our open-source benchmark aims to guide the development of LLMs toward more effective and realistic problem-solving in scientific and engineering domains.
arXiv Detail & Related papers (2025-06-09T17:46:47Z) - On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI systems, which autonomously plan and act, are becoming widespread, yet their success rate on complex tasks remains low. Inference-time alignment relies on three components: sampling, evaluation, and feedback. We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly inserts feedback extracted from different forms of critiques.
arXiv Detail & Related papers (2025-04-02T17:40:47Z) - The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature. We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z) - Don't Bet on Luck Alone: Enhancing Behavioral Reproducibility of Quality-Diversity Solutions in Uncertain Domains [2.639902239625779]
We introduce the Archive Reproducibility Improvement Algorithm (ARIA).
ARIA is a plug-and-play approach that improves the quality of solutions present in an archive.
We show that our algorithm enhances the quality and descriptor space coverage of any given archive by at least 50%.
arXiv Detail & Related papers (2023-04-07T14:45:14Z) - The Meta-Evaluation Problem in Explainable AI: Identifying Reliable Estimators with MetaQuantus [10.135749005469686]
One of the unsolved challenges in the field of Explainable AI (XAI) is determining how to most reliably estimate the quality of an explanation method.
We address this issue through a meta-evaluation of different quality estimators in XAI.
Our novel framework, MetaQuantus, analyses two complementary performance characteristics of a quality estimator.
arXiv Detail & Related papers (2023-02-14T18:59:02Z) - IM-IAD: Industrial Image Anomaly Detection Benchmark in Manufacturing [88.35145788575348]
Image anomaly detection (IAD) is an emerging and vital computer vision task in industrial manufacturing.
The lack of a uniform IM benchmark is hindering the development and usage of IAD methods in real-world applications.
We construct a comprehensive image anomaly detection benchmark (IM-IAD), which includes 19 algorithms on seven major datasets.
arXiv Detail & Related papers (2023-01-31T01:24:45Z) - A generalized framework for active learning reliability: survey and benchmark [0.0]
We propose a modular framework to build on-the-fly efficient active learning strategies.
We devise 39 strategies for the solution of 20 reliability benchmark problems.
arXiv Detail & Related papers (2021-06-03T09:33:59Z)