AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games
- URL: http://arxiv.org/abs/2602.17594v1
- Date: Thu, 19 Feb 2026 18:17:25 GMT
- Title: AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games
- Authors: Lance Ying, Ryan Truong, Prafull Sharma, Kaiya Ivy Zhao, Nathan Cloos, Kelsey R. Allen, Thomas L. Griffiths, Katherine M. Collins, José Hernández-Orallo, Phillip Isola, Samuel J. Gershman, Joshua B. Tenenbaum,
- Abstract summary: We introduce the AI GameStore, a scalable and open-ended platform to synthesize new representative human games.<n>We generate 100 such games based on the top charts of Apple App Store and Steam, and evaluate seven frontier vision-language models (VLMs) on short episodes of play.<n>The best models achieved less than 10% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory and planning.
- Score: 63.29377274531968
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, quickly saturating as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how and how well they play and learn to play \textbf{all conceivable human games}, in comparison to human players with the same level of experience, time, or other resources. We define a "human game" to be a game designed by humans for humans, and argue for the evaluative suitability of this space of all such games people can imagine and enjoy -- the "Multiverse of Human Games". Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to synthesize new representative human games, by automatically sourcing and adapting standardized and containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10\% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory and planning. We conclude with a set of next steps for building out the AI GameStore as a practical way to measure and drive progress toward human-like general intelligence in machines.
Related papers
- People use fast, flat goal-directed simulation to reason about novel problems [68.55490343866545]
We show that people are systematic and adaptively rational in how they play a game for the first time.<n>We explain these capacities via a computational cognitive model that we call the "Intuitive Gamer"<n>Our work offers new insights into how people rapidly evaluate, act, and make suggestions when encountering novel problems.
arXiv Detail & Related papers (2025-10-13T15:12:08Z) - The Many Challenges of Human-Like Agents in Virtual Game Environments [1.1586742546971471]
This article is a survey of the most significant challenges in implementing human-like AI in games.<n>A machine-learning approach using a custom deep recurrent convolutional neural network is presented.<n>We hypothesize that the more challenging it is to create human-like AI for a given game, the easier it becomes to develop a method for distinguishing humans from AI-driven players.
arXiv Detail & Related papers (2025-05-26T14:00:39Z) - Human-like Bots for Tactical Shooters Using Compute-Efficient Sensors [13.743654443419384]
This paper introduces a novel methodology for training neural networks via imitation learning to play a complex, commercial-standard, VALORANT-like 2v2 tactical shooter game.<n>Our approach leverages an innovative, pixel-free perception architecture using a small set of ray-cast sensors, which capture essential spatial information efficiently.<n>Human evaluation tests confirm that our AI agents provide human-like gameplay experiences while operating efficiently under computational constraints.
arXiv Detail & Related papers (2024-12-30T12:06:37Z) - HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation [50.616995671367704]
We present a high-dimensional, simulated robot learning benchmark, HumanoidBench, featuring a humanoid robot equipped with dexterous hands.
Our findings reveal that state-of-the-art reinforcement learning algorithms struggle with most tasks, whereas a hierarchical learning approach achieves superior performance when supported by robust low-level policies.
arXiv Detail & Related papers (2024-03-15T17:45:44Z) - Can Machines Imitate Humans? Integrative Turing-like tests for Language and Vision Demonstrate a Narrowing Gap [56.611702960809644]
We benchmark AI's ability to imitate humans in three language tasks and three vision tasks.<n>Next, we conducted 72,191 Turing-like tests with 1,916 human judges and 10 AI judges.<n>Imitation ability showed minimal correlation with conventional AI performance metrics.
arXiv Detail & Related papers (2022-11-23T16:16:52Z) - JECC: Commonsense Reasoning Tasks Derived from Interactive Fictions [75.42526766746515]
We propose a new commonsense reasoning dataset based on human's Interactive Fiction (IF) gameplay walkthroughs.
Our dataset focuses on the assessment of functional commonsense knowledge rules rather than factual knowledge.
Experiments show that the introduced dataset is challenging to previous machine reading models as well as the new large language models.
arXiv Detail & Related papers (2022-10-18T19:20:53Z) - Generative Personas That Behave and Experience Like Humans [3.611888922173257]
generative AI agents attempt to imitate particular playing behaviors represented as rules, rewards, or human demonstrations.
We extend the notion of behavioral procedural personas to cater for player experience, thus examining generative agents that can both behave and experience their game as humans would.
Our findings suggest that the generated agents exhibit distinctive play styles and experience responses of the human personas they were designed to imitate.
arXiv Detail & Related papers (2022-08-26T12:04:53Z) - WinoGAViL: Gamified Association Benchmark to Challenge
Vision-and-Language Models [91.92346150646007]
In this work, we introduce WinoGAViL: an online game to collect vision-and-language associations.
We use the game to collect 3.5K instances, finding that they are intuitive for humans but challenging for state-of-the-art AI models.
Our analysis as well as the feedback we collect from players indicate that the collected associations require diverse reasoning skills.
arXiv Detail & Related papers (2022-07-25T23:57:44Z) - Aligning Superhuman AI with Human Behavior: Chess as a Model System [5.236087378443016]
We develop Maia, a customized version of Alpha-Zero trained on human chess games, that predicts human moves at a much higher accuracy than existing engines.
For a dual task of predicting whether a human will make a large mistake on the next move, we develop a deep neural network that significantly outperforms competitive baselines.
arXiv Detail & Related papers (2020-06-02T18:12:52Z) - Suphx: Mastering Mahjong with Deep Reinforcement Learning [114.68233321904623]
We design an AI for Mahjong, named Suphx, based on deep reinforcement learning with some newly introduced techniques.
Suphx has demonstrated stronger performance than most top human players in terms of stable rank.
This is the first time that a computer program outperforms most top human players in Mahjong.
arXiv Detail & Related papers (2020-03-30T16:18:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.