RuleSmith: Multi-Agent LLMs for Automated Game Balancing
- URL: http://arxiv.org/abs/2602.06232v1
- Date: Thu, 05 Feb 2026 22:19:44 GMT
- Title: RuleSmith: Multi-Agent LLMs for Automated Game Balancing
- Authors: Ziyao Zeng, Chen Liu, Tianyu Liu, Hao Wang, Xiatao Sun, Fengyu Yang, Xiaofeng Liu, Zhiwen Fan
- Abstract summary: RuleSmith is the first framework that achieves automated game balancing by leveraging the reasoning capabilities of multi-agent LLMs. It couples a game engine, multi-agent LLM self-play, and Bayesian optimization operating over a multi-dimensional rule space. We instantiate RuleSmith on CivMini, a simplified civilization-style game governed by tunable parameters.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Game balancing is a longstanding challenge requiring repeated playtesting, expert intuition, and extensive manual tuning. We introduce RuleSmith, the first framework that achieves automated game balancing by leveraging the reasoning capabilities of multi-agent LLMs. It couples a game engine, multi-agent LLM self-play, and Bayesian optimization operating over a multi-dimensional rule space. As a proof of concept, we instantiate RuleSmith on CivMini, a simplified civilization-style game containing heterogeneous factions, economy systems, production rules, and combat mechanics, all governed by tunable parameters. LLM agents interpret textual rulebooks and game states to generate actions, enabling fast evaluation of balance metrics such as win-rate disparities. To search the parameter landscape efficiently, we integrate Bayesian optimization with acquisition-based adaptive sampling and discrete projection: promising candidates receive more evaluation games for accurate assessment, while exploratory candidates receive fewer games for efficient exploration. Experiments show that RuleSmith converges to highly balanced configurations and provides interpretable rule adjustments that can be directly applied to downstream game systems. Our results illustrate that LLM simulation can serve as a powerful surrogate for automating design and balancing in complex multi-agent environments.
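The adaptive-sampling idea in the abstract can be sketched in a few lines of Python. This is a minimal, hypothetical illustration, not the authors' implementation: a stub win-rate simulator stands in for the LLM self-play games, the surrogate/acquisition machinery of full Bayesian optimization is replaced by a simple screen-then-refine loop over a discrete rule grid, and all names (`PARAM_GRID`, `combat_bonus`, game counts) are invented for the example.

```python
import random

random.seed(0)

# Hypothetical discrete rule space: one tunable combat-bonus parameter.
PARAM_GRID = [round(0.1 * i, 1) for i in range(1, 10)]

def simulate_game(combat_bonus: float) -> int:
    """Stub for one self-play game: returns 1 if faction A wins.
    In this toy model the game is perfectly balanced at combat_bonus == 0.5."""
    p_a_wins = min(1.0, max(0.0, combat_bonus))
    return 1 if random.random() < p_a_wins else 0

def disparity(combat_bonus: float, n_games: int) -> float:
    """Estimate the win-rate disparity |wr_A - wr_B| from n_games games."""
    wins = sum(simulate_game(combat_bonus) for _ in range(n_games))
    wr_a = wins / n_games
    return abs(wr_a - (1.0 - wr_a))

def adaptive_search() -> float:
    """Screen every candidate cheaply; re-evaluate promising ones with more games."""
    best, best_gap = PARAM_GRID[0], float("inf")
    for candidate in PARAM_GRID:
        gap = disparity(candidate, n_games=20)    # cheap exploratory pass
        if gap < best_gap:                        # promising: spend more games
            gap = disparity(candidate, n_games=200)
        if gap < best_gap:
            best, best_gap = candidate, gap
    return best
```

In the full framework the cheap pass would correspond to low-budget exploratory evaluations and the refinement pass to acquisition-selected candidates receiving extra games, with the continuous BO suggestions projected back onto the discrete rule grid.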
Related papers
- Beyond Playtesting: A Generative Multi-Agent Simulation System for Massively Multiplayer Online Games [5.045496863924638]
We propose a generative agent-based MMO simulation system empowered by Large Language Models (LLMs). LLMs adapt from general priors to game-specific domains, enabling realistic and interpretable player decision-making. Experiments demonstrate strong consistency with real-world player behaviors and plausible causal responses under interventions.
arXiv Detail & Related papers (2025-12-02T03:01:17Z)
- Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games [29.194229891848853]
Orak is a benchmark designed to train and evaluate Large Language Model (LLM) agents across diverse real-world video games. To support consistent evaluation of LLMs, we introduce a plug-and-play interface based on the Model Context Protocol (MCP). Orak offers a comprehensive evaluation framework, encompassing general game score leaderboards, LLM battle arenas, and in-depth analyses of visual input state, agentic strategies, and fine-tuning effects.
arXiv Detail & Related papers (2025-06-04T06:40:33Z)
- Fundamental Limits of Game-Theoretic LLM Alignment: Smith Consistency and Preference Matching [23.0436612817548]
Nash Learning from Human Feedback is a framework for aligning large language models with human preferences by modeling learning as a zero-sum game. In this paper, we study which choices of payoff, based on pairwise human preferences, can yield desirable alignment properties.
arXiv Detail & Related papers (2025-05-27T02:07:35Z)
- Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks. However, they still struggle with problems requiring multi-step decision-making and environmental feedback. We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
- Approximating Human Strategic Reasoning with LLM-Enhanced Recursive Reasoners Leveraging Multi-agent Hypergames [3.5083201638203154]
We implement a role-based multi-agent strategic interaction framework tailored to sophisticated reasoners. We use one-shot, 2-player beauty contests to evaluate the reasoning capabilities of the latest LLMs. Our experiments show that artificial reasoners can outperform the baseline model in terms of both approximating human behaviour and reaching the optimal solution.
arXiv Detail & Related papers (2025-02-11T10:37:20Z)
- RPGBENCH: Evaluating Large Language Models as Role-Playing Game Engines [34.002194150560086]
We present RPGBench, the first benchmark designed to evaluate large language models (LLMs) as text-based role-playing game (RPG) engines. RPGBench comprises two core tasks: Game Creation (GC) and Game Simulation (GS).
arXiv Detail & Related papers (2025-02-01T23:40:24Z)
- GAMEBoT: Transparent Assessment of LLM Reasoning in Games [54.49589494014147]
GAMEBoT is a gaming arena designed for rigorous assessment of Large Language Models. We benchmark 17 prominent LLMs across eight games, encompassing various strategic abilities and game characteristics. Our results suggest that GAMEBoT presents a significant challenge, even when LLMs are provided with detailed CoT prompts.
arXiv Detail & Related papers (2024-12-18T08:32:53Z)
- Scaffolded Language Models with Language Supervision for Mixed-Autonomy: A Survey [52.00674453604779]
This survey organizes the literature on the design and optimization of emerging structures around post-trained LMs. We refer to this overarching structure as scaffolded LMs and focus on LMs that are integrated into multi-step processes with tools.
arXiv Detail & Related papers (2024-10-21T18:06:25Z)
- LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models [87.49676980090555]
Large Language Models (LLMs) have demonstrated notable capabilities across various tasks, showcasing complex problem-solving abilities.
We introduce LogicGame, a novel benchmark designed to evaluate the comprehensive rule understanding, execution, and planning capabilities of LLMs.
arXiv Detail & Related papers (2024-08-28T13:16:41Z)
- Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [68.29746557968107]
We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans. Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the interaction steps of agents.
arXiv Detail & Related papers (2024-05-23T08:33:19Z)
- GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications.
This paper evaluates LLMs' reasoning abilities in competitive environments.
We first propose GTBench, a language-driven environment composing 10 widely recognized tasks.
arXiv Detail & Related papers (2024-02-19T18:23:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.