Related papers: SC2Arena and StarEvolve: Benchmark and Self-Improvement Framework for LLMs in Complex Decision-Making Tasks

URL: http://arxiv.org/abs/2508.10428v1
Date: Thu, 14 Aug 2025 07:58:01 GMT
Title: SC2Arena and StarEvolve: Benchmark and Self-Improvement Framework for LLMs in Complex Decision-Making Tasks
Authors: Pengbo Shen, Yaqing Wang, Ni Mu, Yao Luan, Runpeng Xie, Senhao Yang, Lexiang Wang, Hao Hu, Shuang Xu, Yiqin Yang, Bo Xu
Abstract summary: Existing benchmarks for tasks like StarCraft II fail to capture the game's full complexity. We present SC2Arena, a benchmark that fully supports all playable races and low-level action spaces, and optimizes text-based observations to tackle spatial reasoning challenges. We introduce StarEvolve, a hierarchical framework that integrates strategic planning with tactical execution.
Score: 24.84821125790223
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating large language models (LLMs) in complex decision-making is essential for advancing AI's ability for strategic planning and real-time adaptation. However, existing benchmarks for tasks like StarCraft II fail to capture the game's full complexity, such as its complete game context, diverse action spaces, and all playable races. To address this gap, we present SC2Arena, a benchmark that fully supports all playable races and low-level action spaces, and optimizes text-based observations to tackle spatial reasoning challenges. Complementing this, we introduce StarEvolve, a hierarchical framework that integrates strategic planning with tactical execution, featuring iterative self-correction and continuous improvement via fine-tuning on high-quality gameplay data. Its key components include a Planner-Executor-Verifier structure to break down gameplay, and a scoring system for selecting high-quality training samples. Comprehensive analysis using SC2Arena provides valuable insights into developing generalist agents that were not possible with previous benchmarks. Experimental results also demonstrate that our proposed StarEvolve achieves superior performance in strategic planning. Our code, environment, and algorithms are publicly available.

Related papers

TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents [5.173133826653683]
We present TowerMind, a novel environment grounded in the tower defense subgenre of RTS games. We design five benchmark levels to evaluate several widely used Large Language Models. Results reveal a clear performance gap between LLMs and human experts across both capability and hallucination dimensions.
arXiv Detail & Related papers (2026-01-09T16:18:08Z)
EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems [59.66823584073748]
A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time. We present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients, by evolving the entire agentic system after every episode.
arXiv Detail & Related papers (2025-10-15T07:16:28Z)
Society of Mind Meets Real-Time Strategy: A Hierarchical Multi-Agent Framework for Strategic Reasoning [16.35236123729838]
We propose a hierarchical multi-agent framework that employs specialized imitation learning agents under a meta-controller called Strategic Planner (SP). Through expert demonstrations, each specialized agent learns a distinctive strategy, such as aerial support or defensive maneuvers, and produces coherent, structured multistep action sequences. The SP then orchestrates these proposals into a single, environmentally adaptive plan that ensures local decisions align with long-term strategies.
arXiv Detail & Related papers (2025-08-08T05:57:12Z)
Who is a Better Player: LLM against LLM [53.46608216197315]
We propose an adversarial benchmarking framework to assess the comprehensive performance of Large Language Models (LLMs) through board game competition. We introduce Qi Town, a specialized evaluation platform that supports 5 widely played games and involves 20 LLM-driven players.
arXiv Detail & Related papers (2025-08-05T06:41:47Z)
AVA: Attentive VLM Agent for Mastering StarCraft II [56.07921367623274]
We introduce Attentive VLM Agent (AVA), a multimodal StarCraft II agent that aligns artificial agent perception with the human gameplay experience. Our agent addresses this limitation by incorporating RGB visual inputs and natural language observations that more closely simulate human cognitive processes during gameplay.
arXiv Detail & Related papers (2025-03-07T12:54:25Z)
TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs [45.12542636218608]
We propose TMGBench, characterized by comprehensive game type coverage, diverse scenarios and flexible game organization. Specifically, we incorporate all 144 game types summarized by the Robinson-Goforth topology of 2x2 games, constructed as classic games in our benchmark. To provide a sustainable evaluation framework adaptable to increasingly powerful LLMs, we treat the aforementioned games as atomic units.
arXiv Detail & Related papers (2024-10-14T13:15:34Z)
Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach [7.693497788883165]
Large language model (LLM) agents, such as Voyage and MetaGPT, show immense potential in solving intricate tasks. We propose a Chain of Summarization method, including single-frame summarization for processing raw observations and multi-frame summarization for analyzing game information. Experimental results demonstrate that: 1. LLMs possess the relevant knowledge and complex planning abilities needed to address StarCraft II scenarios; 2. Human experts consider the performance of LLM agents to be close to that of an average player who has played StarCraft II for eight years; 3. LLM agents are capable of defeating the built-in AI.
arXiv Detail & Related papers (2023-12-19T05:27:16Z)
Deep Policy Networks for NPC Behaviors that Adapt to Changing Design Parameters in Roguelike Games [137.86426963572214]
Turn-based strategy games like Roguelikes, for example, present unique challenges to Deep Reinforcement Learning (DRL). We propose two network architectures to better handle complex categorical state spaces and to mitigate the need for retraining forced by design decisions.
arXiv Detail & Related papers (2020-12-07T08:47:25Z)
The Design Of "Stratega": A General Strategy Games Framework [62.997667081978825]
Stratega is a framework for creating turn-based and real-time strategy games. The framework has been built with a focus on statistical forward planning (SFP) agents. We hope that the development of this framework and its respective agents helps to better understand the complex decision-making process in strategy games.
arXiv Detail & Related papers (2020-09-11T20:02:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.