Fugu-MT Paper Translation (Abstract): Beyond Scaling: Assessing Strategic Reasoning and Rapid Decision-Making Capability of LLMs in Zero-sum Environments

Paper summary: Beyond Scaling: Assessing Strategic Reasoning and Rapid Decision-Making Capability of LLMs in Zero-sum Environments

arxiv url: http://arxiv.org/abs/2603.09337v1
Date: Tue, 10 Mar 2026 08:14:13 GMT
Status: Translation complete
Last updated in system: 2026-03-11 15:25:24.143452
Title: Beyond Scaling: Assessing Strategic Reasoning and Rapid Decision-Making Capability of LLMs in Zero-sum Environments
Title (reference translation): Beyond Scaling: Evaluating the Strategic Reasoning and Rapid Decision-Making Capabilities of LLMs in Zero-Sum Environments
Authors: Yang Li, Xing Chen, Yutao Liu, Gege Qi, Yanxian BI, Zizhe Wang, Yunjian Zhang, Yao Zhu
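Abstract summary: This paper introduces the Strategic Tactical Agent Reasoning (STAR) Benchmark, a multi-agent evaluation framework. STAR supports both turn-based and real-time settings, enabling controlled analysis of long-horizon strategic planning. The results show that strategic intelligence in interactive environments depends not only on reasoning depth but also on the ability to translate plans into timely actions.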
Reference score (independently computed attention score): 15.538910160052964
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have achieved strong performance on static reasoning benchmarks, yet their effectiveness as interactive agents operating in adversarial, time-sensitive environments remains poorly understood. Existing evaluations largely treat reasoning as a single-shot capability, overlooking the challenges of opponent-aware decision-making, temporal constraints, and execution under pressure. This paper introduces Strategic Tactical Agent Reasoning (STAR) Benchmark, a multi-agent evaluation framework that assesses LLMs through 1v1 zero-sum competitive interactions, framing reasoning as an iterative, adaptive decision-making process. STAR supports both turn-based and real-time settings, enabling controlled analysis of long-horizon strategic planning and fast-paced tactical execution within a unified environment. Built on a modular architecture with a standardized API and fully implemented execution engine, STAR facilitates reproducible evaluation and flexible task customization. To move beyond binary win-loss outcomes, we introduce a Strategic Evaluation Suite that assesses not only competitive success but also the quality of strategic behavior, such as execution efficiency and outcome stability. Extensive pairwise evaluations reveal a pronounced strategy-execution gap: while reasoning-intensive models dominate turn-based settings, their inference latency often leads to inferior performance in real-time scenarios, where faster instruction-tuned models prevail. These results show that strategic intelligence in interactive environments depends not only on reasoning depth, but also on the ability to translate plans into timely actions, positioning STAR as a principled benchmark for studying this trade-off in competitive, dynamic settings.
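Abstract (reference translation): Large language models (LLMs) have achieved strong performance on static reasoning benchmarks, but their effectiveness as interactive agents operating in adversarial, time-sensitive environments remains poorly understood. Existing evaluations treat reasoning as a single-shot capability and overlook challenges such as opponent-aware decision-making, temporal constraints, and execution under pressure. This paper introduces the Strategic Tactical Agent Reasoning (STAR) Benchmark, a multi-agent evaluation framework that evaluates LLMs through 1v1 zero-sum competitive interactions. STAR supports both turn-based and real-time settings, enabling controlled analysis of long-horizon strategic planning and fast-paced tactical execution within a unified environment. Built on a modular architecture with a standardized API and a fully implemented execution engine, STAR facilitates reproducible evaluation and flexible task customization. To go beyond binary win-loss outcomes, it introduces a Strategic Evaluation Suite that assesses not only competitive success but also the quality of strategic behavior, such as execution efficiency and outcome stability. While reasoning-intensive models dominate turn-based settings, their inference latency often leads to inferior performance in real-time scenarios, where faster instruction-tuned models prevail. These results suggest that strategic intelligence in interactive environments depends not only on reasoning depth but also on the ability to translate plans into timely actions.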


Related papers