Fugu-MT Paper Translation (Abstract): ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs

Paper Summary: ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs

arxiv url: http://arxiv.org/abs/2603.18614v1
Date: Thu, 19 Mar 2026 08:33:54 GMT
Status: Translation complete
Last updated in system: 2026-03-20 17:19:06.034488
Title: ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs
Title (reference translation): ZEBRAARENA: A Diagnostic Simulation Environment for Reasoning-Action Coupling in Tool-Augmented LLMs
Authors: Wanjia Zhao, Ludwig Schmidt, James Zou, Vidhisha Balachandran, Lingjiao Chen
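Abstract summary: We introduce ZebraArena, a diagnostic environment for studying reasoning-action coupling in tool-augmented large language models. Each task in ZebraArena requires a set of critical information that is available only through targeted tool use. ZebraArena requires a combination of in-depth reasoning and accurate external tool calling, which remains a challenge.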
Reference score (independently computed attention score): 54.7743875084328
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Tool-augmented large language models (LLMs) must tightly couple multi-step reasoning with external actions, yet existing benchmarks often confound this interplay with complex environment dynamics, memorized knowledge, or dataset contamination. In this paper, we introduce ZebraArena, a procedurally generated diagnostic environment for studying reasoning-action coupling in tool-augmented LLMs, with controllable difficulty and a knowledge-minimal design, which limits gains from memorization or dataset contamination. Each task in ZebraArena requires a set of critical information which is available only through targeted tool use, yielding an interpretable interface between external information acquisition and deductive reasoning. This design provides deterministic evaluation via unique solutions, and a theoretically optimal query count for measuring efficient tool use. We show that ZebraArena requires a combination of in-depth reasoning and accurate external tool calling, which remains a challenge, as frontier reasoning models such as GPT-5 and Gemini 2.5 Pro achieve only 60% accuracy on the hard instances. We also observe persistent gaps between theoretical optimality and practical tool usage. For example, GPT-5 uses 70-270% more tool calls than the theoretical optimum. We highlight the key findings in our evaluation, and hope ZebraArena stimulates further research on the interplay between internal reasoning and external action.
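Abstract (reference translation): Tool-augmented large language models (LLMs) must tightly couple multi-step reasoning with external actions, but existing benchmarks often confound this interplay with complex environment dynamics, memorized knowledge, or dataset contamination. This paper introduces ZebraArena, a procedurally generated diagnostic environment for studying reasoning-action coupling in tool-augmented LLMs. Each task in ZebraArena requires a set of critical information that is available only through targeted tool use, providing an interpretable interface between external information acquisition and deductive reasoning. This design offers deterministic evaluation via unique solutions and a theoretically optimal query count for measuring efficient tool use. ZebraArena requires a combination of in-depth reasoning and accurate external tool calling, which remains a challenge: frontier reasoning models such as GPT-5 and Gemini 2.5 Pro achieve only 60% accuracy on the hard instances. Persistent gaps between theoretical optimality and practical tool usage are also observed. For example, GPT-5 uses 70-270% more tool calls than the theoretical optimum. The evaluation highlights key findings, and the authors hope ZebraArena stimulates further research on the interplay between internal reasoning and external action.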
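
The efficiency figure in the abstract (70-270% more tool calls than the theoretical optimum) can be read as a relative overhead. The following is a minimal sketch, assuming overhead is defined as (used - optimal) / optimal; the abstract does not state the exact formula. The function name `tool_call_overhead` and the sample counts are hypothetical and are not taken from the paper.

```python
def tool_call_overhead(used_calls: int, optimal_calls: int) -> float:
    """Return the percentage of extra tool calls relative to the optimum.

    Assumed definition (not confirmed by the abstract):
        overhead = 100 * (used - optimal) / optimal
    """
    if optimal_calls <= 0:
        raise ValueError("optimal_calls must be positive")
    return 100.0 * (used_calls - optimal_calls) / optimal_calls


# Hypothetical counts chosen only to illustrate the reported range;
# they are not results from the paper.
print(tool_call_overhead(17, 10))  # 70.0
print(tool_call_overhead(37, 10))  # 270.0
```

Under this reading, a model that issues exactly the optimal number of queries has an overhead of 0%, and a 270% overhead means 3.7 times as many calls as the optimum.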
