Fugu-MT Paper Translation (Abstract): Training the Orchestrator: A Supervised Approach to End-to-End PDDL Planning with LLM Agents

Paper overview: Training the Orchestrator: A Supervised Approach to End-to-End PDDL Planning with LLM Agents

arxiv url: http://arxiv.org/abs/2606.21740v1
Date: Fri, 19 Jun 2026 20:53:10 GMT
Status: Translation complete
Last updated in system: 2026-06-26 03:37:35.472861
Title: Training the Orchestrator: A Supervised Approach to End-to-End PDDL Planning with LLM Agents
Title (reference translation): Training the Orchestrator: End-to-End PDDL Planning with LLM Agents
Authors: Rajesh Mangannavar, Zachary Coalson, Pranay Dugar, Prasad Tadepalli,
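Abstract summary: We present HALO (Hybrid Agent-Learned Orchestrator), which trains the orchestrator from refinement trajectories that an external verifier has certified as ending in valid plans. Unlike approaches that prompt a frontier LLM at every step or learn an orchestrator from sparse end-of-episode rewards, HALO builds on the observation that the verifier already provides strong guidance. Across PlanBench, Natural Plan, and classical planning benchmarks, HALO matches or exceeds the GPT-5-mini baseline on success rate.
Reference score (independently computed attention score): 7.954705422811771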
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Translating natural-language planning intent into verified plans is a longstanding challenge: people communicate goals in language, while classical planners require formal PDDL specifications. Recent agentic frameworks bridge this gap by orchestrating a pool of specialized repair agents inside a verifier-checked refinement loop, but the orchestrator at the centre is itself a prompted frontier LLM, paying a frontier-LLM API call at every refinement step. We present HALO (Hybrid Agent-Learned Orchestrator), which trains the orchestrator from refinement trajectories that an external verifier has certified as ending in valid plans, across 11 PDDL domains. HALO pairs a small QLoRA-tuned policy with three hardcoded rules for trivially decidable selections, and operates over an expanded 21-agent action space. Unlike approaches that prompt a frontier LLM at every step or learn an orchestrator from sparse end-of-episode rewards, our key observation is that the verifier already provides strong guidance: every accepted trajectory is a sequence of demonstrably correct (state, agent) decisions, directly usable as supervision. Across PlanBench, Natural Plan, and classical planning benchmarks, HALO matches or exceeds the GPT-5-mini prompted baseline on success rate, sits within three percentage points of the stronger Gemini-3-Flash prompted baseline, reduces orchestration cost by more than an order of magnitude ($0.18 to $0.004 per task against GPT-5-mini, roughly 45× cheaper; roughly 15× cheaper than Gemini-3-Flash), and cuts total LLM calls per episode by 40 to 50 percent.
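Abstract (reference translation): Translating natural-language planning intent into verified plans is a longstanding challenge, because people communicate goals in language while classical planners require formal PDDL specifications. Recent agentic frameworks bridge this gap by orchestrating a pool of specialized repair agents inside a verifier-checked refinement loop, but the central orchestrator is itself a frontier LLM and pays for a frontier-LLM API call at every refinement step. We propose HALO (Hybrid Agent-Learned Orchestrator), which trains the orchestrator, across 11 PDDL domains, from refinement trajectories that an external verifier has certified as ending in valid plans. HALO pairs a small QLoRA-tuned policy with three hardcoded rules for trivially decidable selections and operates over an expanded 21-agent action space. Unlike approaches that prompt a frontier LLM at every step or learn an orchestrator from sparse end-of-episode rewards, our key observation is that the verifier already provides strong guidance: every accepted trajectory is a sequence of demonstrably correct (state, agent) decisions that can be used directly as supervision. Across PlanBench, Natural Plan, and classical planning benchmarks, HALO matches or exceeds the GPT-5-mini baseline on success rate, stays within three percentage points of the stronger Gemini-3-Flash baseline, and reduces orchestration cost by more than an order of magnitude ($0.18 to $0.004 per task against GPT-5-mini, roughly 45× cheaper; roughly 15× cheaper than Gemini-3-Flash).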
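
The abstract's key observation, that every verifier-accepted trajectory yields correct (state, agent) decisions usable as supervision, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Step` and `Trajectory` types, the textual state summaries, and the agent names are all hypothetical and exist only for illustration. The final line reproduces the per-task cost ratio quoted in the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    state: str  # hypothetical text summary of the refinement state
    agent: str  # hypothetical name of the repair agent chosen at this step


@dataclass(frozen=True)
class Trajectory:
    steps: Tuple[Step, ...]
    verified: bool  # True if an external verifier accepted the final plan


def supervised_pairs(trajectories: List[Trajectory]) -> List[Tuple[str, str]]:
    """Flatten verifier-accepted trajectories into (state, agent) examples."""
    return [
        (step.state, step.agent)
        for traj in trajectories
        if traj.verified  # rejected trajectories contribute no supervision
        for step in traj.steps
    ]


# Toy data (invented): one accepted and one rejected trajectory.
trajectories = [
    Trajectory(
        steps=(
            Step("syntax error in domain file", "syntax_fixer"),
            Step("plan violates a precondition", "precondition_fixer"),
        ),
        verified=True,
    ),
    Trajectory(
        steps=(Step("syntax error in domain file", "goal_rewriter"),),
        verified=False,
    ),
]

pairs = supervised_pairs(trajectories)

# Cost figures quoted in the abstract: $0.18 vs. $0.004 per task.
cost_ratio = 0.18 / 0.004  # roughly 45
```

Under these assumptions, only the two steps of the accepted trajectory become training examples; how the paper actually encodes states and selects among its 21 agents is not specified in the abstract.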

Related papers