Fugu-MT Paper Translation (Abstract): PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

Paper overview: PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

arxiv url: http://arxiv.org/abs/2605.20873v1
Date: Wed, 20 May 2026 08:10:15 GMT
Status: Translation complete
Last updated in system: 2026-05-21 19:19:56.564974
Title: PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models
Authors: Ziliang Zhao, Zenan Xu, Shuting Wang, Hongjin Qian, Yan Lei, Minda Hu, Zhao Wang, Shihan Dou, Zhicheng Dou, Pluto Zhou
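Abstract summary: Planning is a fundamental capability of large language models (LLMs). PlanningBench is a framework for generating scalable, diverse, and verifiable planning data for both evaluation and training.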
Reference score (attention score computed independently by this site): 52.48858778580074
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Planning is a fundamental capability for large language models (LLMs) because such complex tasks require models to coordinate goals, constraints, resources, and long-term consequences into executable and verifiable solutions. Existing planning benchmarks, however, usually treat planning data as fixed collections of instances rather than controllable generation targets. This limits scenario coverage, ties difficulty to surface-level proxies rather than structural sources, and offers limited support for scalable generation, automatic verification, or planning-oriented training. We introduce PlanningBench, a framework for generating scalable, diverse, and verifiable planning data for both evaluation and training. PlanningBench starts from real planning scenarios and abstracts practical workflows into a structured taxonomy of more than 30 task types, subtasks, constraint families, and difficulty factors. Guided by this taxonomy, a constraint-driven synthesis pipeline instantiates self-contained planning problems with adaptive difficulty control, quality filtering, and instance-level verification checklists. This shifts planning data construction from fixed benchmark collection to controllable generation while preserving realistic task grounding. We use PlanningBench to evaluate open-source and closed-source frontier LLMs, and find that current models still struggle to produce complete solutions under coupled constraints. Beyond evaluation, reinforcement learning on verified PlanningBench data improves performance on unseen planning benchmarks and broader instruction-following tasks. Further analysis suggests that determinate or well-specified optimal solutions provide clearer reward signals and more stable training dynamics. Overall, PlanningBench provides a controllable source of planning data for diagnosing and improving generalizable planning abilities in LLMs.
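
To make the idea of an instance-level verification checklist more concrete, the following is a minimal sketch in Python. It is not PlanningBench's actual schema or pipeline: the toy scheduling instance, the names `ChecklistItem`, `DURATIONS`, `DEADLINE`, `CHECKLIST`, and `verify`, and the all-items-must-pass rule are assumptions made purely for illustration. It shows how a generated problem might carry its own list of machine-checkable constraints, so that a candidate plan can be accepted or rejected automatically.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: a plan maps each task name to its start hour.
Plan = Dict[str, int]


@dataclass
class ChecklistItem:
    """One machine-checkable requirement attached to a single problem instance."""
    description: str
    check: Callable[[Plan], bool]


# Toy instance (assumed for illustration): three tasks, one shared deadline.
DURATIONS = {"design": 2, "build": 3, "test": 1}
DEADLINE = 6


def finishes_before(first: str, second: str) -> Callable[[Plan], bool]:
    """Build a precedence check: `first` must end before `second` starts."""
    def check(plan: Plan) -> bool:
        return plan[first] + DURATIONS[first] <= plan[second]
    return check


def covers_all_tasks(plan: Plan) -> bool:
    """The plan must schedule exactly the tasks of the instance."""
    return set(plan) == set(DURATIONS)


def meets_deadline(plan: Plan) -> bool:
    """Every scheduled task must end no later than the deadline."""
    return all(plan[t] + DURATIONS[t] <= DEADLINE for t in plan)


CHECKLIST: List[ChecklistItem] = [
    ChecklistItem("every task is scheduled", covers_all_tasks),
    ChecklistItem("design finishes before build starts", finishes_before("design", "build")),
    ChecklistItem("build finishes before test starts", finishes_before("build", "test")),
    ChecklistItem("all tasks end by the deadline", meets_deadline),
]


def verify(plan: Plan, checklist: List[ChecklistItem]) -> Tuple[bool, List[Tuple[str, bool]]]:
    """Run every checklist item; a plan passes only if all items pass."""
    results = []
    for item in checklist:
        try:
            ok = bool(item.check(plan))
        except KeyError:
            # A missing or unknown task makes the item fail rather than crash.
            ok = False
        results.append((item.description, ok))
    return all(ok for _, ok in results), results


if __name__ == "__main__":
    passed, report = verify({"design": 0, "build": 2, "test": 5}, CHECKLIST)
    print(passed)
    for description, ok in report:
        print(("PASS" if ok else "FAIL"), description)
```

Under these assumptions, the overall pass/fail value is the kind of signal that could serve as a binary reward during reinforcement learning, while the per-item report indicates which constraint a model's plan violated.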
