Fugu-MT Paper Translation (Summary): Plan Then Action:High-Level Planning Guidance Reinforcement Learning for LLM Reasoning

Paper summary: Plan Then Action:High-Level Planning Guidance Reinforcement Learning for LLM Reasoning

arxiv url: http://arxiv.org/abs/2510.01833v1
Date: Thu, 02 Oct 2025 09:28:13 GMT
Status: Translation complete
Last updated in system: 2025-10-03 16:59:21.075589
Title: Plan Then Action:High-Level Planning Guidance Reinforcement Learning for LLM Reasoning
Authors: Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Dinggen Zhang, Weida Wang, Towsif Raiyan, Benteng Chen, Qingtao Pan, Yang Ouyang, Zhiqiang Gao, Shufei Zhang, Sumon Biswas
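Abstract summary: This paper proposes a two-stage framework designed to improve both high-level planning and fine-grained CoT reasoning. In the first stage, advanced LLMs distill CoT into high-level guidance, which is then used for supervised fine-tuning. In the second stage, a guidance-aware RL method is introduced that jointly optimizes the final output and the quality of the high-level guidance.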
Reference score (independently computed attention score): 22.177866778776814
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have demonstrated remarkable reasoning abilities in complex tasks, often relying on Chain-of-Thought (CoT) reasoning. However, due to their autoregressive token-level generation, the reasoning process is largely constrained to local decision-making and lacks global planning. This limitation frequently results in redundant, incoherent, or inaccurate reasoning, which significantly degrades overall performance. Existing approaches, such as tree-based algorithms and reinforcement learning (RL), attempt to address this issue but suffer from high computational costs and often fail to produce optimal reasoning trajectories. To tackle this challenge, we propose Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization (PTA-GRPO), a two-stage framework designed to improve both high-level planning and fine-grained CoT reasoning. In the first stage, we leverage advanced LLMs to distill CoT into compact high-level guidance, which is then used for supervised fine-tuning (SFT). In the second stage, we introduce a guidance-aware RL method that jointly optimizes the final output and the quality of high-level guidance, thereby enhancing reasoning effectiveness. We conduct extensive experiments on multiple mathematical reasoning benchmarks, including MATH, AIME2024, AIME2025, and AMC, across diverse base models such as Qwen2.5-7B-Instruct, Qwen3-8B, Qwen3-14B, and LLaMA3.2-3B. Experimental results demonstrate that PTA-GRPO consistently achieves stable and significant improvements across different models and tasks, validating its effectiveness and generalization.
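The second stage described in the abstract can be illustrated with a minimal sketch of a group-relative advantage computation in the style of GRPO. The abstract does not say how the outcome signal and the guidance-quality signal are combined, so the weighted sum, the names `combined_reward` and `plan_weight`, and the sample numbers below are all assumptions made for illustration, not the authors' method.

```python
from statistics import mean, pstdev


def combined_reward(outcome_reward, plan_reward, plan_weight=0.5):
    """Merge the two reward signals into one scalar.

    ASSUMPTION: a simple weighted sum. The abstract only says the final
    output and the guidance quality are optimized jointly.
    """
    return outcome_reward + plan_weight * plan_reward


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization within one group of sampled responses.

    Each reward is centered on the group mean and scaled by the group
    standard deviation (population std is used here as a choice).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four responses sampled for a single prompt.
outcome = [1.0, 0.0, 1.0, 0.0]  # 1.0 if the final answer is correct
plan = [0.8, 0.6, 0.2, 0.1]     # made-up guidance-quality scores
rewards = [combined_reward(o, p) for o, p in zip(outcome, plan)]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, a response with a correct answer and a well-rated plan receives the largest advantage in its group, while a group whose responses all earn the same reward yields zero advantages and therefore no learning signal.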

Related papers