Fugu-MT Paper Translation (Abstract): GraphPO: Graph-based Policy Optimization for Reasoning Models

Paper abstract: GraphPO: Graph-based Policy Optimization for Reasoning Models

arxiv url: http://arxiv.org/abs/2606.18954v1
Date: Wed, 17 Jun 2026 11:37:54 GMT
Status: Translation complete
Last updated in system: 2026-06-18 17:16:51.135997
Title: GraphPO: Graph-based Policy Optimization for Reasoning Models
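Title (reference translation): GraphPO: Graph-based Policy Optimization for Reasoning Models
Authors: Yuliang Zhan, Xinyu Tang, Jian Li, Dandan Zheng, Weilong Chai, Jingdong Chen, Jun Zhou, Ge Wu, Wenyue Tang, Hao Sun
Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. Tree-based methods partly address RLVR's redundant exploration and sparse rewards by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. The proposed GraphPO is a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes.
Reference score (independently computed attention score): 39.010538168884786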
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using rewards from final answers. This paradigm has two limitations. First, independently sampled responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address these problems by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information, so they repeat similar exploration. Moreover, tree-based methods ignore this dispersion of similar states across branches and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address these challenges, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcomes. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines under the same token budgets or response budgets.
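Abstract (reference translation): RLVR (Reinforcement Learning with Verifiable Rewards) has become a standard paradigm for improving the capability of large reasoning models. RLVR usually samples responses independently and optimizes the policy from the final answers. This paradigm has two limitations. First, independent responses often contain similar intermediate reasoning steps, which causes redundant exploration and wasted computation. Second, sparse final-answer rewards make it difficult to identify useful steps. Tree-based methods address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and end up repeating similar exploration. In addition, tree-based methods ignore such dispersion and perform only local comparisons within separate branches. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a new RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, letting them share suffixes and reallocating budget from redundant expansions to diverse exploration. Furthermore, it improves inference efficiency while deriving process supervision from outcomes. Theory shows that GraphPO reduces the variance of advantage estimation and enhances reasoning efficiency. Experiments with three LLMs on reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain-based and tree-based baselines under the same token or response budgets.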
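
The graph structure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `RolloutGraph`, the use of a plain string key as the "semantic state", the mean-over-children value estimate, and the value-difference form of the edge advantage are all assumptions made for illustration. The abstract does not specify how states are summarized or how the efficiency and correctness advantages are computed.

```python
class RolloutGraph:
    """Toy rollout DAG: nodes are semantic-state keys, edges are reasoning steps.

    Hypothetical structure for illustration only, not the GraphPO implementation.
    """

    def __init__(self, root="ROOT"):
        self.root = root
        self.children = {}  # state -> {step_text: next_state}
        self.rewards = {}   # terminal state -> verifiable outcome reward

    def add_step(self, state, step, next_state):
        # Paths that reach an already-known state key reuse that node.
        # This is the "merge into an equivalence class" idea: the shared
        # node is expanded once, and every path into it shares its suffixes.
        self.children.setdefault(state, {})[step] = next_state

    def set_reward(self, state, reward):
        self.rewards[state] = float(reward)

    def value(self, state, cache):
        # Assumed estimator: terminal reward, else mean value of distinct children.
        if state not in cache:
            if state in self.rewards:
                cache[state] = self.rewards[state]
            else:
                kids = set(self.children.get(state, {}).values())
                total = sum(self.value(k, cache) for k in kids)
                cache[state] = total / len(kids) if kids else 0.0
        return cache[state]

    def edge_advantages(self):
        # Assumed form: advantage of a step = value(next_state) - value(state).
        cache = {}
        return {
            (state, step): self.value(nxt, cache) - self.value(state, cache)
            for state, steps in self.children.items()
            for step, nxt in steps.items()
        }


# Two differently worded steps reach the same semantic state S1 and are merged.
g = RolloutGraph()
g.add_step("ROOT", "factor the quadratic", "S1")
g.add_step("ROOT", "complete the square", "S1")
g.add_step("ROOT", "guess a root", "S2")
g.add_step("S1", "solve correctly", "T_ok")
g.add_step("S1", "sign error", "T_bad")
g.add_step("S2", "give up", "T_fail")
g.set_reward("T_ok", 1.0)
g.set_reward("T_bad", 0.0)
g.set_reward("T_fail", 0.0)

adv = g.edge_advantages()
for edge, a in adv.items():
    print(edge, a)
```

In this toy graph the two steps into `S1` receive the same advantage because they share every continuation, which is the information sharing that separate tree branches would lack. A real system would need a learned or model-generated state summary in place of the hand-written keys used here.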

Related papers