Fugu-MT paper translation (abstract): SiMPO: Measure Matching for Online Diffusion Reinforcement Learning

Paper abstract: SiMPO: Measure Matching for Online Diffusion Reinforcement Learning

arxiv url: http://arxiv.org/abs/2603.10250v1
Date: Tue, 10 Mar 2026 22:01:13 GMT
Status: translation complete
System update date: 2026-03-12 16:22:32.705851
Title: SiMPO: Measure Matching for Online Diffusion Reinforcement Learning
Title (reference translation): SiMPO: Measure Matching for Online Diffusion Reinforcement Learning
Authors: Haitong Ma, Chenxiao Gao, Tianyi Chen, Na Li, Bo Dai
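Abstract summary: We introduce SiMPO, a simple and unified framework that generalizes reweighting schemes in diffusion RL using general monotonic functions. SiMPO revisits diffusion RL through a two-stage measure matching lens. We provide a geometric interpretation showing that negative reweighting actively repels the policy from suboptimal actions.
Reference score (independently computed attention score): 52.46919717963149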
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A commonly used family of RL algorithms for diffusion policies conducts softmax reweighting over the behavior policy, which usually induces an over-greedy policy and fails to leverage feedback from negative samples. In this work, we introduce Signed Measure Policy Optimization (SiMPO), a simple and unified framework that generalizes reweighting schemes in diffusion RL with general monotonic functions. SiMPO revisits diffusion RL via a two-stage measure matching lens. First, we construct a virtual target policy by $f$-divergence regularized policy optimization, where we can relax the non-negativity constraint to allow for a signed target measure. Second, we use this signed measure to guide diffusion or flow models through reweighted matching. This formulation offers two key advantages: a) it generalizes to arbitrary monotonically increasing weighting functions; and b) it provides a principled justification and practical guidance for negative reweighting. Furthermore, we provide geometric interpretations to illustrate how negative reweighting actively repels the policy from suboptimal actions. Extensive empirical evaluations demonstrate that SiMPO achieves superior performance by leveraging these flexible weighting schemes, and we provide practical guidelines for selecting reweighting methods tailored to the reward landscape.
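Abstract (reference translation): A family of RL algorithms commonly used for diffusion policies applies softmax reweighting to the behavior policy, which usually induces an overly greedy policy and cannot exploit feedback from negative samples. In this paper, we introduce Signed Measure Policy Optimization (SiMPO), a simple and unified framework that generalizes reweighting schemes in diffusion RL using general monotonic functions. SiMPO revisits diffusion RL through a two-stage measure matching lens. First, a virtual target policy is constructed via $f$-divergence regularized policy optimization, where the non-negativity constraint is relaxed to permit a signed target measure. Second, this signed measure is used to guide diffusion or flow models through reweighted matching. This formulation has two main advantages: a) it generalizes to arbitrary monotonically increasing weighting functions; and b) it provides a principled justification and practical guidance for negative reweighting. In addition, geometric interpretations are provided to show how negative reweighting actively repels the policy from suboptimal actions. Extensive empirical evaluation shows that SiMPO achieves superior performance by exploiting these flexible weighting schemes, and practical guidelines are given for selecting reweighting methods suited to the reward landscape.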
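
The contrast the abstract draws between softmax reweighting and signed reweighting can be illustrated with a small toy sketch. This is not the paper's implementation: the function names, the centered linear weighting, and the sample rewards are all hypothetical choices made for illustration. The sketch only assumes what the abstract states, namely that weights come from a monotonically increasing function of the reward and that they may be negative.

```python
import math

def softmax_weights(rewards, temperature=1.0):
    """Softmax reweighting: every weight is positive and the weights sum to 1."""
    m = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - m) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]

def signed_linear_weights(rewards, scale=1.0):
    """Hypothetical monotonic weighting that can go negative.

    Rewards are centered on their mean, so below-average samples
    receive negative weights. This is an illustrative choice, not
    a weighting function taken from the paper.
    """
    mean = sum(rewards) / len(rewards)
    return [(r - mean) / scale for r in rewards]

def reweighted_matching_loss(weights, per_sample_losses):
    """Weighted average of per-sample matching losses.

    A negative weight flips the sign of that sample's contribution,
    so minimizing the total pushes the model away from that sample.
    """
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / len(weights)

rewards = [0.0, 1.0, 2.0, 5.0]          # made-up rewards for four sampled actions
soft = softmax_weights(rewards)          # concentrates mass on the best sample
signed = signed_linear_weights(rewards)  # negative for below-average samples
print(soft)
print(signed)
```

With softmax weights the lowest-reward samples are merely down-weighted toward zero, whereas the signed variant assigns them negative weights, which is the mechanism the abstract describes as repelling the policy from suboptimal actions.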

Related papers