Fugu-MT Paper Translation (Abstract): TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

Paper abstract: TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

arxiv url: http://arxiv.org/abs/2605.10983v2
Date: Wed, 13 May 2026 08:00:49 GMT
Status: Translation complete
Last updated in system: 2026-05-14 17:13:58.86994
Title: TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
Title (reference translation): TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
Authors: Jiaming Li, Chenyu Zhu, Nanxi Yi, Youjun Bao, Li Sun, Quanying Lv, Xiang Fang, Daizong Liu, Jianjun Li, Kun He, Bowen Zhou, Zhiyuan Ma
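Abstract summary: This paper proposes Trajectory Matching Policy Optimization (TMPO), which replaces scalar reward maximization with trajectory-level reward distribution matching. TMPO improves generative diversity over state-of-the-art methods by 9.1% and achieves competitive performance on downstream and efficiency metrics. To reduce multi-trajectory training time on large-scale flow-matching models, TMPO introduces Dynamic Stochastic Tree Sampling, in which trajectories share denoising prefixes and branch at dynamically scheduled steps.
Reference score (independently computed attention score): 52.570581883709345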
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) has shown extraordinary potential in aligning diffusion models to downstream tasks, yet most RL-based alignment methods still suffer from significant reward hacking, which degrades generative diversity and quality by inducing visual mode collapse and amplifying unreliable rewards. We identify the root cause as the mode-seeking nature of these methods, which maximize expected reward without effectively constraining the probability distribution over acceptable trajectories, causing concentration on a few high-reward paths. In contrast, we propose Trajectory Matching Policy Optimization (TMPO), which replaces scalar reward maximization with trajectory-level reward distribution matching. Specifically, TMPO introduces a Softmax Trajectory Balance (Softmax-TB) objective to match the policy probabilities of K trajectories to a reward-induced Boltzmann distribution. We prove that this objective inherits the mode-covering property of the forward KL divergence, preserving coverage over all acceptable trajectories while optimizing reward. To further reduce multi-trajectory training time on large-scale flow-matching models, TMPO incorporates Dynamic Stochastic Tree Sampling, where trajectories share denoising prefixes and branch at dynamically scheduled steps, reducing redundant computation while improving training effectiveness. Extensive results across diverse alignment tasks such as human preference, compositional generation, and text rendering show that TMPO improves generative diversity over state-of-the-art methods by 9.1% and achieves competitive performance in all downstream and efficiency metrics, attaining the optimal trade-off between reward and diversity.
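The abstract gives no formula for the Softmax-TB objective, so the following is only a minimal sketch of the general idea it describes: normalize the policy log-probabilities of K sampled trajectories with a softmax and compare them, via a forward KL divergence, to a Boltzmann distribution induced by the rewards. The function names, the `temperature` parameter, and the exact form of the loss are assumptions made for illustration and may differ from the paper's actual objective.

```python
import numpy as np


def log_softmax(x):
    """Numerically stable log-softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    shifted = x - x.max()
    return shifted - np.log(np.exp(shifted).sum())


def softmax_matching_loss(traj_logprobs, rewards, temperature=1.0):
    """Forward KL(target || policy) over K sampled trajectories.

    traj_logprobs: log pi_theta(tau_i) for each of the K trajectories.
    rewards:       scalar reward r_i for each trajectory.
    temperature:   hypothetical Boltzmann temperature (an assumption here).
    """
    # Reward-induced Boltzmann target: p_i proportional to exp(r_i / temperature).
    log_target = log_softmax(np.asarray(rewards, dtype=float) / temperature)
    # Policy probabilities renormalized over the K sampled trajectories.
    log_policy = log_softmax(traj_logprobs)
    target = np.exp(log_target)
    # Forward KL weights every trajectory by its target mass, so the policy is
    # penalized for assigning low probability to any acceptable trajectory.
    return float(np.sum(target * (log_target - log_policy)))


if __name__ == "__main__":
    rewards = [1.0, 2.0, 3.0]
    # A policy that spreads mass uniformly is penalized relative to the target.
    print(softmax_matching_loss([0.0, 0.0, 0.0], rewards))
    # A policy whose log-probabilities equal r_i / temperature up to a constant
    # matches the target exactly.
    print(softmax_matching_loss([11.0, 12.0, 13.0], rewards))
```

In this sketch the loss is zero only when the renormalized policy probabilities equal the Boltzmann target, which is the sense in which a forward-KL-style objective is mode-covering rather than mode-seeking.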
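Abstract (reference translation): Reinforcement learning (RL) has shown great potential for aligning diffusion models with downstream tasks, yet most such methods suffer from severe reward hacking, which induces visual mode collapse, amplifies unreliable rewards, and so degrades generative diversity and quality. We identify the root cause as the mode-seeking nature of these methods: they maximize expected reward without effectively constraining the probability distribution over acceptable trajectories, so probability mass concentrates on a few high-reward paths. In contrast, we propose Trajectory Matching Policy Optimization (TMPO), which replaces scalar reward maximization with trajectory-level reward distribution matching. Specifically, TMPO introduces a Softmax Trajectory Balance (Softmax-TB) objective that matches the policy probabilities of K trajectories to a reward-induced Boltzmann distribution. We prove that this objective inherits the mode-covering property of the forward KL divergence, preserving coverage of all acceptable trajectories while optimizing reward. To further reduce multi-trajectory training time on large-scale flow-matching models, TMPO incorporates Dynamic Stochastic Tree Sampling, in which trajectories share denoising prefixes and branch at dynamically scheduled steps, reducing redundant computation while improving training effectiveness. Extensive results on diverse alignment tasks such as human preference, compositional generation, and text rendering show that TMPO improves generative diversity over state-of-the-art methods by 9.1% and achieves competitive performance on all downstream and efficiency metrics, attaining the optimal trade-off between reward and diversity.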
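To illustrate why sharing denoising prefixes can reduce computation, the toy count below compares the number of denoising-step evaluations for a tree of trajectories against the same number of fully independent trajectories. It is only a sketch under stated assumptions: the branch steps are a fixed, hypothetical schedule (the paper schedules them dynamically), the branching factor is uniform, and no actual diffusion model or stochastic sampler is involved.

```python
def tree_step_count(num_steps, branch_steps, branch_factor=2):
    """Count denoising-step evaluations when trajectories share prefixes.

    num_steps:     total denoising steps per trajectory (hypothetical value).
    branch_steps:  step indices at which every live path splits
                   (a fixed schedule here; an assumption for illustration).
    branch_factor: number of children created at each branch step.
    Returns (total_evaluations, num_leaf_trajectories).
    """
    branch_steps = set(branch_steps)
    paths = 1          # a single shared path until the first branch
    evaluations = 0
    for step in range(num_steps):
        if step in branch_steps:
            paths *= branch_factor   # each live path splits before this step
        evaluations += paths         # one model call per live path at this step
    return evaluations, paths


def independent_step_count(num_steps, num_trajectories):
    """Evaluations needed if every trajectory is denoised from scratch."""
    return num_steps * num_trajectories


if __name__ == "__main__":
    tree_cost, leaves = tree_step_count(num_steps=10, branch_steps=[2, 5])
    flat_cost = independent_step_count(num_steps=10, num_trajectories=leaves)
    print(f"{leaves} trajectories: tree={tree_cost} evaluations, "
          f"independent={flat_cost} evaluations")
```

With these hypothetical settings, steps 0 and 1 are evaluated once, steps 2 to 4 twice, and steps 5 to 9 four times, so the four leaf trajectories cost 28 evaluations instead of 40. Branching later saves more computation but leaves the trajectories less time to diverge, which is presumably the trade-off a dynamic schedule would need to manage.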

Related papers