Fugu-MT Paper Translation (Abstract): MARBLE: Multi-Aspect Reward Balance for Diffusion RL

Paper abstract: MARBLE: Multi-Aspect Reward Balance for Diffusion RL

arxiv url: http://arxiv.org/abs/2605.06507v1
Date: Thu, 07 May 2026 16:20:42 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:11.988734
Title: MARBLE: Multi-Aspect Reward Balance for Diffusion RL
Title (reference translation): MARBLE: Multi-Aspect Reward Balance for Diffusion RL
Authors: Canyu Zhao, Hao Chen, Yunze Tong, Yu Qiao, Jiacheng Li, Chunhua Shen
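Abstract summary: Reinforcement learning has become the dominant approach for aligning diffusion models with human preferences. Existing practice handles multiple rewards by training one specialist model per reward. We propose MARBLE, a gradient-space optimization framework that maintains an independent advantage estimator for each reward.
Reference score (independently computed attention score): 71.6241143519038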
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, assessing images is intrinsically a multi-dimensional task, and multiple evaluation criteria need to be optimized simultaneously. Existing practice deals with multiple rewards by training one specialist model per reward, optimizing a weighted-sum reward $R(x)=\sum_k w_k R_k(x)$, or sequentially fine-tuning with a hand-crafted stage schedule. These approaches either fail to produce a unified model that can be jointly trained on all rewards or necessitate heavy, manually tuned sequential training. We find that the failure stems from using a naive weighted-sum reward aggregation. This approach suffers from a sample-level mismatch because most rollouts are specialist samples, highly informative for certain reward dimensions but irrelevant for others; consequently, weighted summation dilutes their supervision. To address this issue, we propose MARBLE (Multi-Aspect Reward BaLancE), a gradient-space optimization framework that maintains independent advantage estimators for each reward, computes per-reward policy gradients, and harmonizes them into a single update direction by solving a quadratic programming problem, without manually tuned reward weighting. We further propose an amortized formulation that exploits the affine structure of the loss used in DiffusionNFT to reduce the per-step cost from K+1 backward passes to near the single-reward baseline cost, together with EMA smoothing on the balancing coefficients to stabilize updates against transient single-batch fluctuations. On SD3.5 Medium with five rewards, MARBLE improves all five reward dimensions simultaneously, turns the worst-aligned reward's gradient cosine, which is negative in 80% of mini-batches under weighted summation, consistently positive, and runs at 0.97x the training speed of baseline training.
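Abstract (reference translation): Fine-tuning with reinforcement learning has become the main approach for aligning diffusion models with human preferences. Image evaluation, however, is inherently a multi-dimensional task, and several evaluation criteria must be optimized at the same time. Existing practice handles multiple rewards by training one specialist model per reward, by optimizing a weighted-sum reward $R(x)=\sum_k w_k R_k(x)$, or by fine-tuning sequentially with a hand-crafted stage schedule. These approaches either fail to produce a unified model that can be trained jointly on all rewards or require manually tuned sequential training. The cause of the failure is the use of naive weighted-sum reward aggregation. This approach suffers from a sample-level mismatch: most rollouts are specialist samples that are highly informative for certain reward dimensions but irrelevant for the others, so weighted summation dilutes their supervision. To address this problem, we propose MARBLE (Multi-Aspect Reward BaLancE), a gradient-space optimization framework that maintains an independent advantage estimator for each reward, computes per-reward policy gradients, and harmonizes them into a single update direction by solving a quadratic programming problem, without manually tuned reward weights. We further propose an amortized formulation that exploits the affine structure of the loss in DiffusionNFT to reduce the per-step cost from K+1 backward passes to nearly the single-reward baseline cost, and we apply EMA smoothing to the balancing coefficients to stabilize updates against transient single-batch fluctuations. On SD3.5 Medium with five rewards, MARBLE improves all five reward dimensions simultaneously, turns the worst-aligned reward's gradient cosine from negative in 80% of mini-batches under weighted summation to consistently positive, and runs at 0.97x the training speed of the baseline.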
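The abstract names three ingredients: per-reward advantages, a quadratic program that merges per-reward gradients into one direction, and EMA smoothing of the resulting coefficients. The following is a minimal sketch of how those pieces could fit together, not the authors' implementation. It rests on assumptions that the abstract does not confirm: the QP is taken to be the min-norm problem over the probability simplex (as in multiple-gradient-descent methods), it is solved here with Frank-Wolfe, advantages are standardized within a rollout group, and all function names (`standardize`, `min_norm_weights`, `ema`, `combine`) are hypothetical. The amortized DiffusionNFT formulation is not sketched.

```python
import math

def standardize(rewards):
    """Assumed per-reward advantage: center and scale ONE reward's scores in a group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + 1e-8) for r in rewards]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def combine(grads, weights):
    """Weighted sum of gradient vectors (one flat vector per reward)."""
    dim = len(grads[0])
    return [sum(w * g[i] for w, g in zip(weights, grads)) for i in range(dim)]

def min_norm_weights(grads, iters=100, tol=1e-12):
    """Frank-Wolfe for the assumed QP:
    minimize ||sum_k alpha_k g_k||^2  subject to alpha >= 0, sum(alpha) = 1."""
    k = len(grads)
    gram = [[dot(gi, gj) for gj in grads] for gi in grads]
    alpha = [1.0 / k] * k
    for _ in range(iters):
        g_alpha = [dot(row, alpha) for row in gram]          # Gram @ alpha
        t = min(range(k), key=lambda i: g_alpha[i])          # best simplex vertex
        d = [(1.0 if i == t else 0.0) - alpha[i] for i in range(k)]
        curvature = dot(d, [dot(row, d) for row in gram])    # d^T Gram d
        if curvature <= tol:
            break
        # Exact line search along d, clipped so alpha stays on the simplex.
        step = max(0.0, min(1.0, -dot(d, g_alpha) / curvature))
        if step <= tol:
            break
        alpha = [a + step * di for a, di in zip(alpha, d)]
    return alpha

def ema(prev, new, beta=0.9):
    """Smooth balancing coefficients across mini-batches (beta is an assumed value)."""
    return [beta * p + (1.0 - beta) * n for p, n in zip(prev, new)]

# Toy example with two conflicting per-reward gradients.
g1, g2 = [1.0, 0.0], [-3.0, 1.0]
naive = combine([g1, g2], [0.5, 0.5])       # equal-weight sum
alpha = min_norm_weights([g1, g2])          # balanced coefficients
balanced = combine([g1, g2], alpha)
```

In this toy case the equal-weight direction `naive` has a negative inner product with `g1`, so following it would work against the first reward, while `balanced` has a positive inner product with both gradients. Because `ema` forms a convex combination of two coefficient vectors, the smoothed coefficients remain non-negative and sum to one.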
