Fugu-MT Paper Translation (Abstract): SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

Paper summary: SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

arxiv url: http://arxiv.org/abs/2604.12617v2
Date: Fri, 17 Apr 2026 10:49:33 GMT
Status: Translation complete
Last updated in system: 2026-04-20 13:38:49.295728
Title: SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models
Authors: You Qin, Linqing Wang, Hao Fei, Roger Zimmermann, Liefeng Bo, Qinglin Lu, Chunyu Wang
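Abstract summary: The post-training pipeline for diffusion models has two stages: supervised fine-tuning (SFT) on curated data and reinforcement learning (RL) with reward models. This paper proposes SOAR (Self-Correction for Optimal Alignment and Refinement), a bias-correction post-training method that fills the gap between them. The method is on-policy, reward-free, and provides dense per-timestep supervision with no credit-assignment problem.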
Reference score (independently computed attention score): 48.335262141752715
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The post-training pipeline for diffusion models currently has two stages: supervised fine-tuning (SFT) on curated data and reinforcement learning (RL) with reward models. A fundamental gap separates them. SFT optimizes the denoiser only on ground-truth states sampled from the forward noising process; once inference deviates from these ideal states, subsequent denoising relies on out-of-distribution generalization rather than learned correction, exhibiting the same exposure bias that afflicts autoregressive models, but accumulated along the denoising trajectory instead of the token sequence. RL can in principle address this mismatch, yet its terminal reward signal is sparse, suffers from credit-assignment difficulty, and risks reward hacking. We propose SOAR (Self-Correction for Optimal Alignment and Refinement), a bias-correction post-training method that fills this gap. Starting from a real sample, SOAR performs a single stop-gradient rollout with the current model, re-noises the resulting off-trajectory state, and supervises the model to steer back toward the original clean target. The method is on-policy, reward-free, and provides dense per-timestep supervision with no credit-assignment problem. On SD3.5-Medium, SOAR improves GenEval from 0.70 to 0.78 and OCR from 0.64 to 0.67 over SFT, while simultaneously raising all model-based preference scores. In controlled reward-specific experiments, SOAR surpasses Flow-GRPO in final metric value on both aesthetic and text-image alignment tasks, despite having no access to a reward model. Since SOAR's base loss subsumes the standard SFT objective, it can directly replace SFT as a stronger first post-training stage after pretraining, while remaining fully compatible with subsequent RL alignment.
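The rollout, re-noise, and correct steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the loss formula, so everything concrete here is an assumption, including the linear noising path x_t = (1 - t) * x0 + t * eps, the one-step clean estimate used as the "rollout", the form of the correction target, and the hypothetical names `toy_velocity`, `renoise`, `correction_target`, and `soar_style_loss`. The denoiser is replaced by a one-parameter toy function so the example runs with the standard library alone.

```python
import random


def toy_velocity(theta, x, t):
    """Hypothetical stand-in for the denoiser: predicts velocity as theta * x.

    A real model would be a neural network conditioned on t; t is unused here.
    """
    return [theta * xi for xi in x]


def renoise(x_clean, t, eps):
    """Assumed linear noising path: x_t = (1 - t) * x + t * eps."""
    return [(1.0 - t) * a + t * e for a, e in zip(x_clean, eps)]


def correction_target(x_state, x0, t):
    """Velocity v such that x_state - t * v == x0, i.e. pointing back to x0."""
    return [(s - a) / t for s, a in zip(x_state, x0)]


def soar_style_loss(theta, x0, t, t_renoise, rng):
    """One loss evaluation of the assumed rollout / re-noise / correct recipe."""
    dim = len(x0)
    # 1. Ground-truth noisy state from the forward process (what SFT trains on).
    eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    x_t = renoise(x0, t, eps)
    # 2. Rollout with the current model. These are plain floats, so the result
    #    is a constant; in an autodiff framework this step would be wrapped in
    #    a stop-gradient (e.g. torch.no_grad or jax.lax.stop_gradient).
    v_roll = toy_velocity(theta, x_t, t)
    x0_hat = [xt - t * v for xt, v in zip(x_t, v_roll)]
    # 3. Re-noise the model's own (possibly wrong) estimate: off-trajectory state.
    eps_new = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    x_off = renoise(x0_hat, t_renoise, eps_new)
    # 4. Supervise the prediction at x_off to steer back to the real x0.
    target = correction_target(x_off, x0, t_renoise)
    pred = toy_velocity(theta, x_off, t_renoise)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / dim


rng = random.Random(0)
loss = soar_style_loss(theta=0.3, x0=[1.0, -2.0, 0.5], t=0.7, t_renoise=0.4, rng=rng)
print(f"toy correction loss: {loss:.4f}")
```

In this sketch, if the rollout happens to recover x0 exactly, the correction target reduces to eps_new - x0, the usual velocity target on the linear path. That matches the spirit of the abstract's remark that the base loss subsumes the standard SFT objective, although the paper's actual formulation may differ.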
