Fugu-MT paper translation (abstract): Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model

Paper overview: Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model

arxiv url: http://arxiv.org/abs/2311.13231v3
Date: Sat, 23 Mar 2024 05:23:00 GMT
Status: translation complete
System update date: 2024-03-27 02:25:46.264280
Title: Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model
Authors: Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Qimai Li, Weihan Shen, Xiaolong Zhu, Xiu Li
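Abstract summary: This paper presents the Direct Preference for Denoising Diffusion Policy Optimization (D3PO) method for fine-tuning diffusion models. D3PO omits training a reward model, yet effectively functions as the optimal reward model trained on human feedback data. In experiments, the method uses the relative scale of objectives as a proxy for human preference and delivers results comparable to methods that use ground-truth rewards.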
Reference score (independently computed attention score): 38.25406127216304
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Using reinforcement learning with human feedback (RLHF) has shown significant promise in fine-tuning diffusion models. Previous methods start by training a reward model that aligns with human preferences, then leverage RL techniques to fine-tune the underlying models. However, crafting an efficient reward model demands extensive datasets, optimal architecture, and manual hyperparameter tuning, making the process both time and cost-intensive. The direct preference optimization (DPO) method, effective in fine-tuning large language models, eliminates the necessity for a reward model. However, the extensive GPU memory requirement of the diffusion model's denoising process hinders the direct application of the DPO method. To address this issue, we introduce the Direct Preference for Denoising Diffusion Policy Optimization (D3PO) method to directly fine-tune diffusion models. The theoretical analysis demonstrates that although D3PO omits training a reward model, it effectively functions as the optimal reward model trained using human feedback data to guide the learning process. This approach requires no training of a reward model, proving to be more direct, cost-effective, and minimizing computational overhead. In experiments, our method uses the relative scale of objectives as a proxy for human preference, delivering comparable results to methods using ground-truth rewards. Moreover, D3PO demonstrates the ability to reduce image distortion rates and generate safer images, overcoming challenges lacking robust reward models. Our code is publicly available at https://github.com/yk7333/D3PO.
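As a rough illustration of the reward-model-free idea described in the abstract, the sketch below computes a DPO-style pairwise loss for a single denoising step, using only log-probabilities under the fine-tuned and reference models. The function names, argument names, and the default `beta` are assumptions made for illustration; they are not taken from the paper or from the linked repository, and applying the loss per denoising step is a reading of the abstract, not a confirmed implementation detail.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_step_loss(
    logp_win: float,
    logp_win_ref: float,
    logp_lose: float,
    logp_lose_ref: float,
    beta: float = 0.1,  # hypothetical default; controls deviation from the reference model
) -> float:
    """DPO-style loss for one preferred/rejected pair (illustrative only).

    logp_win / logp_lose: log-probability of the preferred / rejected sample
        under the model being fine-tuned.
    logp_win_ref / logp_lose_ref: the same quantities under the frozen
        reference model.
    """
    # Log-ratio of fine-tuned to reference model for each sample.
    ratio_win = logp_win - logp_win_ref
    ratio_lose = logp_lose - logp_lose_ref
    # The implicit reward margin replaces an explicitly trained reward model.
    margin = beta * (ratio_win - ratio_lose)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # Identical log-ratios give zero margin, so the loss is log(2).
    print(dpo_step_loss(0.0, 0.0, 0.0, 0.0))
    # Raising the preferred sample's ratio above the rejected one lowers the loss.
    print(dpo_step_loss(-1.0, -2.0, -2.0, -1.0))
```

The loss falls as the fine-tuned model assigns relatively more probability to the preferred sample than the reference model does, which is how a preference signal can steer training without a separate reward network.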


Related papers