Fugu-MT Paper Translation (Abstract): Enhancing Diffusion Models with Text-Encoder Reinforcement Learning

Paper summary: Enhancing Diffusion Models with Text-Encoder Reinforcement Learning

arxiv url: http://arxiv.org/abs/2311.15657v2
Date: Wed, 17 Jul 2024 05:52:37 GMT
Status: Translation complete
System update date: 2024-07-18 23:08:38.959494
Title: Enhancing Diffusion Models with Text-Encoder Reinforcement Learning
Title (reference translation): Enhancing Diffusion Models via Text-Encoder Reinforcement Learning
Authors: Chaofeng Chen, Annan Wang, Haoning Wu, Liang Liao, Wenxiu Sun, Qiong Yan, Weisi Lin
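Abstract summary: Text-to-image diffusion models are typically trained to optimize a log-likelihood objective. Recent research addresses the limitations of this objective by refining the diffusion U-Net with human rewards, through reinforcement learning or direct backpropagation. We show that fine-tuning the text encoder through reinforcement learning can enhance the text-image alignment of the results.
Reference score (independently computed attention score): 63.41513909279474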
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Text-to-image diffusion models are typically trained to optimize the log-likelihood objective, which presents challenges in meeting specific requirements for downstream tasks, such as image aesthetics and image-text alignment. Recent research addresses this issue by refining the diffusion U-Net using human rewards through reinforcement learning or direct backpropagation. However, many of these works overlook the importance of the text encoder, which is typically pretrained and fixed during training. In this paper, we demonstrate that by fine-tuning the text encoder through reinforcement learning, we can enhance the text-image alignment of the results, thereby improving the visual quality. Our primary motivation comes from the observation that the current text encoder is suboptimal, often requiring careful prompt adjustment. While fine-tuning the U-Net can partially improve performance, it still suffers from the suboptimal text encoder. Therefore, we propose to use reinforcement learning with low-rank adaptation to fine-tune the text encoder based on task-specific rewards, referred to as TexForce. We first show that fine-tuning the text encoder can improve the performance of diffusion models. Then, we illustrate that TexForce can be simply combined with existing U-Net fine-tuned models to get much better results without additional training. Finally, we showcase the adaptability of our method in diverse applications, including the generation of high-quality face and hand images.
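Abstract (reference translation): Text-to-image diffusion models are usually trained to optimize a log-likelihood objective, which poses challenges in meeting the specific requirements of downstream tasks such as image aesthetics and image-text alignment. Recent studies address this problem by refining the diffusion U-Net with human rewards through reinforcement learning or direct backpropagation. However, many of them neglect the importance of the text encoder. In this paper, we demonstrate that fine-tuning the text encoder through reinforcement learning strengthens the text-image alignment of the results and improves visual quality. Our main motivation comes from the observation that the current text encoder is suboptimal and often requires careful prompt adjustment. Fine-tuning the U-Net partially improves performance, but it still suffers from the suboptimal text encoder. We therefore propose reinforcement learning with low-rank adaptation to fine-tune the text encoder based on task-specific rewards. First, we show that fine-tuning the text encoder improves the performance of diffusion models. Next, we show that TexForce can be easily combined with existing U-Net fine-tuned models to obtain better results without additional training. Finally, we demonstrate the adaptability of the method in a variety of applications, including the generation of high-quality face and hand images.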
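
Illustration (not from the paper): the abstract mentions low-rank adaptation as the mechanism for fine-tuning the text encoder. The sketch below shows only that general idea in plain Python: a frozen weight matrix W is left untouched and a small trainable update B·A is added to it. The class name `LoRALinear`, the helper `matvec`, and the parameters `rank` and `alpha` are hypothetical names chosen for this example; this is not the authors' implementation, and the reinforcement-learning reward loop is omitted entirely.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A.

    Hypothetical illustration of low-rank adaptation; only A and B would be
    updated during fine-tuning, while W keeps its pretrained values.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weight
        self.scale = alpha / rank     # scaling applied to the low-rank update
        # A starts with small random values and B with zeros, so the
        # adapted layer initially behaves exactly like the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)               # W x
        delta = matvec(self.B, matvec(self.A, x))   # B (A x)
        return [b + self.scale * d for b, d in zip(base, delta)]

    def num_trainable(self):
        """Number of parameters in A and B (the only trainable ones)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Usage: a 4x4 identity weight adapted with a rank-1 update.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(W, rank=1)
x = [1.0, 2.0, 3.0, 4.0]
initial_output = layer.forward(x)  # equals W x because B is all zeros
```

In this toy setting the rank-1 adapter has 8 trainable values against 16 in the full matrix; the saving grows with layer size, which is one reason a low-rank update is a plausible fit for adapting a large pretrained text encoder.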
