Fugu-MT Paper Translation (Abstract): Rich Human Feedback for Text-to-Image Generation

Paper overview: Rich Human Feedback for Text-to-Image Generation

arxiv url: http://arxiv.org/abs/2312.10240v1
Date: Fri, 15 Dec 2023 22:18:38 GMT
Status: Translation complete
Last updated in system: 2023-12-19 17:51:04.910574
Title: Rich Human Feedback for Text-to-Image Generation
Title (reference translation): Rich Human Feedback for Text-to-Image Generation
Authors: Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, Junjie Ke, Krishnamurthy Dj Dvijotham, Katie Collins, Yiwen Luo, Yang Li, Kai J Kohlhoff, Deepak Ramachandran, and Vidhya Navalpakkam
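Abstract summary: We collect rich feedback on 18K generated images and train a multimodal transformer to predict that feedback automatically. We show that the predicted rich human feedback can be used to improve image generation, for example by selecting high-quality training data to finetune and improve the generative models.
Reference score (attention score computed independently by this site): 27.030777546301376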
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for large language models, prior works collected human-provided scores as feedback on generated images and trained a reward model to improve the T2I generation. In this paper, we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text, and (ii) annotating which words in the text prompt are misrepresented or missing on the image. We collect such rich human feedback on 18K generated images and train a multimodal transformer to predict the rich feedback automatically. We show that the predicted rich human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data to finetune and improve the generative models, or by creating masks with predicted heatmaps to inpaint the problematic regions. Notably, the improvements generalize to models (Muse) beyond those used to generate the images on which human feedback data were collected (Stable Diffusion variants).
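Abstract (reference translation): Recent text-to-image (T2I) generation models have made significant progress in generating high-resolution images from text descriptions. However, many generated images still suffer from problems such as artifacts or implausibility, misalignment with the text description, and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for large language models, prior work collected human-provided scores as feedback on generated images and trained a reward model to improve T2I generation. This paper enriches the feedback signal by (i) marking image regions that are implausible or misaligned with the text, and (ii) annotating which words in the text prompt are misrepresented or missing in the image. We collect such rich feedback on 18K generated images and train a multimodal transformer to predict it automatically. We show that the predicted feedback can be used to improve image generation, for example by selecting high-quality training data to finetune and improve generative models, or by creating masks from predicted heatmaps to inpaint problematic regions. Notably, the improvements generalize to a model (Muse) other than those used to generate the images on which the human feedback was collected (Stable Diffusion variants).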
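
The two uses of predicted feedback named in the abstract (selecting high-quality training data, and creating inpainting masks from predicted heatmaps) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the function names `heatmap_to_mask` and `select_high_quality`, the fixed thresholds, the convention that heatmap values and scores lie in [0, 1], and the rule that a mask value of 1 marks a pixel to inpaint. The abstract does not state how the authors derive masks or set cutoffs.

```python
from typing import List, Sequence, Tuple

# Hypothetical type aliases: a heatmap is a 2D grid of values in [0, 1],
# and a mask is a 2D grid of 0/1 flags with the same shape.
Heatmap = List[List[float]]
Mask = List[List[int]]


def heatmap_to_mask(heatmap: Heatmap, threshold: float = 0.5) -> Mask:
    """Binarize a predicted heatmap: 1 marks a pixel to inpaint, 0 a pixel to keep.

    The threshold value is an assumption, not a number from the paper.
    """
    return [[1 if value >= threshold else 0 for value in row] for row in heatmap]


def select_high_quality(
    samples: Sequence[Tuple[str, float]], min_score: float
) -> List[str]:
    """Return the ids of samples whose predicted score is at least min_score.

    Each sample is an (id, predicted_score) pair; order is preserved.
    """
    return [sample_id for sample_id, score in samples if score >= min_score]


if __name__ == "__main__":
    # Toy 3x3 heatmap standing in for a model's implausibility prediction.
    predicted_heatmap = [
        [0.05, 0.10, 0.20],
        [0.10, 0.80, 0.90],
        [0.00, 0.60, 0.30],
    ]
    print(heatmap_to_mask(predicted_heatmap, threshold=0.5))

    # Toy (image id, predicted score) pairs for data filtering.
    scored = [("img_a", 0.91), ("img_b", 0.42), ("img_c", 0.77)]
    print(select_high_quality(scored, min_score=0.7))
```

In practice the resulting mask would be passed, together with the image and prompt, to an inpainting model, and the selected ids would define a finetuning subset; both steps are outside the scope of this sketch.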
