Fugu-MT Paper Translation (Abstract): Freestyle Layout-to-Image Synthesis

Paper overview: Freestyle Layout-to-Image Synthesis

arxiv url: http://arxiv.org/abs/2303.14412v1
Date: Sat, 25 Mar 2023 09:37:41 GMT
Status: Translation complete
Last updated in system: 2023-03-28 19:52:51.924227
Title: Freestyle Layout-to-Image Synthesis
Authors: Han Xue, Zhiwu Huang, Qianru Sun, Li Song, Wenjun Zhang
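Abstract summary: This work explores the freestyle capability of the model, i.e., how far it can generate unseen semantics onto a given layout. To generate unseen semantics, the authors leverage large-scale pre-trained text-to-image diffusion models. The proposed diffusion network produces realistic and freestyle layout-to-image generation results with diverse text inputs.
Reference score (attention score computed independently by this site): 42.64485133926378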
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Typical layout-to-image synthesis (LIS) models generate images for a closed set of semantic classes, e.g., 182 common objects in COCO-Stuff. In this work, we explore the freestyle capability of the model, i.e., how far can it generate unseen semantics (e.g., classes, attributes, and styles) onto a given layout, and call the task Freestyle LIS (FLIS). Thanks to the development of large-scale pre-trained language-image models, a number of discriminative models (e.g., image classification and object detection) trained on limited base classes are empowered with the ability of unseen class prediction. Inspired by this, we opt to leverage large-scale pre-trained text-to-image diffusion models to achieve the generation of unseen semantics. The key challenge of FLIS is how to enable the diffusion model to synthesize images from a specific layout which very likely violates its pre-learned knowledge, e.g., the model never sees "a unicorn sitting on a bench" during its pre-training. To this end, we introduce a new module called Rectified Cross-Attention (RCA) that can be conveniently plugged in the diffusion model to integrate semantic masks. This "plug-in" is applied in each cross-attention layer of the model to rectify the attention maps between image and text tokens. The key idea of RCA is to enforce each text token to act on the pixels in a specified region, allowing us to freely put a wide variety of semantics from pre-trained knowledge (which is general) onto the given layout (which is specific). Extensive experiments show that the proposed diffusion network produces realistic and freestyle layout-to-image generation results with diverse text inputs, which has a high potential to spawn a bunch of interesting applications. Code is available at https://github.com/essunny310/FreestyleNet.
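The key idea of RCA stated in the abstract, restricting each text token to the pixels of its assigned region, can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes the rectification is done by setting the attention logits outside a token's region to negative infinity before the softmax over tokens, and the function name `rectified_cross_attention`, the argument names, and the boolean `token_region_mask` layout are all hypothetical.

```python
import numpy as np


def rectified_cross_attention(queries, keys, values, token_region_mask):
    """Illustrative masked cross-attention (an assumption, not the paper's code).

    queries:           (P, d)  one feature vector per pixel
    keys:              (T, d)  one key vector per text token
    values:            (T, dv) one value vector per text token
    token_region_mask: (P, T)  True where token t is allowed to act on pixel p
    """
    if not token_region_mask.any(axis=-1).all():
        # A pixel with no allowed token would give an undefined softmax.
        raise ValueError("every pixel must be covered by at least one token")

    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)                  # (P, T)

    # Rectification step (assumed form): block tokens outside their region.
    logits = np.where(token_region_mask, logits, -np.inf)

    # Numerically stable softmax over the token axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ values, weights


# Toy example: a 2x2 layout with two regions, one text token per region.
layout = np.array([[0, 0],
                   [1, 1]])
mask = layout.reshape(-1, 1) == np.arange(2)[None, :]      # (4 pixels, 2 tokens)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(2, 8))
v = rng.normal(size=(2, 8))
out, attn = rectified_cross_attention(q, k, v, mask)
```

In this toy setup each pixel has exactly one permitted token, so every row of `attn` is one-hot and each pixel simply receives the value vector of its region's token. When several tokens describe the same region, the softmax distributes attention among them only.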

Related papers