Fugu-MT Paper Translation (Summary): UPainting: Unified Text-to-Image Diffusion Generation with Cross-modal Guidance

Paper summary: UPainting: Unified Text-to-Image Diffusion Generation with Cross-modal Guidance

arxiv url: http://arxiv.org/abs/2210.16031v1
Date: Fri, 28 Oct 2022 10:07:25 GMT
Status: Translation complete
Last updated in system: 2022-10-31 15:13:09.584776
Title: UPainting: Unified Text-to-Image Diffusion Generation with Cross-modal Guidance
Authors: Wei Li, Xue Xu, Xinyan Xiao, Jiachen Liu, Hu Yang, Guohao Li, Zhanpeng Wang, Zhifan Feng, Qiaoqiao She, Yajuan Lyu, Hua Wu
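Abstract summary: We propose UPainting, a simple yet effective approach that unifies simple and complex scene image generation. Based on architecture improvements and diverse guidance schedules, UPainting integrates cross-modal guidance from a pretrained image-text matching model into a text-conditional diffusion model. UPainting greatly outperforms other models in terms of caption similarity and image fidelity in both simple and complex scenes.
Reference score (independently computed attention score): 40.488455270651684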
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion generative models have recently greatly improved the power of text-conditioned image generation. Existing image generation models mainly include text-conditional diffusion models and cross-modal guided diffusion models, which are good at small scene image generation and complex scene image generation, respectively. In this work, we propose a simple yet effective approach, namely UPainting, to unify simple and complex scene image generation, as illustrated by the leading-samples figure in the paper. Based on architecture improvements and diverse guidance schedules, UPainting effectively integrates cross-modal guidance from a pretrained image-text matching model into a text-conditional diffusion model that utilizes a pretrained Transformer language model as the text encoder. Our key finding is that combining the power of a large-scale Transformer language model in understanding language with that of an image-text matching model in capturing cross-modal semantics and style is effective for improving the sample fidelity and image-text alignment of image generation. In this way, UPainting has a more general image generation capability and can generate images of both simple and complex scenes more effectively. On the COCO dataset, UPainting achieves much better performance than Stable Diffusion, one of the state-of-the-art text-to-image diffusion models. To comprehensively compare text-to-image models, we further create a more general benchmark, UniBench, with well-written Chinese and English prompts in both simple and complex scenes. We compare UPainting with recent models and find that UPainting greatly outperforms other models in terms of caption similarity and image fidelity in both simple and complex scenes.
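The abstract describes integrating cross-modal guidance from an image-text matching model into a text-conditional diffusion model, but it gives no formula. The following is a minimal sketch of one common way such a combination is written, assuming classifier-free guidance for the text condition plus a gradient term from an image-text similarity score. It is not the paper's exact method: the function name `combined_guidance`, the scale parameters, and the use of plain NumPy arrays in place of a real denoiser and matching model are all illustrative assumptions.

```python
import numpy as np


def combined_guidance(eps_cond, eps_uncond, sim_grad, sigma_t,
                      cfg_scale=7.5, cm_scale=1.0):
    """Combine two guidance signals into one noise prediction (a sketch).

    eps_cond   : noise predicted with the text condition
    eps_uncond : noise predicted without the text condition
    sim_grad   : gradient of an image-text similarity score with respect
                 to the noisy image (hypothetical input; in practice it
                 would come from an image-text matching model)
    sigma_t    : noise level at this step, i.e. sqrt(1 - alpha_bar_t)
    cfg_scale  : classifier-free guidance weight (assumed value)
    cm_scale   : cross-modal guidance weight (assumed value)
    """
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the text-conditional one.
    eps_cfg = eps_uncond + cfg_scale * (eps_cond - eps_uncond)
    # Gradient guidance: shift the prediction so the denoised image
    # moves toward higher image-text similarity.
    return eps_cfg - cm_scale * sigma_t * sim_grad


# Toy usage with constant arrays standing in for model outputs.
eps = combined_guidance(
    eps_cond=np.ones(4),
    eps_uncond=np.zeros(4),
    sim_grad=np.full(4, 2.0),
    sigma_t=0.5,
    cfg_scale=3.0,
    cm_scale=1.0,
)
print(eps)
```

Setting `cm_scale` to 0 reduces this to plain classifier-free guidance. A guidance schedule, in the sense the abstract mentions, could be approximated by varying `cm_scale` across denoising steps; how UPainting actually schedules it is not stated in this summary.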

Related papers