Fugu-MT Paper Translation (Abstract): PIXART-{\delta}: Fast and Controllable Image Generation with Latent Consistency Models

Paper overview: PIXART-{\delta}: Fast and Controllable Image Generation with Latent Consistency Models

arxiv url: http://arxiv.org/abs/2401.05252v1
Date: Wed, 10 Jan 2024 16:27:38 GMT
Status: Translation complete
Last updated in system: 2024-01-11 14:06:33.239448
Title: PIXART-{\delta}: Fast and Controllable Image Generation with Latent Consistency Models
Title (reference translation): PIXART-{\delta}: Fast and Controllable Image Generation with Latent Consistency Models
Authors: Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, Zhenguo Li
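Abstract summary: PIXART-{\delta} is a text-to-image synthesis framework. It integrates the Latent Consistency Model (LCM) and ControlNet into the PIXART-{\alpha} model. PIXART-{\delta} achieves a breakthrough 0.5 seconds for generating a 1024x1024 pixel image.
Reference score (attention score computed independently by this site): 93.29160233752413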
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: This technical report introduces PIXART-{\delta}, a text-to-image synthesis framework that integrates the Latent Consistency Model (LCM) and ControlNet into the advanced PIXART-{\alpha} model. PIXART-{\alpha} is recognized for its ability to generate high-quality images of 1024px resolution through a remarkably efficient training process. The integration of LCM in PIXART-{\delta} significantly accelerates the inference speed, enabling the production of high-quality images in just 2-4 steps. Notably, PIXART-{\delta} achieves a breakthrough 0.5 seconds for generating 1024x1024 pixel images, marking a 7x improvement over the PIXART-{\alpha}. Additionally, PIXART-{\delta} is designed to be efficiently trainable on 32GB V100 GPUs within a single day. With its 8-bit inference capability (von Platen et al., 2023), PIXART-{\delta} can synthesize 1024px images within 8GB GPU memory constraints, greatly enhancing its usability and accessibility. Furthermore, incorporating a ControlNet-like module enables fine-grained control over text-to-image diffusion models. We introduce a novel ControlNet-Transformer architecture, specifically tailored for Transformers, achieving explicit controllability alongside high-quality image generation. As a state-of-the-art, open-source image generation model, PIXART-{\delta} offers a promising alternative to the Stable Diffusion family of models, contributing significantly to text-to-image synthesis.
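Abstract (reference translation): This technical report introduces PIXART-{\delta}, a text-to-image synthesis framework that integrates the Latent Consistency Model (LCM) and ControlNet into the advanced PIXART-{\alpha} model. PIXART-{\alpha} is recognized for generating high-quality images at 1024px resolution through an extremely efficient training process. Integrating LCM into PIXART-{\delta} greatly accelerates inference, producing high-quality images in just 2-4 steps. Notably, PIXART-{\delta} achieves a breakthrough 0.5 seconds for generating a 1024x1024 pixel image, a 7x improvement over PIXART-{\alpha}. In addition, PIXART-{\delta} is designed to be trained efficiently on 32GB V100 GPUs within a single day. With 8-bit inference (von Platen et al., 2023), PIXART-{\delta} can synthesize 1024px images within an 8GB GPU memory budget, greatly improving usability and accessibility. Furthermore, incorporating a ControlNet-like module enables fine-grained control over text-to-image diffusion models. We introduce a new ControlNet-Transformer architecture tailored to Transformers, achieving explicit controllability alongside high-quality image generation. As a state-of-the-art open-source image generation model, PIXART-{\delta} offers an alternative to the Stable Diffusion family of models and contributes significantly to text-to-image synthesis.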
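
Illustrative sketch (not from the report): the abstract states that LCM integration allows generation in 2-4 steps. The toy Python below shows the general idea of multistep consistency sampling on a 1-D Gaussian, where the consistency function is known in closed form. It is not PIXART-{\delta}'s implementation. The variance-exploding noise model, the constants `MU`, `S`, `T_MAX`, `EPS`, and the names `consistency_fn` and `multistep_sample` are all assumptions made for illustration.

```python
import math
import random
import statistics

# Toy 1-D setup. Every constant here is an illustrative assumption.
MU, S = 2.0, 0.5           # data distribution: N(MU, S^2)
T_MAX, EPS = 80.0, 0.002   # largest and smallest noise levels


def consistency_fn(x: float, t: float) -> float:
    """Closed-form consistency function for Gaussian data with x_t = x_0 + t * z.

    Maps a point at noise level t to the start (noise level EPS) of its
    probability-flow ODE trajectory, so consistency_fn(x, EPS) returns x.
    A real model would replace this with a trained network.
    """
    scale = math.sqrt((S**2 + EPS**2) / (S**2 + t**2))
    return MU + (x - MU) * scale


def multistep_sample(timesteps, rng):
    """Few-step sampling: denoise in one jump, re-noise to a lower level, repeat.

    Uses 1 + len(timesteps) evaluations of the consistency function.
    """
    # Start from pure noise at the largest noise level and jump to a sample.
    x = consistency_fn(rng.gauss(0.0, T_MAX), T_MAX)
    for t in timesteps:
        # Re-noise the current sample to level t, then jump back again.
        noisy = x + math.sqrt(t**2 - EPS**2) * rng.gauss(0.0, 1.0)
        x = consistency_fn(noisy, t)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    # Three intermediate levels plus the initial jump gives 4 evaluations.
    samples = [multistep_sample([20.0, 5.0, 1.0], rng) for _ in range(2000)]
    print(statistics.mean(samples), statistics.pstdev(samples))
```

With an empty `timesteps` list the sketch reduces to one-step generation. In this toy setting the extra steps change little because the consistency function is exact; with a learned, imperfect model, additional steps are generally used to trade speed for sample quality.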

Related papers

PixNerd: Pixel Neural Field Diffusion [30.872185815524286]
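We propose modeling patch-wise decoding with a neural field, giving a single-scale, single-stage, efficient, end-to-end solution. Thanks to PixNerd's efficient neural field representation, it directly achieves 2.15 FID on ImageNet $256\times256$ and 2.84 FID on ImageNet, without a complex cascade pipeline or a VAE.
Paper reference translation (metadata) (2025-07-31T06:07:20Z)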
Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation [27.795313102716726]
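We introduce 1D binary image latents as a compact discrete representation of images. The proposed method preserves high-resolution detail while keeping the compactness of 1D latents. Our text-to-image model is the first to achieve competitive performance in both diffusion and autoregressive generation.
Paper reference translation (metadata) (2025-06-26T05:48:36Z)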
FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction [91.09318592542509]
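This work challenges the residual prediction paradigm in visual autoregressive modeling. It offers a new, flexible visual autoregressive image generation paradigm. This simple, intuitive approach learns visual distributions quickly and makes the generation process more flexible and adaptable.
Paper reference translation (metadata) (2025-02-27T17:39:17Z)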
FlowDCN: Exploring DCN-like Architectures for Fast Image Generation with Arbitrary Resolution [33.07779971446476]
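We propose FlowDCN, a purely convolution-based generative model that efficiently generates high-quality images at arbitrary resolutions. FlowDCN achieves a state-of-the-art 4.30 sFID on the $256\times256$ ImageNet benchmark, along with comparable resolution-extrapolation results. We believe FlowDCN is a promising solution for scalable and flexible image synthesis.
Paper reference translation (metadata) (2024-10-30T02:48:50Z)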
OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation [95.29102596532854]
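A tokenizer acts as a translator that maps complex visual data into a compact latent space. This paper presents OmniTokenizer, a Transformer-based tokenizer for joint image and video tokenization.
Paper reference translation (metadata) (2024-06-13T17:59:26Z)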
An Image is Worth 32 Tokens for Reconstruction and Generation [54.24414696392026]
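The Transformer-based 1-Dimensional Tokenizer (TiTok) is an innovative approach that tokenizes images into 1D latent sequences. TiTok achieves performance competitive with state-of-the-art approaches. Our best-performing variant significantly outperforms DiT-XL/2 (gFID 2.13 vs. 3.04) while generating high-quality samples 74x faster.
Paper reference translation (metadata) (2024-06-11T17:59:56Z)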
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation [110.10627872744254]
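PixArt-Sigma is a Diffusion Transformer model that can directly generate images at 4K resolution. PixArt-Sigma delivers images with markedly higher fidelity and improved alignment with text prompts.
Paper reference translation (metadata) (2024-03-07T17:41:37Z)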
Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers [2.078423403798577]
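We present the Hourglass Diffusion Transformer (HDiT), an image generation model. Built on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers.
Paper reference translation (metadata) (2024-01-21T21:49:49Z)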
PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis [108.83343447275206]
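This paper presents PIXART-$\alpha$, a Transformer-based T2I diffusion model. It supports high-resolution image synthesis up to 1024px at a low training cost. PIXART-$\alpha$ excels in image quality, artistry, and semantic control.
Paper reference translation (metadata) (2023-09-30T16:18:00Z)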
CoordFill: Efficient High-Resolution Image Inpainting via Parameterized Coordinate Querying [52.91778151771145]
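This paper attempts, for the first time, to break this limitation by drawing on recent advances in continuous implicit representations. Experiments show that the proposed method processes $2048\times2048$ images in real time on a GTX 2080 Ti GPU.
Paper reference translation (metadata) (2023-03-15T11:13:51Z)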
ImageSig: A signature transform for ultra-lightweight image recognition [0.0]
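ImageSig is based on computing signatures and requires neither a convolutional structure nor an attention-based encoder. ImageSig shows unprecedented performance on hardware such as the Raspberry Pi and Jetson-nano.
Paper reference translation (metadata) (2022-05-13T23:48:32Z)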

The related papers list is generated automatically from the titles and abstracts of papers on this site.
