Fugu-MT paper translation (summary): Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget

Paper summary: Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget

arxiv url: http://arxiv.org/abs/2407.15811v1
Date: Mon, 22 Jul 2024 17:23:28 GMT
Status: translation complete
Last updated in system: 2024-07-23 13:51:10.713263
Title: Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget
Authors: Vikash Sehwag, Xianghao Kong, Jingtao Li, Michael Spranger, Lingjuan Lyu
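Abstract summary: We demonstrate low-cost training of large-scale T2I diffusion transformer models. We train a 1.16 billion parameter sparse transformer at an economical cost of only \$1,890 and achieve a 12.7 FID in zero-shot generation. We aim to release our end-to-end training pipeline to further democratize the training of large-scale diffusion models on micro-budgets.
Reference score (independently computed attention score): 53.311109531586844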
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As scaling laws in generative AI push performance, they also simultaneously concentrate the development of these models among actors with large computational resources. With a focus on text-to-image (T2I) generative models, we aim to address this bottleneck by demonstrating very low-cost training of large-scale T2I diffusion transformer models. As the computational cost of transformers increases with the number of patches in each image, we propose to randomly mask up to 75% of the image patches during training. We propose a deferred masking strategy that preprocesses all patches using a patch-mixer before masking, thus significantly reducing the performance degradation with masking, making it superior to model downscaling in reducing computational cost. We also incorporate the latest improvements in transformer architecture, such as the use of mixture-of-experts layers, to improve performance and further identify the critical benefit of using synthetic images in micro-budget training. Finally, using only 37M publicly available real and synthetic images, we train a 1.16 billion parameter sparse transformer with only \$1,890 economical cost and achieve a 12.7 FID in zero-shot generation on the COCO dataset. Notably, our model achieves competitive FID and high-quality generations while incurring 118$\times$ lower cost than stable diffusion models and 14$\times$ lower cost than the current state-of-the-art approach that costs \$28,400. We aim to release our end-to-end training pipeline to further democratize the training of large-scale diffusion models on micro-budgets.
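The deferred masking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_patch_mixer` is a hypothetical stand-in for the patch-mixer (here it only adds the mean token to every token), and the patch count, embedding size, and function names are assumptions chosen for illustration. The point it shows is the ordering: the mixer sees all patches first, and only then are 75% of them dropped, so the surviving tokens still carry some information about the masked ones.

```python
import random


def toy_patch_mixer(tokens):
    """Hypothetical stand-in for a patch-mixer: add the mean token to every
    token, so each patch carries a little context from all other patches."""
    dim = len(tokens[0])
    mean = [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]
    return [[tok[d] + mean[d] for d in range(dim)] for tok in tokens]


def random_keep_indices(num_patches, mask_ratio, rng):
    """Pick the patches that survive masking, returned in sorted order."""
    num_keep = max(1, round(num_patches * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_patches), num_keep))


def deferred_mask(tokens, mask_ratio, rng):
    """Mix all patches first, then drop a random subset (deferred masking)."""
    mixed = toy_patch_mixer(tokens)  # the mixer processes every patch
    keep = random_keep_indices(len(tokens), mask_ratio, rng)
    return [mixed[i] for i in keep], keep


rng = random.Random(0)
# Assumed toy sizes: 64 patches, 8-dimensional embeddings.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(64)]
kept, keep_idx = deferred_mask(tokens, mask_ratio=0.75, rng=rng)
print(len(kept), len(kept[0]))  # 16 8
```

In this sketch only 16 of the 64 tokens would be passed on to the backbone transformer, which is where the compute saving from masking comes from.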

Related papers

Diffusion Transformer-to-Mamba Distillation for High-Resolution Image Generation [65.46359545280546]
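We introduce diffusion Transformer-to-Mamba distillation (T2MD) to form an efficient training pipeline. We establish a hybrid diffusion model combining self-attention and Mamba that achieves both efficiency and global dependencies. Experiments show that the training path has low overhead and leads to high-quality text-to-image generation.
Paper reference translation (metadata) (2025-06-23T18:01:19Z)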
HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration [31.982294870690925]
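We propose a new learning-based caching framework called HarmoniCa. It incorporates Step-Wise Denoising Training (SDT) to ensure the continuity of the denoising process. It also incorporates an Image Error Proxy-Guided Objective (IEPO) to balance image quality against cache utilization.
Paper reference translation (metadata) (2024-10-02T16:34:29Z)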
TerDiT: Ternary Diffusion Models with Transformers [88.03738506648291]
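TerDiT is the first quantization-aware training scheme for low-bit diffusion transformer models. Model sizes range from 600M to 4.2B, and image resolutions from 256$\times$256 to 512$\times$512.
Paper reference translation (metadata) (2024-05-23T17:57:24Z)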
DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis [56.849285913695184]
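Diffusion Mamba (DiM) is a sequence model for high-resolution image synthesis. The DiM architecture achieves inference-time efficiency for high-resolution images. Experiments demonstrate the effectiveness and efficiency of DiM.
Paper reference translation (metadata) (2024-05-23T06:53:18Z)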
A Cost-Efficient FPGA Implementation of Tiny Transformer Model using Neural ODE [0.8403582577557918]
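Transformers have been adopted for image recognition tasks and outperform CNNs and RNNs, but they suffer from high training cost and computational complexity. This paper proposes a lightweight hybrid model that uses a Neural ODE as the backbone instead of ResNet. The proposed model is deployed on a modest-sized FPGA device for edge computing.
Paper reference translation (metadata) (2024-01-05T09:32:39Z)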
ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models [59.90959789767886]
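We show that optimizing the consistency training loss minimizes the Wasserstein distance between the target and generated distributions. The method improves FID scores on the CIFAR10, ImageNet 64$\times$64, and LSUN Cat 256$\times$256 datasets.
Paper reference translation (metadata) (2023-11-23T16:49:06Z)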
PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis [108.83343447275206]
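This paper presents PIXART-$\alpha$, a Transformer-based T2I diffusion model. It supports high-resolution image synthesis up to 1024px at low training cost. PIXART-$\alpha$ excels in image quality, artistry, and semantic control.
Paper reference translation (metadata) (2023-09-30T16:18:00Z)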
Fast Training of Diffusion Models with Masked Transformers [107.77340216247516]
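We propose an efficient approach for training large-scale diffusion models with masked transformers. Specifically, we randomly mask out a proportion of the patches in the diffused input images during training. Experiments on ImageNet-256x256 and ImageNet-512x512 show that our approach achieves generation performance competitive with, and even better than, the state-of-the-art Diffusion Transformer (DiT) model.
Paper reference translation (metadata) (2023-06-15T17:38:48Z)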
Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models [166.64847903649598]
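We propose Patch Diffusion, a generic patch-wise training framework. Patch Diffusion significantly reduces training time while improving data efficiency. We obtain strong FID scores in line with state-of-the-art benchmarks.
Paper reference translation (metadata) (2023-04-25T02:35:54Z)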
Dual-former: Hybrid Self-attention Transformer for Efficient Image Restoration [6.611849560359801]
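We propose Dual-former, which combines the powerful global modeling ability of self-attention modules with the local modeling ability of convolutions in an overall architecture. Experiments show that Dual-former achieves a 1.91dB gain over the state-of-the-art MAXIM method on the Indoor dataset. For single-image deraining, it exceeds the SOTA method by 0.1dB PSNR on the average results over five datasets with only 21.5% of the GFLOPs.
Paper reference translation (metadata) (2022-10-03T16:39:21Z)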

The related papers list is generated automatically from the titles and abstracts of papers on this site.

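This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).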