Fugu-MT paper translation (abstract): FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model

Paper overview: FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model

arxiv url: http://arxiv.org/abs/2402.12376v3
Date: Thu, 10 Oct 2024 13:43:20 GMT
Status: translation complete
Last updated in system: 2024-11-28 17:07:30.891813
Title: FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model
Title (reference translation): FiTv2: A Scalable and Flexible Vision Transformer for Diffusion Models
Authors: Zidong Wang, Zeyu Lu, Di Huang, Cai Zhou, Wanli Ouyang, Lei Bai
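Abstract summary: This paper proposes a transformer architecture for generating images at unrestricted resolutions and aspect ratios. Comprehensive experiments demonstrate the exceptional performance of FiTv2 across a broad range of resolutions.
Reference score (independently computed attention score): 80.69865295743149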
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Nature is infinitely resolution-free. In the context of this reality, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To address this limitation, we conceptualize images as sequences of tokens with dynamic sizes, rather than traditional methods that perceive images as fixed-resolution grids. This perspective enables a flexible training strategy that seamlessly accommodates various aspect ratios during both training and inference, thus promoting resolution generalization and eliminating biases introduced by image cropping. On this basis, we present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios. We further upgrade the FiT to FiTv2 with several innovative designs, including the Query-Key vector normalization, the AdaLN-LoRA module, a rectified flow scheduler, and a Logit-Normal sampler. Enhanced by a meticulously adjusted network structure, FiTv2 exhibits 2x convergence speed of FiT. When incorporating advanced training-free extrapolation techniques, FiTv2 demonstrates remarkable adaptability in both resolution extrapolation and diverse resolution generation. Additionally, our exploration of the scalability of the FiTv2 model reveals that larger models exhibit better computational efficiency. Furthermore, we introduce an efficient post-training strategy to adapt a pre-trained model for the high-resolution generation. Comprehensive experiments demonstrate the exceptional performance of FiTv2 across a broad range of resolutions. We have released all the codes and models at https://github.com/whlzy/FiT to promote the exploration of diffusion transformer models for arbitrary-resolution image generation.
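Abstract (reference translation): Nature is infinitely resolution-free. In this context, existing diffusion models such as Diffusion Transformers often face challenges when processing image resolutions outside their trained domain. To address this limitation, we conceptualize images as sequences of dynamically sized tokens, rather than perceiving them as fixed-resolution grids as traditional methods do. This perspective enables a flexible training strategy that seamlessly accommodates various aspect ratios during both training and inference. In this work, we present the Flexible Vision Transformer (FiT), a transformer architecture for generating images at unrestricted resolutions and aspect ratios. We further upgrade FiT to FiTv2 with several innovative designs, including Query-Key vector normalization, an AdaLN-LoRA module, a rectified flow scheduler, and a Logit-Normal sampler. Enhanced by a finely tuned network structure, FiTv2 converges twice as fast as FiT. When combined with advanced training-free extrapolation techniques, FiTv2 shows remarkable adaptability in both resolution extrapolation and diverse-resolution generation. In addition, our exploration of the scalability of the FiTv2 model shows that larger models are more computationally efficient. We also introduce an efficient post-training strategy for adapting a pre-trained model to high-resolution generation. Comprehensive experiments demonstrate the exceptional performance of FiTv2 across a broad range of resolutions. We have released all code and models at https://github.com/whlzy/FiT to promote the exploration of diffusion transformer models for arbitrary-resolution image generation.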
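
The abstract's central idea, treating an image as a variable-length token sequence instead of a fixed-resolution grid, can be illustrated with a small sketch. This is not the authors' implementation (their code is in the repository linked above). The helper names `patchify` and `pad_tokens`, the single-channel toy images, the patch size of 2, and the choice to pad every sequence to a common `MAX_LEN` with a boolean mask are all assumptions made here for illustration.

```python
from typing import List, Tuple

Image = List[List[float]]  # toy single-channel image: rows of pixel values


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into flattened patch tokens in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[top + dy][left + dx]
                           for dy in range(patch) for dx in range(patch)])
    return tokens


def pad_tokens(tokens: List[List[float]],
               max_len: int) -> Tuple[List[List[float]], List[bool]]:
    """Pad a token sequence to max_len; the mask is True for real tokens."""
    if len(tokens) > max_len:
        raise ValueError("sequence is longer than max_len")
    dim = len(tokens[0])
    n_pad = max_len - len(tokens)
    padded = tokens + [[0.0] * dim for _ in range(n_pad)]
    mask = [True] * len(tokens) + [False] * n_pad
    return padded, mask


# Two images with different aspect ratios share one batch without cropping.
wide = [[float(r * 8 + c) for c in range(8)] for r in range(4)]  # 4 x 8
tall = [[float(r * 2 + c) for c in range(2)] for r in range(6)]  # 6 x 2
MAX_LEN = 8  # hypothetical token budget per image
batch = [pad_tokens(patchify(img, patch=2), MAX_LEN) for img in (wide, tall)]
for padded, mask in batch:
    print(len(padded), sum(mask))  # padded length, number of real tokens
```

In this sketch the wide image yields 8 real tokens and the tall image 3, and both are padded to the same length of 8. A model consuming such a batch would use the mask to ignore padded positions, so neither image needs to be cropped or resized to a shared grid.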
