Fugu-MT Paper Translation (Abstract): FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation

Paper overview: FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation

arxiv url: http://arxiv.org/abs/2605.20316v1
Date: Tue, 19 May 2026 17:59:39 GMT
Status: translation complete
Last updated in system: 2026-05-21 19:19:56.296216
Title: FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation
Title (reference translation): FullFlow: Improving Text-to-Image Flow Matching Models for Bidirectional Vision-Language Generation
Authors: Eric Tillmann Bill, Enis Simsar, Alessio Tonioni, Thomas Hofmann
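Abstract summary: FullFlow is a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision-language generator. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space.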
Reference score (independently computed attention score): 39.06289388005218
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern text-to-image diffusion models encode rich visual priors, but expose them only through one-way text-conditioned generation. Existing unified vision--language models derived from them recover bidirectional capability through large-scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior the text-to-image backbone already encodes. We introduce \emph{FullFlow}, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision--language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling text$\rightarrow$image, image$\rightarrow$text, joint sampling, and partial-text prediction with a single backbone. On Stable Diffusion 3 (SD3) under an identical trainable-parameter count and matched LoRA rank, FullFlow improves text$\rightarrow$image FID from $62.7$ to $31.6$ and image$\rightarrow$text CIDEr from $2.0$ to $99.4$ over a LoRA equivalent following the previous SOTA formulation (Dual Diffusion) at matched wall-clock training time, while reducing peak VRAM from ${\sim}84$\,GB to ${\sim}38$\,GB and raising throughput by ${\sim}8\times$ on two RTX A5000 GPUs in under 24 hours, training only ${\sim}5\%$ of the backbone parameters. The same recipe transfers to FLUX.1-dev and supports downstream VQA through partial-text generation. These results show that strong bidirectional vision--language capability can be unlocked from pretrained text-to-image flow models without full multimodal pretraining.
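Abstract (reference translation): Modern text-to-image diffusion models encode rich visual priors but expose them only through one-way, text-conditioned generation. Existing unified vision-language models recover bidirectional capability through large-scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior that the text-to-image backbone already encodes. This paper introduces \emph{FullFlow}, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision-language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling text$\rightarrow$image, image$\rightarrow$text, joint sampling, and partial-text prediction with a single backbone. On Stable Diffusion 3 (SD3), under an identical trainable-parameter count and matched LoRA rank, FullFlow improves text$\rightarrow$image FID from $62.7$ to $31.6$ and image$\rightarrow$text CIDEr from $2.0$ to $99.4$ over a LoRA equivalent that follows the previous SOTA formulation (Dual Diffusion) at matched wall-clock training time, while reducing peak VRAM from ${\sim}84$\,GB to ${\sim}38$\,GB and raising throughput by ${\sim}8\times$, training on two RTX A5000 GPUs in under 24 hours. The same recipe transfers to FLUX.1-dev and supports downstream VQA through partial-text generation. These results suggest that strong bidirectional vision-language capability can be unlocked from pretrained text-to-image flow models without full multimodal pretraining.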
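
The "trajectory selection in a two-dimensional generative space" described in the abstract can be pictured as choosing a path of (image timestep, text timestep) pairs. The following is a minimal sketch of that idea only, not the authors' implementation. The function name `trajectory`, the mode names, the linear schedule, and the convention that 0.0 means "pure noise or empty text" and 1.0 means "fully generated" are all assumptions made for illustration; the abstract does not specify them. Partial-text prediction is omitted because the abstract gives no detail on how its starting point is chosen.

```python
# Illustrative sketch only: schedules of (t_image, t_text) pairs.
# ASSUMPTION (not stated in the abstract): timesteps run from
# 0.0 (pure noise / empty text) to 1.0 (fully generated).

def trajectory(mode, num_steps=4):
    """Return a hypothetical list of (t_image, t_text) pairs for a sampling mode."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    # Evenly spaced values from 0.0 to 1.0 inclusive (an assumed linear schedule).
    ramp = [i / num_steps for i in range(num_steps + 1)]
    if mode == "text_to_image":
        # Text is given, so its timestep stays at 1.0; only the image advances.
        return [(t, 1.0) for t in ramp]
    if mode == "image_to_text":
        # Image is given, so its timestep stays at 1.0; only the text advances.
        return [(1.0, t) for t in ramp]
    if mode == "joint":
        # Both modalities are generated together along the diagonal.
        return [(t, t) for t in ramp]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in ("text_to_image", "image_to_text", "joint"):
        print(mode, trajectory(mode, num_steps=2))
```

Under these assumptions, each task corresponds to a different path through the same unit square: an edge for conditional generation in either direction and the diagonal for joint sampling, which is why a single backbone conditioned on both timesteps could in principle serve all of them.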


Related papers