Fugu-MT paper translation (abstract): High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding

Paper overview: High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding

arxiv url: http://arxiv.org/abs/2603.13389v1
Date: Wed, 11 Mar 2026 07:02:48 GMT
Status: translation complete
Last updated in system: 2026-03-17 16:19:35.154069
Title: High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding
Authors: Ji Woo Hong, Hee Suk Yoon, Gwanhyeong Koo, Eunseop Yoon, SooHwan Eom, Qi Dai, Chong Luo, Chang D. Yoo
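Abstract summary: This paper proposes a diffusion-based decoding framework that improves image fidelity by training only a diffusion decoder on the output image-token logits of pre-trained VLMs. A lightweight Logit Calibration mitigates the train-inference gap by aligning training-time proxy logits from the VQ-VAE encoder with VLM-generated logits. The proposed method consistently improves visual fidelity for both VQ-VAE reconstructions and text-to-image generations from VLM-predicted tokens.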
Reference score (attention score computed independently by the site): 64.13126192228604
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Recent large-scale vision-language models (VLMs) have shown remarkable text-to-image generation capabilities, yet their visual fidelity remains constrained by the discrete image tokenization, which poses a major challenge. Although several studies have explored continuous representation modeling to enhance visual quality, adapting pre-trained VLM models to such representations requires large-scale data and training costs comparable to the original pre-training. To circumvent this limitation, we propose a diffusion-based decoding framework that enhances image fidelity by training only a diffusion decoder on the output image-token logits of pre-trained VLMs, thereby preserving the original model intact. At its core, Logit-to-Code Distributional Mapping converts the VLM's image-token logits into continuous, distribution-weighted code vectors with uncertainty features, providing an effective conditioning signal for diffusion decoding. A lightweight Logit Calibration aligns training-time proxy logits from the VQ-VAE encoder with VLM-generated logits, mitigating the train-inference gap. Conditioned on these representations, the Distribution-Conditioned Diffusion Decoder generates high-fidelity images. Achieved solely through short training on ImageNet-1K, our method consistently improves visual fidelity for both VQ-VAE reconstructions and text-to-image generations from VLM-predicted tokens.
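
The Logit-to-Code Distributional Mapping described in the abstract can be illustrated with a minimal sketch. The abstract only states that image-token logits become continuous, distribution-weighted code vectors with uncertainty features; everything else here is an assumption for illustration, not the authors' implementation. In particular, the function name `logits_to_code_features`, the `temperature` parameter, and the choice of normalized entropy and maximum probability as the uncertainty features are hypothetical.

```python
import numpy as np

def logits_to_code_features(logits, codebook, temperature=1.0):
    """Map image-token logits to distribution-weighted code vectors.

    logits:   (N, K) array, one row of logits per image-token position.
    codebook: (K, D) array of VQ-VAE code embeddings.
    Returns an (N, D + 2) array: the expected code vector followed by
    two uncertainty features (assumed here, not taken from the paper).
    """
    # Numerically stable softmax over the K codebook entries.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Distribution-weighted code vector: expectation of the codebook
    # under the predicted token distribution, instead of a hard argmax.
    expected_code = probs @ codebook  # (N, D)

    # Assumed uncertainty features: entropy normalized to roughly [0, 1]
    # and the probability of the most likely code.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1, keepdims=True)
    norm_entropy = entropy / np.log(codebook.shape[0])
    max_prob = probs.max(axis=-1, keepdims=True)

    return np.concatenate([expected_code, norm_entropy, max_prob], axis=-1)
```

With sharply peaked logits the output approaches the single selected code vector, matching ordinary VQ-VAE decoding input. With flat logits it becomes a blend of codes accompanied by a high entropy value, which is the kind of soft signal a diffusion decoder could condition on.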

Related papers