Fugu-MT Paper Translation (Abstract): CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration

Paper summary: CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration

arxiv url: http://arxiv.org/abs/2603.20741v1
Date: Sat, 21 Mar 2026 10:00:35 GMT
Status: Translation complete
Last updated in system: 2026-03-24 19:11:39.065238
Title: CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration
Title (reference translation): CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration
Authors: Xiefan Guo, Xinzhu Ma, Haiyu Zhang, Di Huang
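Abstract summary: We introduce CTCal (Cross-Timestep Self-Calibration) to achieve precise alignment between text prompts and generated images. CTCal is model-agnostic and can be seamlessly integrated into existing text-to-image diffusion models.
Reference score (independently computed attention score): 39.59945414053394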
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in text-to-image synthesis have been largely propelled by diffusion-based models, yet achieving precise alignment between text prompts and generated images remains a persistent challenge. We find that this difficulty arises primarily from the limitations of conventional diffusion loss, which provides only implicit supervision for modeling fine-grained text-image correspondence. In this paper, we introduce Cross-Timestep Self-Calibration (CTCal), founded on the supporting observation that establishing accurate text-image alignment within diffusion models becomes progressively more difficult as the timestep increases. CTCal leverages the reliable text-image alignment (i.e., cross-attention maps) formed at smaller timesteps with less noise to calibrate the representation learning at larger timesteps with more noise, thereby providing explicit supervision during training. We further propose a timestep-aware adaptive weighting to achieve a harmonious integration of CTCal and diffusion loss. CTCal is model-agnostic and can be seamlessly integrated into existing text-to-image diffusion models, encompassing both diffusion-based (e.g., SD 2.1) and flow-based approaches (e.g., SD 3). Extensive experiments on T2I-Compbench++ and GenEval benchmarks demonstrate the effectiveness and generalizability of the proposed CTCal. Our code is available at https://github.com/xiefan-guo/ctcal.
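Abstract (reference translation): Recent progress in text-to-image synthesis has been driven largely by diffusion models, yet achieving precise alignment between text prompts and generated images remains a persistent challenge. This difficulty stems from the limitations of the conventional diffusion loss, which provides only implicit supervision for modeling fine-grained text-image correspondence. In this paper, we introduce CTCal (Cross-Timestep Self-Calibration), based on the observation that establishing accurate text-image alignment within diffusion models becomes progressively more difficult as the timestep increases. CTCal leverages the reliable text-image alignment (i.e., cross-attention maps) formed at smaller, less noisy timesteps to calibrate representation learning at larger, noisier timesteps, providing explicit supervision during training. We further propose a timestep-aware adaptive weighting to integrate CTCal and the diffusion loss harmoniously. CTCal is model-agnostic and can be seamlessly integrated into existing text-to-image diffusion models, including both diffusion-based (e.g., SD 2.1) and flow-based approaches (e.g., SD 3). Extensive experiments on the T2I-Compbench++ and GenEval benchmarks demonstrate the effectiveness and generalizability of the proposed CTCal. Our code is available at https://github.com/xiefan-guo/ctcal.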
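
The abstract describes the mechanism only in words, so the following is a minimal sketch of the idea rather than the authors' implementation. It assumes the calibration term is a mean squared difference between a cross-attention map taken at a smaller timestep (used as a fixed target) and the map at a larger timestep, and that the timestep-aware weight grows linearly with the larger timestep. The loss form, the linear schedule, and every function and parameter name here are hypothetical; the actual formulation is in the paper and the linked repository.

```python
# Illustrative sketch only: the loss form, the weighting schedule, and all
# names below are assumptions, not taken from the CTCal paper or its code.


def calibration_loss(attn_small_t, attn_large_t):
    """Mean squared difference between two flattened cross-attention maps.

    attn_small_t: map from the smaller, less noisy timestep, used as a fixed
                  target (in real training no gradient would flow through it).
    attn_large_t: map from the larger, noisier timestep being calibrated.
    """
    if len(attn_small_t) != len(attn_large_t):
        raise ValueError("attention maps must have the same number of entries")
    squared = [(a - b) ** 2 for a, b in zip(attn_small_t, attn_large_t)]
    return sum(squared) / len(squared)


def timestep_weight(t_large, num_timesteps):
    """Hypothetical linear schedule: the weight grows with the larger timestep."""
    return t_large / num_timesteps


def total_loss(diffusion_loss, attn_small_t, attn_large_t, t_large, num_timesteps):
    """Add the weighted calibration term to the usual diffusion loss."""
    weight = timestep_weight(t_large, num_timesteps)
    return diffusion_loss + weight * calibration_loss(attn_small_t, attn_large_t)


# Toy usage with made-up numbers: one text token attending over four pixels.
target_map = [0.7, 0.2, 0.1, 0.0]   # attention at a small timestep
noisy_map = [0.4, 0.3, 0.2, 0.1]    # attention at a large timestep
loss = total_loss(0.5, target_map, noisy_map, t_large=800, num_timesteps=1000)
print(loss)
```

In a real training loop the maps would be tensors read from the model's cross-attention layers, and the small-timestep map would be detached from the computation graph so that only the large-timestep branch is updated.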

Related papers