Fugu-MT Paper Translation (Abstract): Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models

Paper summary: Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models

arxiv url: http://arxiv.org/abs/2605.30038v1
Date: Thu, 28 May 2026 14:57:01 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:56.407045
Title: Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models
Title (reference translation): Alignment-Guided Score Matching for Text-Image Alignment in Diffusion Models
Authors: Jaa-Yeon Lee, Yeobin Hong, Taesung Kwon, Jong Chul Ye
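Abstract summary: Diffusion models generate highly realistic images but often struggle with precise text-image alignment. The paper proposes a lightweight, reward-free post-training method that refines soft tokens by integrating contrastive alignment guidance directly into the score-matching objective. The method achieves over 35% improvement in counting accuracy on the GenEval benchmark.
Reference score (independently computed attention score): 46.98979357654374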
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Diffusion models generate highly realistic images but often struggle with precise text-image alignment. While recent post-training methods improve alignment using external rewards or human preference signals, their performance heavily depends on reward quality and does not directly address alignment within the diffusion process itself. Recent reward-free approaches such as SoftREPA demonstrate that optimizing soft text tokens via contrastive learning can effectively improve text-image representation alignment, outperforming standard parameter-efficient fine-tuning baselines. However, the contrastive formulation can excessively penalize negative pairs, which manifests as characteristic failure cases such as over-counting and repetition. To address this issue, we propose a lightweight, reward-free post-training method that refines soft tokens by integrating contrastive alignment guidance directly into the score-matching objective of diffusion models. By assigning alignment directions at the score level, our approach mitigates these limitations and yields more coherent and semantically faithful generations. Experiments show that our method matches SoftREPA while substantially improving its failure cases, achieving over 35% improvement in counting accuracy on the GenEval benchmark. Our method is seamlessly applicable to existing diffusion backbones (SD1.5, SDXL, and SD3), and is complementary to existing RL-based diffusion post-training methods. Project page: https://jaayeon.github.io/AGSM
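Abstract (reference translation): Diffusion models generate highly realistic images but often struggle with precise text-image alignment. Recent post-training methods improve alignment using external rewards or human preference signals, but their performance depends heavily on reward quality, and they do not directly address alignment within the diffusion process itself. Recent reward-free approaches such as SoftREPA show that optimizing soft text tokens through contrastive learning effectively improves text-image representation alignment and outperforms standard parameter-efficient fine-tuning baselines. However, the contrastive formulation can over-penalize negative pairs, which shows up as characteristic failure cases such as over-counting and repetition. This work therefore proposes a lightweight, reward-free post-training method that refines soft tokens by incorporating contrastive alignment guidance directly into the score-matching objective of diffusion models. Assigning alignment directions at the score level mitigates these limitations and yields more coherent, semantically faithful generations. In experiments, the method matches SoftREPA while substantially improving on its failure cases, with over 35% improvement in counting accuracy on the GenEval benchmark. It applies seamlessly to existing diffusion backbones (SD1.5, SDXL, and SD3) and is complementary to existing RL-based diffusion post-training methods. Project page: https://jaayeon.github.io/AGSM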

Related papers