Fugu-MT Paper Translation (Summary): Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback

Paper summary: Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback

arxiv url: http://arxiv.org/abs/2606.06113v2
Date: Thu, 11 Jun 2026 12:02:25 GMT
Status: Translation complete
Last updated in system: 2026-06-12 13:39:59.477241
Title: Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback
Authors: Huaisong Zhang, Hao Yu, Yuxuan Zhang, Jiahe Wang, Xinrui Chen, Haoxiang Cao, Feng Lu, Wendong Zhang, Changqian Yu, Chun Yuan
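Abstract summary: Text-to-image (T2I) models still exhibit localized, subtle, and structurally complex failures. Structured Defect Grounding (SDG) models each defect as a (location, type, reason, importance) tuple and casts T2I diagnosis as structured set prediction. The authors' SDG detector outperforms leading proprietary VLMs on structured defect grounding.
Reference score (independently computed attention score): 51.08692072066352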
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite generating increasingly photorealistic images, text-to-image (T2I) models still exhibit localized, subtle, and structurally complex failures. Diagnosing these failures requires instance-level feedback that answers where a defect occurs, what type it is, why it is defective, and its importance to overall image quality. While recent dense-feedback methods move beyond scalar supervision, their heatmap-centric representations still formulate diagnosis as pixel-field regression, making it difficult to localize variable-cardinality defects and bind semantic reasons to individual failures. To address this representation bottleneck, we propose Structured Defect Grounding (SDG), which casts T2I diagnosis as structured set prediction by modeling each defect as a (location, type, reason, importance) tuple. To make this formulation trainable and measurable, we introduce SDG-30K, a 30K-image dataset with box-grounded annotations across four modern T2I generators, together with a dedicated evaluation protocol, SDG-Eval. Building on this structured representation, we further present a diagnosis-to-alignment framework in which a Vision-Language Model (VLM) serves as the SDG detector, and BoxFlow-GRPO converts predicted defect sets into box-derived, importance-weighted spatial rewards for diffusion model alignment. Extensive experiments show that our SDG detector outperforms leading proprietary VLMs on structured defect grounding, while SDG-guided rewards consistently improve T2I alignment and support localized image refinement. These results establish SDG as a unified, instance-level interface for diagnosing, evaluating, and enhancing modern generative models.
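The abstract describes each defect as a (location, type, reason, importance) tuple and says that predicted defect sets are converted into box-derived, importance-weighted spatial rewards. The abstract does not give the actual reward formula used by BoxFlow-GRPO, so the following is only a minimal sketch of one way such a representation could be rasterized into a spatial penalty. The names `Defect`, `penalty_map`, and `scalar_penalty`, the box convention, the [0, 1] importance range, and the max-over-overlaps rule are all assumptions made for illustration, not the paper's method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Defect:
    """One element of a predicted defect set (hypothetical schema)."""
    box: tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels; x1, y1 exclusive (assumed)
    defect_type: str                # "what": e.g. "anatomy", "text"
    reason: str                     # "why": free-text explanation
    importance: float               # assumed to lie in [0, 1]


def penalty_map(defects, width, height):
    """Rasterize a defect set into a height x width grid of penalties.

    Each pixel takes the largest importance of any box covering it
    (an assumed aggregation rule; pixels outside every box stay 0.0).
    """
    grid = [[0.0] * width for _ in range(height)]
    for d in defects:
        x0, y0, x1, y1 = d.box
        # Clip the box to the image bounds before filling.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                grid[y][x] = max(grid[y][x], d.importance)
    return grid


def scalar_penalty(grid):
    """Average the spatial penalties into one number (0.0 means no defects)."""
    cells = [value for row in grid for value in row]
    return sum(cells) / len(cells)


if __name__ == "__main__":
    defects = [
        Defect((0, 0, 2, 2), "anatomy", "hand has six fingers", 0.5),
        Defect((1, 1, 3, 3), "text", "sign lettering is garbled", 1.0),
    ]
    grid = penalty_map(defects, width=4, height=4)
    print(scalar_penalty(grid))
```

Because the set can hold any number of defects, the same function handles images with zero, one, or many failures, which is the variable-cardinality property the abstract contrasts with heatmap regression. A reward for alignment could then be derived from the negated penalty, though how the paper does this is not stated here.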


Related papers