Fugu-MT Paper Translation (Abstract): TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering

Paper overview: TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering

arxiv url: http://arxiv.org/abs/2604.24459v1
Date: Mon, 27 Apr 2026 13:28:57 GMT
Status: Translation complete
System update date: 2026-04-28 17:12:08.018778
Title: TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering
Title (reference translation): TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering
Authors: Dongxing Mao, Yilin Wang, Linjie Li, Zhengyuan Yang, Alex Jinpeng Wang,
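Abstract summary: Despite advances in text-to-image generation, models still struggle to render prompt-specified text with the correct spatial layout. TextGround4M is a dataset of over 4 million prompt-image pairs, each annotated with span-level text grounded in the prompt and the corresponding bounding boxes. The paper proposes a lightweight training strategy for autoregressive T2I models that appends layout-aware span tokens during training, without altering the model architecture or inference behavior.
Reference score (attention score computed independently by this site): 64.22226877213521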
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite recent advances in text-to-image generation, models still struggle to accurately render prompt-specified text with correct spatial layout -- especially in multi-span, structured settings. This challenge is driven not only by the lack of datasets that align prompts with the exact text and layout expected in the image, but also by the absence of effective metrics for evaluating layout quality. To address these issues, we introduce TextGround4M, a large-scale dataset of over 4 million prompt-image pairs, each annotated with span-level text grounded in the prompt and corresponding bounding boxes. This enables fine-grained supervision for layout-aware, prompt-grounded text rendering. Building on this, we propose a lightweight training strategy for autoregressive T2I models that appends layout-aware span tokens during training, without altering model architecture or inference behavior. We further construct a benchmark with stratified layout complexity to evaluate both open-source and proprietary models in a zero-shot setting. In addition, we introduce two layout-aware metrics to address the long-standing lack of spatial evaluation in text rendering. Our results show that models trained on TextGround4M outperform strong baselines in text fidelity, spatial accuracy, and prompt consistency, highlighting the importance of fine-grained layout supervision for grounded T2I generation.
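Abstract (reference translation): Despite recent advances in text-to-image generation, models still struggle to accurately render prompt-specified text with the correct spatial layout, especially in multi-span, structured settings. This challenge is driven not only by the lack of datasets that align prompts with the exact text and layout expected in the image, but also by the lack of effective metrics for evaluating layout quality. To address these issues, we introduce TextGround4M, a large-scale dataset of over 4 million prompt-image pairs. This enables fine-grained supervision for layout-aware, prompt-grounded text rendering. Building on this, we propose a lightweight training strategy for autoregressive T2I models that appends layout-aware span tokens during training, without altering model architecture or inference behavior. We further construct a benchmark with stratified layout complexity to evaluate both open-source and proprietary models in a zero-shot setting. In addition, we introduce two layout-aware metrics to address the long-standing lack of spatial evaluation in text rendering. Our results show that models trained on TextGround4M outperform strong baselines in text fidelity, spatial accuracy, and prompt consistency, highlighting the importance of fine-grained layout supervision for T2I generation.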
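
The abstract describes span-level text annotations with bounding boxes and a training strategy that appends layout-aware span tokens only during training. It does not give the annotation schema or the token format, so the following Python sketch is purely illustrative. The `Span` class, the `<span>`/`<box>` markers, the coordinate quantization into 1000 bins, and the choice to append the serialized spans to the prompt text are all hypothetical assumptions, not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical span annotation: rendered text plus its bounding box."""
    text: str
    box: tuple[int, int, int, int]  # assumed (x0, y0, x1, y1) in pixels


def quantize(value: int, size: int, bins: int = 1000) -> int:
    """Map a pixel coordinate to an integer bin in [0, bins - 1]."""
    return min(bins - 1, max(0, value * bins // size))


def serialize_spans(spans: list[Span], width: int, height: int) -> str:
    """Turn span annotations into a string of assumed layout tokens."""
    parts = []
    for span in spans:
        x0, y0, x1, y1 = span.box
        coords = [
            quantize(x0, width), quantize(y0, height),
            quantize(x1, width), quantize(y1, height),
        ]
        box = ",".join(str(c) for c in coords)
        parts.append(f"<span>{span.text}<box>{box}</box></span>")
    return "".join(parts)


def build_text(prompt: str, spans: list[Span], width: int, height: int,
               training: bool) -> str:
    """Append span tokens during training only; inference sees the bare prompt."""
    if not training:
        return prompt
    return prompt + serialize_spans(spans, width, height)


spans = [Span("OPEN", (100, 50, 300, 150))]
train_text = build_text('A shop sign that says "OPEN"', spans, 1000, 500, training=True)
infer_text = build_text('A shop sign that says "OPEN"', spans, 1000, 500, training=False)
```

In this sketch the inference-time input is identical to the plain prompt, which mirrors the abstract's claim that inference behavior is unchanged. Where the span tokens sit in the real training sequence is not stated in the source.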
