Fugu-MT Paper Translation (Abstract): FontUse: A Data-Centric Approach to Style- and Use-Case-Conditioned In-Image Typography

Paper overview: FontUse: A Data-Centric Approach to Style- and Use-Case-Conditioned In-Image Typography

arxiv url: http://arxiv.org/abs/2603.06038v1
Date: Fri, 06 Mar 2026 08:47:50 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:45.392695
Title: FontUse: A Data-Centric Approach to Style- and Use-Case-Conditioned In-Image Typography
Title (reference translation): FontUse: A Data-Centric Approach to Style- and Use-Case-Conditioned In-Image Typography
Authors: Xia Xin, Yuki Endo, Yoshihiro Kanamori
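Abstract summary: We train image generation models using targeted supervision derived from a structured annotation pipeline specialized for typography. Our pipeline constructs FontUse, a large-scale typography-focused dataset consisting of about 70K images annotated with user-friendly prompts. For evaluation, we introduce a Long-CLIP-based metric that measures alignment between generated typography and requested attributes.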
Reference score (independently computed attention score): 5.862480696321742
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent text-to-image models can generate high-quality images from natural-language prompts, yet controlling typography remains challenging: requested typographic appearance is often ignored or only weakly followed. We address this limitation with a data-centric approach that trains image generation models using targeted supervision derived from a structured annotation pipeline specialized for typography. Our pipeline constructs a large-scale typography-focused dataset, FontUse, consisting of about 70K images annotated with user-friendly prompts, text-region locations, and OCR-recognized strings. The annotations are automatically produced using segmentation models and multimodal large language models (MLLMs). The prompts explicitly combine font styles (e.g., serif, script, elegant) and use cases (e.g., wedding invitations, coffee-shop menus), enabling intuitive specification even for novice users. Fine-tuning existing generators with these annotations allows them to consistently interpret style and use-case conditions as textual prompts without architectural modification. For evaluation, we introduce a Long-CLIP-based metric that measures alignment between generated typography and requested attributes. Experiments across diverse prompts and layouts show that models trained with our pipeline produce text renderings more consistent with prompts than competitive baselines. The source code for our annotation pipeline is available at https://github.com/xiaxinz/FontUSE.
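Abstract (reference translation): Recent text-to-image models can generate high-quality images from natural-language prompts, but controlling typography remains difficult. We address this limitation with a data-centric approach that trains image generation models on supervision derived from a structured annotation pipeline specialized for typography. Our pipeline constructs FontUse, a large-scale typography-focused dataset of about 70K images annotated with user-friendly prompts, text-region locations, and OCR-recognized strings. The annotations are generated automatically using segmentation models and multimodal large language models (MLLMs). The prompts explicitly combine font styles (e.g., serif, script, elegant) with use cases (e.g., wedding invitations, coffee-shop menus), so that even novice users can specify typography intuitively. Fine-tuning existing generators with these annotations lets them consistently interpret style and use-case conditions given as textual prompts, without architectural changes. For evaluation, we introduce a Long-CLIP-based metric that measures alignment between generated typography and requested attributes. Experiments across diverse prompts and layouts show that models trained with our pipeline produce text renderings that are more consistent with the prompts than competing baselines. The source code for the annotation pipeline is available at https://github.com/xiaxinz/FontUSE.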
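As a rough illustration of the annotation described in the abstract (a user-friendly prompt combining font styles and a use case, plus text-region locations and OCR-recognized strings), the record could be sketched as below. This is only a sketch under assumptions: the class names, field names, box convention, and prompt template are hypothetical and are not taken from the FontUse dataset or its source code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TextRegion:
    # Hypothetical box convention: (x_min, y_min, x_max, y_max) in pixels.
    box: Tuple[int, int, int, int]
    # String recognized by OCR inside the box.
    text: str


@dataclass
class TypographyAnnotation:
    # All field names here are illustrative assumptions, not the dataset schema.
    image_path: str
    font_styles: List[str]
    use_case: str
    regions: List[TextRegion] = field(default_factory=list)

    def prompt(self) -> str:
        # Hypothetical template: join the style words, then name the use case.
        styles = ", ".join(self.font_styles)
        return f"{styles} typography for {self.use_case}"


example = TypographyAnnotation(
    image_path="example.png",
    font_styles=["serif", "elegant"],
    use_case="a wedding invitation",
    regions=[TextRegion(box=(40, 60, 360, 120), text="Save the Date")],
)
print(example.prompt())
```

Keeping style words and the use case as separate fields, and composing the prompt from them, is one way to make the combination described in the abstract explicit; the actual pipeline may represent prompts differently.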

Related papers