Fugu-MT Paper Translation (Abstract): TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards

Paper overview: TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards

arxiv url: http://arxiv.org/abs/2605.19320v1
Date: Tue, 19 May 2026 03:55:59 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:09.108431
Title: TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards
Authors: Mingxuan Cui, Jingpu Yang, Fengxian Ji, Qian Jiang, Zhecheng Shi, Jiaming Wang, Zirui Song, Fajri Koto, Xiuying Chen
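Abstract summary: We study text rendering as a post-training preference-alignment problem. The key component is a hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels. Experiments on FLUX.1-dev and Z-Image-Turbo show consistent gains in OCR-based text accuracy without degrading general generation quality.
Reference score (independently computed attention score): 25.768329293709176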
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Faithful text rendering remains a persistent weakness of large text-to-image generative models, as it requires both semantic instruction following and fine-grained glyph-level structure. Prior methods often improve this ability through architecture-specific modules or encoder modifications, which complicate deployment across foundation models. We study text rendering as a post-training preference-alignment problem and propose TextAlign, a non-invasive framework that keeps the generator architecture unchanged. The key component is a hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal. The resulting signal supports both Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO). Experiments on FLUX.1-dev and Z-Image-Turbo show consistent gains in OCR-based text accuracy without degrading general generation quality. Compared with strong foundation and text-rendering baselines, including SD3.5, Qwen-Image, AnyText, and TextDiffuser, these results indicate that reward design offers a scalable alternative to model redesign for improving text rendering.
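The abstract describes the reward only at a high level: binary defect judgments at the global, word, and glyph levels are converted into a scalar that can drive either GRPO or DPO. The Python sketch below illustrates one way such a pipeline could be wired together. It is not the authors' implementation. The level weights, the pass-rate aggregation, the `DefectJudgments` container, and every function name are assumptions made for illustration, and the VLM judge that would produce the flags is left out.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Tuple

# Hypothetical level weights; the abstract does not state the aggregation rule.
LEVEL_WEIGHTS = {"global": 0.5, "word": 0.3, "glyph": 0.2}


@dataclass
class DefectJudgments:
    """Binary defect flags per level (True = defect found), e.g. from a VLM judge."""
    global_level: List[bool]
    word_level: List[bool]
    glyph_level: List[bool]


def _pass_rate(flags: List[bool]) -> float:
    # Fraction of checks with no defect; an empty level counts as fully passed.
    return 1.0 if not flags else sum(not f for f in flags) / len(flags)


def scalar_reward(j: DefectJudgments) -> float:
    """Assumed aggregation: weighted average of per-level pass rates, in [0, 1]."""
    return (LEVEL_WEIGHTS["global"] * _pass_rate(j.global_level)
            + LEVEL_WEIGHTS["word"] * _pass_rate(j.word_level)
            + LEVEL_WEIGHTS["glyph"] * _pass_rate(j.glyph_level))


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style normalization: subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def preference_pair(rewards: List[float]) -> Tuple[int, int]:
    """DPO-style pair: indices of the (chosen, rejected) samples by reward."""
    chosen = max(range(len(rewards)), key=rewards.__getitem__)
    rejected = min(range(len(rewards)), key=rewards.__getitem__)
    return chosen, rejected
```

With this setup, a group of images sampled for the same prompt would each receive a `scalar_reward`. `group_relative_advantages` turns those rewards into per-sample advantages for a GRPO-style update, while `preference_pair` selects the best and worst samples as a chosen/rejected pair for DPO-style training.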
