Fugu-MT Paper Translation (Abstract): Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods

Paper overview: Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods

arxiv url: http://arxiv.org/abs/2606.24484v1
Date: Tue, 23 Jun 2026 12:18:50 GMT
Status: Translation complete
System update date: 2026-06-24 22:16:48.939475
Title: Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods
Authors: Xingsong Ye, Yongkun Du, Jiaxin Zhang, Haojie Zhang, Chong Sun, Chen Li, Jing Lyu, Zhineng Chen
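Abstract summary: WordArt (artistic text) features highly customized fonts, textures, and layouts. Existing STR datasets and methods, typically built around regular scene text and fixed-template inputs, struggle to scale to WATER. A 2M synthetic dataset, WATER-S, is constructed, with a scale hundreds of times larger than that of existing artistic text data.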
Reference score (independently computed attention score): 24.552635491974417
License: http://creativecommons.org/licenses/by/4.0/
Abstract: WordArt (artistic text) features highly customized fonts, textures, and layouts, making WordArt-oriented scene TExt Recognition (WATER) substantially more challenging than general Scene Text Recognition (STR). Existing STR datasets and methods, typically built around regular scene text and fixed-template inputs, struggle to scale to WATER. Thus, we aim to advance this task from both data and model perspectives. On the data side, we construct a 2M synthetic dataset, WATER-S, whose scale is hundreds of times larger than that of existing artistic text data. WATER-S consists of two complementary subsets. One is rendered by an upgraded rendering pipeline (SynthWordArt), which provides highly accurate and controllable synthetic WordArt data. The other is generated by combining Qwen3-VL for prompt mining and Z-Image for image synthesis, which improves the coverage of realistic and diverse data. On the model side, we propose WATERec. It adopts a visual encoder supporting arbitrary-shaped inputs and an autoregressive decoder to model complex layouts, structurally breaking the bottleneck of fixed-template STR on WordArt. Experiments show that this architecture outperforms prior STR methods, achieving state-of-the-art performance on irregular texts such as WordArt. Together with WATER-R, carefully reorganized from existing real STR data, our strong baseline with the new synthetic data and model design reaches 90.40% accuracy on WordArt-Bench, surpassing both general-purpose and OCR-specialized vision-language models by a large margin. Code and data are available at https://github.com/YesianRohn/WATER.
