Fugu-MT 論文翻訳(概要): Masked Next-Scale Prediction for Self-supervised Scene Text Recognition

論文の概要: Masked Next-Scale Prediction for Self-supervised Scene Text Recognition

arxiv url: http://arxiv.org/abs/2605.14885v1
Date: Thu, 14 May 2026 14:28:55 GMT
ステータス: 翻訳完了
システム内更新日: 2026-05-15 21:45:34.873698
Title: Masked Next-Scale Prediction for Self-supervised Scene Text Recognition
Title（参考訳）: 自己教師型シーン音声認識のためのマスク次世代予測
Authors: Zhuohao Chen, Zeng Li, Yifei Zhang, Chang Liu, Yu Zhou,
Abstract要約: シーンテキスト認識は、粗いレイアウトからきめ細かい文字ストロークへと進化する視覚構造をモデル化する必要がある。 Masked Image Modeling (MIM)のような近年の自己教師型アプローチは、大規模な未ラベルデータを活用することで、この依存関係を緩和している。我々は,MNSP(Masked Next-Scale Prediction)を紹介した。
参考スコア（独自算出の注目度）: 11.805372246648957
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scene Text Recognition requires modeling visual structures that evolve from coarse layouts to fine-grained character strokes. Training such models relies on large amounts of annotated data. Recent self-supervised approaches, such as Masked Image Modeling (MIM), alleviate this dependency by leveraging large-scale unlabeled data. Yet most existing MIM methods operate at a single spatial scale and fail to capture the hierarchical nature of scene text. In this work, we introduce Masked Next-Scale Prediction (MNSP), a unified self-supervised framework designed to explicitly model cross-scale structural evolution. The framework incorporates Next-Scale Prediction (NSP), which learns hierarchical representations by predicting higher-resolution features from lower-resolution contexts. Naive scale prediction, however, tends to produce spatially diffuse attention, directing the model toward background regions rather than textual structures. MNSP resolves this limitation by jointly learning cross-scale prediction and masked image reconstruction. NSP captures global layout priors across resolutions, while masked reconstruction imposes strong local constraints that guide attention toward informative text regions. A Multi-scale Linguistic Alignment module further maintains semantic consistency across different resolutions. Extensive experiments demonstrate that MNSP achieves state-of-the-art performance, reaching 86.2\% average accuracy on the challenging Union14M benchmark and 96.7\% across six standard datasets. Additional analyses show that our method improves robustness under extreme scale and layout variations. Code is available at https://github.com/CzhczhcHczh/MNSP
Abstract（参考訳）: シーンテキスト認識は、粗いレイアウトからきめ細かい文字ストロークへと進化する視覚構造をモデル化する必要がある。このようなモデルのトレーニングは、大量の注釈付きデータに依存する。 Masked Image Modeling (MIM)のような近年の自己教師型アプローチは、大規模な未ラベルデータを活用することで、この依存関係を緩和している。しかし、既存のMIM手法の多くは単一の空間スケールで動作しており、シーンテキストの階層的な性質を捉えていない。本研究では,MNSP(Masked Next-Scale Prediction)を提案する。このフレームワークにはNext-Scale Prediction (NSP)が組み込まれており、低解像度のコンテキストから高解像度の特徴を予測することによって階層的な表現を学ぶ。しかし,Naive Scaleの予測は空間的に拡散した注意を生じさせる傾向があり,テキスト構造よりも背景領域にモデルを向ける傾向にある。 MNSPは、クロススケールな予測とマスクされた画像再構成を共同で学習することで、この制限を解消する。 NSPは、解像度を越えてグローバルなレイアウトの先行をキャプチャし、マスクされた再構築は、情報的なテキスト領域に注意を向ける強い局所的な制約を課している。 Multi-scale Linguistic Alignmentモジュールは、さまざまな解像度のセマンティック一貫性をさらに維持する。大規模な実験により、MNSPは最先端のパフォーマンスを達成し、挑戦的なUnion14Mベンチマークでは86.2\%、標準データセットでは96.7\%に達した。余分な解析により,過度なスケールとレイアウトの変動によるロバスト性の向上が示された。コードはhttps://github.com/CzhczhcHczh/MNSPで入手できる。

論文の概要: Masked Next-Scale Prediction for Self-supervised Scene Text Recognition

関連論文リスト