Fugu-MT Paper Translation (Abstract): CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model

Paper summary: CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model

arxiv url: http://arxiv.org/abs/2305.14014v1
Date: Tue, 23 May 2023 12:51:20 GMT
Status: Translation complete
Last updated in system: 2023-05-24 16:31:13.751364
Title: CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model
Title (reference translation): CLIP4STR: A Simple Baseline for Scene Text Recognition Using a Pre-trained Vision-Language Model
Authors: Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang
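Abstract summary: CLIP4STR is a simple yet effective scene text recognition method built upon the image and text encoders of CLIP. CLIP4STR achieves new state-of-the-art performance on 11 STR benchmarks.
Reference score (independently computed attention score): 67.21528544724546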
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Pre-trained vision-language models are the de-facto foundation models for various downstream tasks. However, this trend has not extended to the field of scene text recognition (STR), despite the potential of CLIP to serve as a powerful scene text reader. CLIP can robustly identify regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in natural images. With such merits, we introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP. It has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch provides an initial prediction based on the visual feature, and the cross-modal branch refines this prediction by addressing the discrepancy between the visual feature and text semantics. To fully leverage the capabilities of both branches, we design a dual predict-and-refine decoding scheme for inference. CLIP4STR achieves new state-of-the-art performance on 11 STR benchmarks. Additionally, a comprehensive empirical study is provided to enhance the understanding of the adaptation of CLIP to STR. We believe our method establishes a simple but strong baseline for future STR research with VL models.
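Abstract (reference translation): Pre-trained vision-language models are the de facto foundation models for a variety of downstream tasks. This trend, however, has not reached the field of scene text recognition (STR), even though CLIP has the potential to serve as a powerful scene text reader. CLIP can robustly identify both regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in natural images. Building on these merits, we introduce CLIP4STR, a simple yet effective STR method built upon the image and text encoders of CLIP. It has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch provides an initial prediction based on the visual feature, and the cross-modal branch refines this prediction by addressing the discrepancy between the visual feature and text semantics. To make full use of both branches, we design a dual predict-and-refine decoding scheme for inference. CLIP4STR achieves new state-of-the-art performance on 11 STR benchmarks. In addition, a comprehensive empirical study is provided to improve the understanding of how CLIP adapts to STR. We believe our method establishes a simple but strong baseline for future STR research with VL models.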
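
The two-branch flow described in the abstract (a visual branch makes an initial prediction, then a cross-modal branch refines it using text semantics) can be illustrated with a toy sketch. This is not the authors' implementation: `visual_branch`, `cross_modal_branch`, `predict_and_refine`, and `LEXICON` are hypothetical names, the "image" is just a string, and the CLIP-based encoder-decoder networks are replaced by plain functions so that only the control flow is shown.

```python
from typing import Callable, Tuple

# Hypothetical stand-ins. In the paper both branches are encoder-decoder
# networks built on CLIP; here they are plain functions so the control
# flow runs without any model weights.

LEXICON = {"HOTEL", "OPEN", "STOP"}  # assumed toy vocabulary


def visual_branch(image: str) -> str:
    """Toy visual reader: copies the glyphs but confuses 'O' with '0'."""
    return image.replace("O", "0")


def cross_modal_branch(image: str, initial: str) -> str:
    """Toy refinement step: uses text semantics (a lexicon lookup) to
    correct the visual guess. The image argument is unused in this toy."""
    candidate = initial.replace("0", "O")
    # Accept the correction only if it forms a known word.
    return candidate if candidate in LEXICON else initial


def predict_and_refine(
    image: str,
    predict: Callable[[str], str],
    refine: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Run the initial prediction, then refine it; return both results."""
    initial = predict(image)
    refined = refine(image, initial)
    return initial, refined


initial, refined = predict_and_refine("HOTEL", visual_branch, cross_modal_branch)
print(initial, "->", refined)
```

In this toy, the visual reader misreads the letter O as a zero and the lexicon-based refinement restores it; a string with no lexicon match, such as "R2D2", passes through unchanged.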
