Fugu-MT Paper Translation (Abstract): DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Text Spotting

Paper overview: DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Text Spotting

arxiv url: http://arxiv.org/abs/2305.19957v1
Date: Wed, 31 May 2023 15:44:00 GMT
Status: Translation complete
Last updated in system: 2023-06-01 15:31:03.012025
Title: DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Text Spotting
Title (reference translation): DeepSolo++: Transformer Decoder with Explicit Points for Text Spotting
Authors: Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Tongliang Liu, Bo Du, Dacheng Tao
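Abstract summary: DeepSolo is a simple DETR-like baseline that lets a single decoder with explicit points perform text detection and recognition simultaneously and efficiently. DeepSolo not only performs well in English scenes but also masters Chinese transcription, with its complex font structures and thousand-level character classes. The authors launch DeepSolo++ for multilingual text spotting, letting a Transformer decoder with explicit points solo for multilingual text detection, recognition, and script identification all at once.
Reference score (independently computed attention metric): 129.73247700864385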
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Dealing with the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. Although Transformer-based methods eliminate the heuristic post-processing, they still suffer from the synergy issue between the sub-tasks and low training efficiency. In this paper, we present DeepSolo, a simple DETR-like baseline that lets a single decoder with explicit points solo for text detection and recognition simultaneously and efficiently. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing through a single decoder, the point queries have encoded the requisite text semantics and locations. Furthermore, we show the surprisingly good extensibility of our method, in terms of character class, language type, and task. On the one hand, DeepSolo not only performs well in English scenes but also masters Chinese transcription with complex font structures and thousand-level character classes. On the other hand, based on the extensibility of DeepSolo, we launch DeepSolo++ for multilingual text spotting, making a further step to let a Transformer decoder with explicit points solo for multilingual text detection, recognition, and script identification all at once. Extensive experiments on public benchmarks demonstrate that our simple approach achieves better training efficiency compared with Transformer-based models and outperforms the previous state-of-the-art. In addition, DeepSolo and DeepSolo++ are also compatible with line annotations, which require much less annotation cost than polygons. The code is available at https://github.com/ViTAE-Transformer/DeepSolo.
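Abstract (reference translation): End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Handling the relationship between the two sub-tasks plays an important role in designing effective spotters. Transformer-based methods eliminate heuristic post-processing, but they still suffer from the synergy issue between the sub-tasks and from low training efficiency. This paper presents DeepSolo, a simple DETR-like baseline in which a single decoder with explicit points performs text detection and recognition simultaneously and efficiently. Technically, for each text instance, the character sequence is represented as ordered points and modeled with learnable explicit point queries. After passing through a single decoder, the point queries encode the required text semantics and locations. The method also shows surprisingly good extensibility in terms of character class, language type, and task. On the one hand, DeepSolo not only performs well in English scenes but also masters Chinese transcription with complex font structures and thousand-level character classes. On the other hand, building on the extensibility of DeepSolo, the authors launch DeepSolo++ for multilingual text spotting, going a step further so that a Transformer decoder with explicit points handles multilingual text detection, recognition, and script identification all at once. Extensive experiments on public benchmarks show that the approach achieves better training efficiency than Transformer-based models and outperforms the previous state of the art. In addition, DeepSolo and DeepSolo++ are compatible with line annotations, which cost much less to produce than polygons. The code is available at https://github.com/ViTAE-Transformer/DeepSolo.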
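The abstract states that each text instance's character sequence is represented as ordered points, and that line annotations are supported. The sketch below is only an illustration of that idea, not the authors' implementation: it assumes the text's center line is given as a polyline and samples a fixed number of points along it at equal arc-length spacing. The function name `sample_ordered_points` and the polyline input format are hypothetical; the abstract does not say how the points are actually obtained.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def sample_ordered_points(center_line: List[Point], num_points: int) -> List[Point]:
    """Sample num_points along a polyline at equal arc-length spacing.

    Hypothetical helper for illustration only. The returned points are
    ordered from the first vertex of the line to the last.
    """
    if num_points < 2 or len(center_line) < 2:
        raise ValueError("need at least 2 output points and 2 polyline vertices")

    # Cumulative arc length at each polyline vertex.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(center_line, center_line[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    if total == 0.0:
        raise ValueError("center line has zero length")

    points: List[Point] = []
    seg = 0  # index of the segment that contains the current target length
    for i in range(num_points):
        target = total * i / (num_points - 1)
        # Advance to the segment whose end is at or beyond the target.
        while seg < len(center_line) - 2 and cum[seg + 1] < target:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        t = 0.0 if seg_len == 0.0 else (target - cum[seg]) / seg_len
        (x0, y0), (x1, y1) = center_line[seg], center_line[seg + 1]
        # Linear interpolation inside the segment.
        points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return points


if __name__ == "__main__":
    # An L-shaped line of total length 8, sampled at 5 ordered points.
    print(sample_ordered_points([(0, 0), (4, 0), (4, 4)], 5))
```

In a DETR-style spotter, ordered points like these could serve as reference locations for the point queries; how DeepSolo actually initializes and refines them is described in the paper and repository, not here.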

Related papers