Fugu-MT Paper Translation (Summary): Context Perception Parallel Decoder for Scene Text Recognition

Paper overview: Context Perception Parallel Decoder for Scene Text Recognition

arxiv url: http://arxiv.org/abs/2307.12270v1
Date: Sun, 23 Jul 2023 09:04:13 GMT
Status: Translation complete
Last updated in system: 2023-07-25 17:00:42.470330
Title: Context Perception Parallel Decoder for Scene Text Recognition
Authors: Yongkun Du and Zhineng Chen and Caiyan Jia and Xiaoting Yin and Chenxia Li and Yuning Du and Yu-Gang Jiang
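Abstract summary: Scene text recognition methods have struggled to achieve both high accuracy and fast inference speed. In this work, we propose the Context Perception Parallel Decoder (CPPD) to decode a character sequence in a single parallel decoding (PD) pass. CPPD models achieve highly competitive accuracy. Moreover, they run approximately 7x faster than autoregressive (AR) models and are among the fastest recognizers.
Reference score (site-computed attention score): 46.48087342632804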
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scene text recognition (STR) methods have struggled to attain both high accuracy and fast inference speed. Autoregressive (AR)-based STR models use the previously recognized characters to decode the next character iteratively. They show superiority in terms of accuracy, but this iteration also makes inference slow. Alternatively, parallel decoding (PD)-based STR models infer all the characters in a single decoding pass. They have advantages in inference speed but lower accuracy, as it is difficult to build a robust recognition context in such a pass. In this paper, we first present an empirical study of AR decoding in STR. In addition to constructing a new AR model with top accuracy, we find that the success of the AR decoder also lies in providing guidance on visual context perception, rather than in language modeling as claimed in existing studies. Consequently, we propose the Context Perception Parallel Decoder (CPPD) to decode the character sequence in a single PD pass. CPPD devises a character counting module and a character ordering module. Given a text instance, the former infers the occurrence count of each character, while the latter deduces the character reading order and placeholders. Together with the character prediction task, they construct a context that robustly tells what the character sequence is and where the characters appear, closely mimicking the context conveyed by AR decoding. Experiments on both English and Chinese benchmarks demonstrate that CPPD models achieve highly competitive accuracy. Moreover, they run approximately 7x faster than their AR counterparts and are among the fastest recognizers. The code will be released soon.
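The abstract says the counting module infers how often each character occurs and the ordering module deduces the reading order plus placeholders. The abstract does not give the exact target format, so the sketch below is only an illustration under assumptions: the names count_targets, order_targets, PLACEHOLDER and max_len are hypothetical, and a fixed-length, placeholder-padded slot layout is assumed because a single parallel pass must predict every slot at once.

```python
from __future__ import annotations

from collections import Counter

# Hypothetical placeholder symbol for unused slots (an assumption, not from the paper).
PLACEHOLDER = "<pad>"


def count_targets(label: str, charset: str) -> list[int]:
    """Counting-style target: occurrence count of each charset character in the label."""
    counts = Counter(label)
    return [counts.get(ch, 0) for ch in charset]


def order_targets(label: str, max_len: int) -> list[str]:
    """Ordering-style target: characters in reading order, padded to max_len.

    A parallel decoder fills all max_len slots in one pass, so every slot
    needs a target: the real characters first, then placeholders.
    """
    if len(label) > max_len:
        raise ValueError(f"label longer than max_len={max_len}")
    return list(label) + [PLACEHOLDER] * (max_len - len(label))


if __name__ == "__main__":
    print(count_targets("hello", "ehlo"))      # [1, 1, 2, 1]
    print(order_targets("hello", max_len=8))   # h, e, l, l, o, then 3 placeholders
```

Together, the two targets say "what" (which characters, how many) and "where" (which slot), which is the kind of context the abstract attributes to AR decoding; how CPPD actually encodes and supervises them should be checked against the paper and its released code.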

Related papers

Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison [27.44915531637358]
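This study compares the performance of dense feature prepending (DFP) and cross-attention architectures. Although DFP is widely adopted, our results do not indicate that DFP is superior to cross-attention.
Reference translation (metadata) (2025-01-04T20:14:16Z)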
SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition [77.28814034644287]
We propose SVTRv2, a CTC model. SVTRv2 introduces new upgrades to handle text irregularity and to exploit linguistic context. We evaluate SVTRv2 on both standard and recent benchmarks.
Reference translation (metadata) (2024-11-24T14:21:35Z)
General Detection-based Text Line Recognition [15.761142324480165]
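We introduce a general detection-based approach to text line recognition, for both printed (OCR) and handwritten (HTR) text. Our method builds on a paradigm entirely different from state-of-the-art HTR methods, which rely on autoregressive decoding. We improve state-of-the-art performance for Chinese script recognition on the CASIA v2 dataset and for cipher recognition on the Borg and Copiale datasets.
Reference translation (metadata) (2024-09-25T17:05:55Z)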
When Counting Meets HMER: Counting-Aware Network for Handwritten Mathematical Expression Recognition [57.51793420986745]
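We propose the Counting-Aware Network (CAN), an unconventional network for handwritten mathematical expression recognition (HMER). We design a weakly supervised counting module that can predict the number of each symbol class without symbol-level position annotations. Experiments on HMER benchmark datasets verify that both joint optimization and the counting results help correct the prediction errors of encoder-decoder models.
Reference translation (metadata) (2022-07-23T08:39:32Z)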
Scene Text Recognition with Permuted Autoregressive Sequence Models [15.118059441365343]
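Context-aware STR methods typically use internal autoregressive (AR) language models (LMs). The proposed method, PARSeq, learns an ensemble of internal AR LMs with shared weights using permutation language modeling. It supports context-free non-AR inference and context-aware AR inference, as well as iterative refinement using bidirectional context.
Reference translation (metadata) (2022-07-14T14:51:50Z)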
Rapid Person Re-Identification via Sub-space Consistency Regularization [51.76876061721556]
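Person Re-Identification (ReID) matches pedestrians across disjoint cameras. Existing ReID methods that use real-valued feature descriptors achieve high accuracy but are inefficient because Euclidean distance computation is slow. This paper proposes the Sub-space Consistency Regularization (SCR) algorithm, which speeds up the ReID procedure by 0.25 times.
Reference translation (metadata) (2022-07-13T02:44:05Z)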
SVTR: Scene Text Recognition with a Single Visual Model [44.26135584093631]
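We propose a single visual model for scene text recognition within a patch-wise image tokenization framework. The method, called SVTR, first decomposes an image text into small patches. Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR.
Reference translation (metadata) (2022-04-30T04:37:01Z)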
Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
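We propose Fast-MD, a fast multi-decoder (MD) model that generates hidden intermediates (HI) by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs, followed by an ASR decoder. Fast-MD achieves 2x and 4x faster decoding than the "naive MD model" on GPU and CPU, respectively.
Reference translation (metadata) (2021-09-27T05:21:30Z)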
Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
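We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once). The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
Reference translation (metadata) (2021-02-15T15:18:59Z)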
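Two of the related entries (SVTRv2 and Fast-MD) build on CTC outputs, which, like CPPD, are decoded in a single parallel pass rather than character by character. As a reminder of how such an output becomes a string, here is a minimal sketch of standard greedy (best-path) CTC decoding. It is a generic textbook illustration, not code from any listed paper; the blank index 0 and the class-to-character mapping via `alphabet` are assumptions.

```python
from __future__ import annotations

BLANK = 0  # assumed index of the CTC blank class (a common convention)


def ctc_greedy_decode(frame_scores: list[list[float]], alphabet: str) -> str:
    """Greedy (best-path) CTC decoding.

    1. take the arg-max class at every frame,
    2. merge consecutive repeats,
    3. drop blanks.
    Class k > 0 is assumed to map to alphabet[k - 1].
    """
    best_path = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    out = []
    prev = None
    for k in best_path:
        # Emit only when the class changes and is not blank; a blank between two
        # identical classes therefore keeps them as two separate characters.
        if k != prev and k != BLANK:
            out.append(alphabet[k - 1])
        prev = k
    return "".join(out)


if __name__ == "__main__":
    # Frame arg-maxes: a, a, blank, a, b, b  ->  "aab"
    scores = [
        [0.1, 0.8, 0.1],
        [0.1, 0.8, 0.1],
        [0.9, 0.05, 0.05],
        [0.2, 0.7, 0.1],
        [0.1, 0.2, 0.7],
        [0.1, 0.1, 0.8],
    ]
    print(ctc_greedy_decode(scores, "ab"))
```

All frames are scored at once, so the only sequential work is this cheap collapse step; that is the speed advantage the PD-style models above trade against context modeling.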

The related-paper list is generated automatically from the titles and abstracts of papers on this site.

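The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from its use.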