Fugu-MT paper translation (abstract): The Character Error Vector: Decomposable errors for page-level OCR evaluation

Paper overview: The Character Error Vector: Decomposable errors for page-level OCR evaluation

arxiv url: http://arxiv.org/abs/2604.06160v1
Date: Tue, 07 Apr 2026 17:56:06 GMT
Status: Translation complete
Last updated in system: 2026-04-08 17:42:09.985781
Title: The Character Error Vector: Decomposable errors for page-level OCR evaluation
Authors: Jonathan Bourne, Mwiza Simbeye, Joseph Nockels
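Abstract summary: This paper introduces the Character Error Vector (CEV), a bag-of-characters evaluator for OCR. The CEV can be decomposed into parsing, OCR, and interaction error components. We validate the CEV's performance against other metrics.
Reference score (independently calculated attention score): 0.0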
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Character Error Rate (CER) is a key metric for evaluating the quality of Optical Character Recognition (OCR). However, this metric assumes that text has been perfectly parsed, which is often not the case. Under page-parsing errors, CER becomes undefined, limiting its use as a metric and making evaluating page-level OCR challenging, particularly when using data that do not share a labelling schema. We introduce the Character Error Vector (CEV), a bag-of-characters evaluator for OCR. The CEV can be decomposed into parsing, OCR, and interaction error components. This decomposability allows practitioners to focus on the part of the Document Understanding pipeline that will have the greatest impact on overall text extraction quality. The CEV can be implemented using a variety of methods, of which we demonstrate SpACER (Spatially Aware Character Error Rate) and a character distribution method using the Jensen-Shannon Distance. We validate the CEV's performance against other metrics: first, the relationship with CER; then, parse quality; and finally, as a direct measure of page-level OCR quality. The validation process shows that the CEV is a valuable bridge between parsing metrics and local metrics like CER. We analyse a dataset of archival newspapers made of degraded images with complex layouts and find that state-of-the-art end-to-end models are outperformed by more traditional pipeline approaches. Whilst the CEV requires character-level positioning for optimal triage, thresholding on easily available values can predict the main error source with an F1 of 0.91. We provide the CEV as part of a Python library to support Document Understanding research.
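The abstract mentions a character distribution method based on the Jensen-Shannon Distance. As a rough illustration of that general idea only, the sketch below compares the bag-of-characters of a reference page with that of an OCR output. It is not the authors' library or their CEV implementation: the function names `char_distribution`, `jensen_shannon_distance`, and `page_char_distance` are hypothetical, and ignoring whitespace is an assumption made here for simplicity.

```python
import math
from collections import Counter


def char_distribution(text: str) -> dict[str, float]:
    """Relative frequency of each non-whitespace character (assumption: whitespace is ignored)."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


def jensen_shannon_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Jensen-Shannon distance with base-2 logarithms, so the result lies in [0, 1]."""
    if not p and not q:
        return 0.0  # two empty pages are treated as identical
    if not p or not q:
        return 1.0  # one empty page is treated as maximally different
    divergence = 0.0
    for ch in set(p) | set(q):
        pi = p.get(ch, 0.0)
        qi = q.get(ch, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0:
            divergence += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            divergence += 0.5 * qi * math.log2(qi / mi)
    # Clamp to guard against tiny floating-point excursions outside [0, 1].
    return math.sqrt(min(max(divergence, 0.0), 1.0))


def page_char_distance(reference: str, hypothesis: str) -> float:
    """Distance between the character distributions of two page-level texts."""
    return jensen_shannon_distance(
        char_distribution(reference), char_distribution(hypothesis)
    )


if __name__ == "__main__":
    # Same characters in a different reading order: distance is zero.
    print(page_char_distance("column one column two", "column two column one"))
    # Character substitutions (l -> 1) shift the distribution: distance is positive.
    print(page_char_distance("hello world", "hel1o wor1d"))
```

Because a bag-of-characters comparison discards order, this kind of measure stays defined even when the reading order of the page is wrong, which is the situation the abstract describes as problematic for CER. By the same token, it cannot by itself distinguish parsing errors from recognition errors.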


Related papers