CLEval: Character-Level Evaluation for Text Detection and Recognition
Tasks
- URL: http://arxiv.org/abs/2006.06244v1
- Date: Thu, 11 Jun 2020 08:12:39 GMT
- Title: CLEval: Character-Level Evaluation for Text Detection and Recognition
Tasks
- Authors: Youngmin Baek, Daehyun Nam, Sungrae Park, Junyeop Lee, Seung Shin,
Jeonghun Baek, Chae Young Lee, Hwalsuk Lee
- Abstract summary: Existing evaluation metrics fail to provide a fair and reliable comparison among text detection and recognition methods.
Based on the fact that the character is a key element of text, we propose a Character-Level Evaluation metric (CLEval).
CLEval provides a fine-grained evaluation of end-to-end results, covering both detection and recognition, as well as individual evaluations of each module from an end-performance perspective.
- Score: 18.25936871944743
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the recent success of text detection and recognition methods,
existing evaluation metrics fail to provide a fair and reliable comparison
among those methods. In addition, there exists no end-to-end evaluation metric
that takes the characteristics of OCR tasks into account. Previous end-to-end
metrics contain cascaded errors from the binary scoring process applied in both
the detection and recognition tasks. Ignoring partially correct results widens
the gap between quantitative and qualitative analysis and prevents fine-grained
assessment. Based on the fact that the character is a key element of text, we
propose a Character-Level Evaluation metric (CLEval). In CLEval, the
instance matching process handles split and merge detection cases, and the
scoring process conducts character-level evaluation. By aggregating
character-level scores, the CLEval metric provides a fine-grained evaluation of
end-to-end results, covering both detection and recognition, as well as
individual evaluations of each module from an end-performance perspective. We
believe that our metrics can play a key role in developing and
analyzing state-of-the-art text detection and recognition methods. The
evaluation code is publicly available at https://github.com/clovaai/CLEval.
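To make the character-level idea concrete, the sketch below shows how character-level recall, precision, and their harmonic mean could be aggregated once detections have been matched to ground-truth instances. This is only a simplified illustration, not the official CLEval implementation (see the repository above): the longest-common-subsequence correct-character count and the pre-computed one-to-one `matches` list are simplifying assumptions, and the paper's split/merge instance matching and granularity penalties are omitted.
```python
# Minimal, illustrative sketch of character-level scoring (NOT the official
# CLEval code; see https://github.com/clovaai/CLEval for the real metric).
# Assumes detections have already been matched to ground-truth instances.

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, used here as a stand-in
    for the number of correctly recognized characters."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def character_level_scores(matches, gt_texts, pred_texts):
    """matches: list of (gt_index, pred_index) pairs produced by some
    instance-matching step; gt_texts / pred_texts: lists of strings."""
    correct = sum(lcs_length(gt_texts[g], pred_texts[p]) for g, p in matches)
    total_gt_chars = sum(len(t) for t in gt_texts)
    total_pred_chars = sum(len(t) for t in pred_texts)
    recall = correct / total_gt_chars if total_gt_chars else 0.0
    precision = correct / total_pred_chars if total_pred_chars else 0.0
    hmean = (2 * recall * precision / (recall + precision)) if (recall + precision) else 0.0
    return recall, precision, hmean


# Example: one ground-truth word, one prediction with a single wrong character.
r, p, h = character_level_scores([(0, 0)], ["RIVERSIDE"], ["RIVERS1DE"])
print(f"recall={r:.2f} precision={p:.2f} hmean={h:.2f}")  # partial credit, not 0/1
```
In this toy example the nearly correct prediction receives partial credit (about 0.89) instead of the all-or-nothing score a binary, instance-level metric would assign, which is the gap the abstract highlights.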
Related papers
- Using Similarity to Evaluate Factual Consistency in Summaries [2.7595794227140056]
Abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed.
We propose a new zero-shot factuality evaluation metric, Sentence-BERTScore (SBERTScore), which compares sentences between the summary and the source document.
Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries.
arXiv Detail & Related papers (2024-09-23T15:02:38Z) - Check-Eval: A Checklist-based Approach for Evaluating Text Quality [3.031375888004876]
Check-Eval can be employed as both a reference-free and reference-dependent evaluation method.
Check-Eval achieves higher correlations with human judgments compared to existing metrics.
arXiv Detail & Related papers (2024-07-19T17:14:16Z) - Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z) - Rethinking Evaluation Metrics of Open-Vocabulary Segmentaion [78.76867266561537]
The evaluation process still heavily relies on closed-set metrics without considering the similarity between predicted and ground truth categories.
To tackle this issue, we first survey eleven similarity measurements between two categorical words.
We design novel evaluation metrics, namely Open mIoU, Open AP, and Open PQ, tailored for three open-vocabulary segmentation tasks.
arXiv Detail & Related papers (2023-11-06T18:59:01Z) - MISMATCH: Fine-grained Evaluation of Machine-generated Text with
Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types.
We show that the mismatch errors between the sentence pairs on the held-out datasets from 7 NLP tasks align well with the human evaluation.
arXiv Detail & Related papers (2023-06-18T01:38:53Z) - APPLS: Evaluating Evaluation Metrics for Plain Language Summarization [18.379461020500525]
This study introduces a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for Plain Language Summarization (PLS).
We identify four PLS criteria from previous work and define a set of perturbations corresponding to these criteria that sensitive metrics should be able to detect.
Using APPLS, we assess performance of 14 metrics, including automated scores, lexical features, and LLM prompt-based evaluations.
arXiv Detail & Related papers (2023-05-23T17:59:19Z) - Evaluating Factual Consistency of Texts with Semantic Role Labeling [3.1776833268555134]
We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind.
A final factuality score is computed by an adjustable scoring mechanism.
Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T17:59:42Z) - End-to-End Page-Level Assessment of Handwritten Text Recognition [69.55992406968495]
HTR systems increasingly face the end-to-end page-level transcription of a document.
Standard metrics do not take into account the inconsistencies that might appear.
We propose a two-fold evaluation, where the transcription accuracy and the reading order (RO) goodness are considered separately.
arXiv Detail & Related papers (2023-01-14T15:43:07Z) - SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that system-level correlations of our proposed metric with a model-based matching function outperforms all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z) - TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets, we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z) - Perception Score, A Learned Metric for Open-ended Text Generation
Evaluation [62.7690450616204]
We propose a novel and powerful learning-based evaluation metric: Perception Score.
The method measures the overall quality of the generation and scores holistically, instead of focusing on only one evaluation criterion, such as word overlap.
arXiv Detail & Related papers (2020-08-07T10:48:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.