End-to-End Page-Level Assessment of Handwritten Text Recognition
- URL: http://arxiv.org/abs/2301.05935v2
- Date: Sun, 21 May 2023 07:41:53 GMT
- Title: End-to-End Page-Level Assessment of Handwritten Text Recognition
- Authors: Enrique Vidal, Alejandro H. Toselli, Antonio Ríos-Vila, Jorge Calvo-Zaragoza
- Abstract summary: HTR systems increasingly face the end-to-end page-level transcription of a document.
Standard metrics do not account for the layout and reading-order inconsistencies that can arise in this setting.
We propose a two-fold evaluation in which transcription accuracy and reading-order (RO) goodness are assessed separately.
- Score: 69.55992406968495
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The evaluation of Handwritten Text Recognition (HTR) systems has
traditionally used metrics based on the edit distance between HTR and ground
truth (GT) transcripts, at both the character and word levels. This is very
adequate when the experimental protocol assumes that both GT and HTR text lines
are the same, which allows edit distances to be independently computed for each
given line. Driven by recent advances in pattern recognition, HTR systems
increasingly face the end-to-end page-level transcription of a document, where
the precision of locating the different text lines and their corresponding
reading order (RO) play a key role. In such a case, the standard metrics do not
take into account the inconsistencies that might appear. In this paper, the
problem of evaluating HTR systems at the page level is introduced in detail. We
analyse the convenience of using a two-fold evaluation, where the transcription
accuracy and the RO goodness are considered separately. Different alternatives
are proposed, analysed and empirically compared both through partially
simulated and through real, full end-to-end experiments. Results support the
validity of the proposed two-fold evaluation approach. An important conclusion
is that such an evaluation can be adequately achieved by just two simple and
well-known metrics: the Word Error Rate (WER), which takes transcription
sequentiality into account, and the here re-formulated Bag of Words Word Error
Rate (bWER), which ignores order. While the latter directly and very accurately
assesses intrinsic word recognition errors, the difference between the two metrics
gracefully correlates with the Normalised Spearman's Foot Rule Distance (NSFD),
a metric which explicitly measures RO errors associated with layout analysis
flaws.
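The interplay between the two metrics can be illustrated with a short sketch. The code below implements the standard word-level Levenshtein WER, a bag-of-words WER based on multiset intersection, and a Spearman's footrule distance normalised by floor(n^2/2); these are common textbook formulations and may differ in detail from the paper's exact re-formulation of bWER and its NSFD definition. In the example, every word is recognised correctly but the text is concatenated in the wrong reading order, so WER is high while bWER is zero and their difference isolates the RO problem.

```python
# Minimal sketch of the two-fold page-level evaluation: an order-sensitive WER,
# an order-agnostic bag-of-words WER, and a normalised Spearman's footrule
# distance for reading order. These are common textbook formulations and may
# differ in detail from the paper's exact definitions of bWER and NSFD.
from collections import Counter

def wer(ref_words, hyp_words):
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    n, m = len(ref_words), len(hyp_words)
    prev = list(range(m + 1))                     # edit distances for ref[:0]
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            curr[j] = min(prev[j] + 1,            # deletion
                          curr[j - 1] + 1,        # insertion
                          prev[j - 1] + cost)     # substitution / match
        prev = curr
    return prev[m] / n

def bwer(ref_words, hyp_words):
    """Bag-of-words WER: multiset mismatches / reference length, order ignored."""
    matched = sum((Counter(ref_words) & Counter(hyp_words)).values())
    return (max(len(ref_words), len(hyp_words)) - matched) / len(ref_words)

def nsfd(ref_order, hyp_order):
    """Normalised Spearman's footrule distance between two orderings of the
    same items (e.g. text lines); normalised by its maximum, floor(n^2 / 2)."""
    pos = {item: i for i, item in enumerate(hyp_order)}
    n = len(ref_order)
    return sum(abs(i - pos[item]) for i, item in enumerate(ref_order)) / ((n * n) // 2)

# Every word is recognised, but the two halves are read in the wrong order.
ref = "the quick brown fox jumps over the lazy dog".split()
hyp = "jumps over the lazy dog the quick brown fox".split()
w, b = wer(ref, hyp), bwer(ref, hyp)
print(f"WER  = {w:.3f}")                          # high: sequence order is wrong
print(f"bWER = {b:.3f}")                          # 0.0: all words recognised
print(f"WER - bWER = {w - b:.3f}  (residual attributable to RO errors)")
print(f"NSFD = {nsfd(['l1', 'l2', 'l3'], ['l2', 'l1', 'l3']):.3f}")
```

In a full page-level experiment, ref and hyp would be the whole GT and HTR page transcripts, and the footrule distance would be computed over detected text lines after matching them to GT lines; the WER - bWER gap then serves as a simple proxy for the RO quality that NSFD measures explicitly.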
Related papers
- Localizing Factual Inconsistencies in Attributable Text Generation [91.981439746404]
We introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation.
We first demonstrate the effectiveness of the QASemConsistency methodology for human annotation.
We then implement several methods for automatically detecting localized factual inconsistencies.
arXiv Detail & Related papers (2024-10-09T22:53:48Z) - Using Similarity to Evaluate Factual Consistency in Summaries [2.7595794227140056]
Abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed.
We propose a new zero-shot factuality evaluation metric, Sentence-BERTScore (SBERTScore), which compares sentences between the summary and the source document.
Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries.
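The underlying idea, scoring each summary sentence by its similarity to the source sentences via Sentence-BERT embeddings, can be sketched as follows; the encoder choice and the aggregation of per-sentence scores are illustrative assumptions rather than the exact SBERTScore protocol.

```python
# Rough sketch of a sentence-similarity factuality check in the spirit of
# SBERTScore: embed summary and source sentences with Sentence-BERT and score
# each summary sentence by its best match in the source. The model choice and
# the final aggregation (minimum over summary sentences) are illustrative
# assumptions, not the paper's exact protocol.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence encoder works

def sbert_factuality(summary_sents, source_sents):
    summ_emb = model.encode(summary_sents, convert_to_tensor=True)
    src_emb = model.encode(source_sents, convert_to_tensor=True)
    sims = util.cos_sim(summ_emb, src_emb)        # [n_summary, n_source]
    per_sentence = sims.max(dim=1).values         # best-supported source sentence
    return per_sentence.min().item()              # weakest summary sentence

source = ["The company reported a 10% rise in revenue.",
          "Its CEO announced a new product line for 2024."]
print(sbert_factuality(["Revenue grew by 10%."], source))               # closer to 1.0
print(sbert_factuality(["The company filed for bankruptcy."], source))  # noticeably lower
```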
arXiv Detail & Related papers (2024-09-23T15:02:38Z) - Aligning Speakers: Evaluating and Visualizing Text-based Diarization
Using Efficient Multiple Sequence Alignment (Extended Version) [21.325463387256807]
Two new metrics are proposed, Text-based Diarization Error Rate and Diarization F1, which perform utterance- and word-level evaluations.
Our metrics encompass more types of errors compared to existing ones, allowing us to make a more comprehensive analysis in speaker diarization.
arXiv Detail & Related papers (2023-09-14T12:43:26Z) - Evaluating Factual Consistency of Texts with Semantic Role Labeling [3.1776833268555134]
We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind.
A final factuality score is computed by an adjustable scoring mechanism.
Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T17:59:42Z) - Rethink about the Word-level Quality Estimation for Machine Translation
from Human Judgement [57.72846454929923]
We create a benchmark dataset, HJQE, in which expert translators directly annotate poorly translated words.
We propose two tag correcting strategies, namely a tag refinement strategy and a tree-based annotation strategy, to make the TER-based artificial QE corpus closer to HJQE.
The results show our proposed dataset is more consistent with human judgement and also confirm the effectiveness of the proposed tag correcting strategies.
arXiv Detail & Related papers (2022-09-13T02:37:12Z) - Factual Error Correction for Abstractive Summaries Using Entity
Retrieval [57.01193722520597]
We propose RFEC, an efficient factual error correction system based on an entity-retrieval post-editing process.
RFEC retrieves the evidence sentences from the original document by comparing the sentences with the target summary.
Next, RFEC detects the entity-level errors in the summaries by considering the evidence sentences and substitutes the wrong entities with the accurate entities from the evidence sentences.
arXiv Detail & Related papers (2022-04-18T11:35:02Z) - GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidence in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z) - TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets, we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z) - Cross-domain Speech Recognition with Unsupervised Character-level
Distribution Matching [60.8427677151492]
We propose CMatch, a character-level distribution matching method that performs fine-grained adaptation for each character across two domains.
Experiments on the Libri-Adapt dataset show that our approach achieves 14.39% and 16.50% relative Word Error Rate (WER) reductions on cross-device and cross-environment ASR, respectively.
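One generic way to realise such character-level distribution matching is to pool the features assigned to each character in the source and target domains and penalise a discrepancy measure between the two sets; the sketch below uses maximum mean discrepancy (MMD) with an RBF kernel as an illustrative stand-in rather than the paper's exact training objective.

```python
# Generic sketch of character-level distribution matching in the spirit of
# CMatch: pool the frame-level features assigned to each character in the
# source and target domains and penalise their maximum mean discrepancy (MMD,
# RBF kernel), character by character. Feature grouping, kernel bandwidth and
# the plain-NumPy setting are illustrative assumptions, not the paper's exact
# training objective.
import numpy as np

def rbf_mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between samples x [n, d] and y [m, d]."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def char_match_loss(src_feats, tgt_feats):
    """src_feats / tgt_feats: dict mapping a character to an [n, d] feature array."""
    shared = src_feats.keys() & tgt_feats.keys()
    return sum(rbf_mmd2(src_feats[c], tgt_feats[c]) for c in shared) / len(shared)

rng = np.random.default_rng(0)
src = {"a": rng.normal(0.0, 1.0, (32, 8)), "b": rng.normal(2.0, 1.0, (32, 8))}
tgt = {"a": rng.normal(0.5, 1.0, (32, 8)), "b": rng.normal(2.5, 1.0, (32, 8))}
print(f"character-level MMD loss: {char_match_loss(src, tgt):.4f}")
```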
arXiv Detail & Related papers (2021-04-15T14:36:54Z) - BlonD: An Automatic Evaluation Metric for Document-level
Machine Translation [47.691277066346665]
We propose BlonD, an automatic metric for document-level machine translation evaluation.
BlonD takes discourse coherence into consideration by calculating the recall and distance of check-pointing phrases and tags.
arXiv Detail & Related papers (2021-03-22T14:14:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.