Related papers: Evaluation of HTR models without Ground Truth Material

Evaluation of HTR models without Ground Truth Material

URL: http://arxiv.org/abs/2201.06170v1
Date: Mon, 17 Jan 2022 01:26:09 GMT
Title: Evaluation of HTR models without Ground Truth Material
Authors: Phillip Benjamin Str\"obel, Simon Clematide, Martin Volk, Raphael Schwitter, Tobias Hodel, David Schoch
Abstract summary: evaluation of Handwritten Text Recognition models during their development is straightforward. But the evaluation process becomes tricky as soon as we switch from development to application. We show that lexicon-based evaluation can compete with lexicon-based methods.
Score: 2.4792948967354236
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The evaluation of Handwritten Text Recognition (HTR) models during their development is straightforward: because HTR is a supervised problem, the usual data split into training, validation, and test data sets allows the evaluation of models in terms of accuracy or error rates. However, the evaluation process becomes tricky as soon as we switch from development to application. A compilation of a new (and forcibly smaller) ground truth (GT) from a sample of the data that we want to apply the model on and the subsequent evaluation of models thereon only provides hints about the quality of the recognised text, as do confidence scores (if available) the models return. Moreover, if we have several models at hand, we face a model selection problem since we want to obtain the best possible result during the application phase. This calls for GT-free metrics to select the best model, which is why we (re-)introduce and compare different metrics, from simple, lexicon-based to more elaborate ones using standard language models and masked language models (MLM). We show that MLM-based evaluation can compete with lexicon-based methods, with the advantage that large and multilingual transformers are readily available, thus making compiling lexical resources for other metrics superfluous.

Related papers

Using (Not so) Large Language Models for Generating Simulation Models in a Formal DSL -- A Study on Reaction Networks [0.0]
We evaluate how a Large Language Model might be used for formalizing natural language into simulation models. We develop a synthetic data generator to serve as the basis for fine-tuning and evaluation. Our evaluation shows that our fine-tuned Mistral model can recover the ground truth simulation model in up to 84.5% of cases.
arXiv Detail & Related papers (2025-03-03T15:48:01Z)
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
OLMES: A Standard for Language Model Evaluations [64.85905119836818]
We propose OLMES, a practical, open standard for reproducible language model evaluations. We identify and review the varying factors in evaluation practices adopted by the community. OLMES supports meaningful comparisons between smaller base models that require the unnatural "cloze" formulation of multiple-choice questions.
arXiv Detail & Related papers (2024-06-12T17:37:09Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score. Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
BLESS: Benchmarking Large Language Models on Sentence Simplification [55.461555829492866]
We present BLESS, a performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS) We assess a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting. Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines.
arXiv Detail & Related papers (2023-10-24T12:18:17Z)
Adapting Large Language Models for Content Moderation: Pitfalls in Data Engineering and Supervised Fine-tuning [79.53130089003986]
Large Language Models (LLMs) have become a feasible solution for handling tasks in various domains. In this paper, we introduce how to fine-tune a LLM model that can be privately deployed for content moderation.
arXiv Detail & Related papers (2023-10-05T09:09:44Z)
The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations. We study the impact of labeled data through in-context learning and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
Learning Evaluation Models from Large Language Models for Sequence Generation [61.8421748792555]
We propose a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development. Experimental results on the SummEval benchmark demonstrate that CSEM can effectively train an evaluation model without human-labeled data.
arXiv Detail & Related papers (2023-08-08T16:41:16Z)
TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models [20.09470051458651]
We introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries. Unlike prior work, TrueTeacher does not rely on human-written summaries, and is multilingual by nature.
arXiv Detail & Related papers (2023-05-18T17:58:35Z)
Evaluating Representations with Readout Model Switching [19.907607374144167]
In this paper, we propose to use the Minimum Description Length (MDL) principle to devise an evaluation metric. We design a hybrid discrete and continuous-valued model space for the readout models and employ a switching strategy to combine their predictions. The proposed metric can be efficiently computed with an online method and we present results for pre-trained vision encoders of various architectures.
arXiv Detail & Related papers (2023-02-19T14:08:01Z)
Efficient Training of Language Models to Fill in the Middle [17.118891860985123]
We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset. We use these ablations to prescribe strong default settings and best practices to train FIM models. We have released our best infilling model trained with best practices in our API, and release our infilling benchmarks to aid future research.
arXiv Detail & Related papers (2022-07-28T17:40:47Z)
Code to Comment Translation: A Comparative Study on Model Effectiveness & Errors [19.653423881863834]
Machine translation models are employed to "translate" code snippets into relevant natural language descriptions. Most evaluations of such models are conducted using automatic reference-based metrics. We compare three recently proposed source code summarization models based on the smoothed BLEU-4, METEOR, and ROUGE-L machine translation metrics. Our investigation reveals new insights into the relationship between metric-based performance and model prediction errors grounded in an empirically derived error taxonomy.
arXiv Detail & Related papers (2021-06-15T20:13:14Z)
Coarse-to-Fine Memory Matching for Joint Retrieval and Classification [0.7081604594416339]
We present a novel end-to-end language model for joint retrieval and classification. We evaluate it on the standard blind test set of the FEVER fact verification dataset. We extend exemplar auditing to this setting for analyzing and constraining the model.
arXiv Detail & Related papers (2020-11-29T05:06:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.