Improving Applicability of Deep Learning based Token Classification models during Training
- URL: http://arxiv.org/abs/2504.01028v1
- Date: Fri, 28 Mar 2025 17:01:19 GMT
- Title: Improving Applicability of Deep Learning based Token Classification models during Training
- Authors: Anket Mehra, Malte Prieß, Marian Himstedt
- Abstract summary: We show that classification metrics, represented by the F1-Score, are insufficient for evaluating the applicability of machine learning models in practice. We introduce a novel metric, Document Integrity Precision (DIP), as a solution for visual document understanding and the token classification task.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper shows that additional evaluation metrics are needed during model training to decide whether a model is applicable in inference. As an example, a LayoutLM-based model is trained for token classification in documents, specifically German receipts. We show that conventional classification metrics, represented by the F1-Score in our experiments, are insufficient for evaluating the applicability of machine learning models in practice. To address this problem, we introduce a novel metric, Document Integrity Precision (DIP), as a solution for visual document understanding and the token classification task. To the best of our knowledge, nothing comparable has been introduced in this context. DIP is a rigorous metric describing how many documents in the test dataset require manual intervention. It enables AI researchers and software developers to investigate in depth the level of process automation achievable in business software. To validate DIP, we conduct experiments with our models to highlight and analyze its impact and relevance for deciding whether a model should be deployed under different training settings. Our results demonstrate that existing metrics barely change under isolated model impairments, whereas DIP indicates that the model would require substantial human intervention in deployment. The larger the set of entities being predicted, the less sensitive conventional metrics become, which entails poor automation quality. DIP, in contrast, remains a single value that can be interpreted for the entire entity set. This highlights the importance of metrics that focus on the business task when training models for production. Since DIP is designed for the token classification task, more research is needed to find suitable metrics for other training tasks.
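The abstract does not state the exact formula for DIP, but its description ("how many documents of the test dataset require manual interventions") suggests a document-level precision: the fraction of test documents whose predicted entities all match the gold annotations, so no manual correction is needed. A minimal sketch of that reading, using hypothetical `pred`/`gold` structures rather than the paper's own code:

```python
from typing import Dict, List

def document_integrity_precision(
    predicted: List[Dict[str, str]],
    expected: List[Dict[str, str]],
) -> float:
    """Fraction of documents whose predicted entities match the gold
    entities exactly, i.e. documents needing no manual intervention.

    This is an assumed reading of DIP based on the abstract, not the
    paper's official definition.
    """
    assert len(predicted) == len(expected)
    if not predicted:
        return 0.0
    intact = sum(1 for pred, gold in zip(predicted, expected) if pred == gold)
    return intact / len(predicted)

# Toy example: three receipts with two entity fields each.
gold = [
    {"total": "12.99", "date": "2024-03-01"},
    {"total": "7.50",  "date": "2024-03-02"},
    {"total": "3.20",  "date": "2024-03-03"},
]
pred = [
    {"total": "12.99", "date": "2024-03-01"},   # fully correct
    {"total": "7.50",  "date": "2024-03-20"},   # one wrong field -> manual fix needed
    {"total": "3.20",  "date": "2024-03-03"},   # fully correct
]

print(document_integrity_precision(pred, gold))  # 0.666...
```

On this toy data, a field-level accuracy would still be 5/6 ≈ 0.83, while the document-level value drops to 2/3, which mirrors the abstract's point that conventional metrics can look healthy while many documents still need manual intervention.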
Related papers
- DocTTT: Test-Time Training for Handwritten Document Recognition Using Meta-Auxiliary Learning [7.036629164442979]
We introduce the DocTTT framework to address these challenges. The key innovation of our approach is that it uses test-time training to adapt the model to each specific input during testing. We propose a novel Meta-Auxiliary learning approach that combines meta-learning and a self-supervised Masked Autoencoder (MAE).
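The snippet does not describe DocTTT's procedure in detail; as a generic illustration of the test-time-training idea it refers to (adapting the model on each test input with a self-supervised reconstruction loss before predicting, leaving out the meta-auxiliary training stage), a hypothetical PyTorch-style sketch could look like this:

```python
import copy
import torch

def test_time_adapt(model, x, reconstruction_loss, steps=3, lr=1e-4):
    """Generic test-time training loop (illustrative, not DocTTT itself):
    adapt a copy of the model on a self-supervised auxiliary loss computed
    from the single test input x, then predict with the adapted copy."""
    adapted = copy.deepcopy(model)          # never touch the deployed weights
    adapted.train()
    optimizer = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = reconstruction_loss(adapted, x)   # e.g. masked-patch reconstruction
        loss.backward()
        optimizer.step()
    adapted.eval()
    with torch.no_grad():
        return adapted(x)
```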
arXiv Detail & Related papers (2025-01-22T14:18:47Z)
- Statistical Uncertainty Quantification for Aggregate Performance Metrics in Machine Learning Benchmarks [0.0]
We show how statistical methodology can be used for quantifying uncertainty in metrics that have been aggregated across multiple tasks. These techniques reveal insights such as the dominance of a specific model for certain types of tasks despite an overall poor performance.
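The snippet does not say which statistical methodology the paper applies; one standard way to quantify uncertainty in a metric aggregated across tasks is a bootstrap over tasks, sketched here with made-up per-task scores:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-task scores for one model across 8 benchmark tasks.
task_scores = np.array([0.91, 0.74, 0.88, 0.65, 0.79, 0.83, 0.70, 0.95])

# Bootstrap the aggregate (mean) score by resampling tasks with replacement.
n_boot = 10_000
boot_means = np.array([
    rng.choice(task_scores, size=task_scores.size, replace=True).mean()
    for _ in range(n_boot)
])

low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={task_scores.mean():.3f}, 95% CI=({low:.3f}, {high:.3f})")
```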
arXiv Detail & Related papers (2025-01-08T02:17:34Z)
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
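The snippet distinguishes generation-based from encoding-based approaches without giving implementation details; a common encoding-based setup is to attach a classification head to a pretrained encoder, for example via Hugging Face Transformers. The checkpoint and label set below are placeholders, not the paper's configuration:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint and label set; the paper's backbone/labels may differ.
checkpoint = "roberta-base"
labels = ["grammar", "clarity", "fact", "style", "other"]

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=len(labels)
)

old = "The result are significant."
new = "The results are significant."
# Encode the old/new revision pair as a single sequence pair.
inputs = tokenizer(old, new, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits
print(labels[logits.argmax(dim=-1).item()])
```

The classification head is randomly initialized here, so the prediction only becomes meaningful after fine-tuning on labeled revision pairs.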
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- When is an Embedding Model More Promising than Another? [33.540506562970776]
Embedders play a central role in machine learning, projecting any object into numerical representations that can be leveraged to perform various downstream tasks.
The evaluation of embedding models typically depends on domain-specific empirical approaches.
We present a unified approach to evaluate embedders, drawing upon the concepts of sufficiency and informativeness.
arXiv Detail & Related papers (2024-06-11T18:13:46Z)
- A Novel Metric for Measuring Data Quality in Classification Applications (extended version) [0.0]
We introduce and explain a novel metric to measure data quality.
This metric is based on the correlated evolution between the classification performance and the deterioration of data.
We provide an interpretation of each criterion and examples of assessment levels.
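The snippet only states that the metric couples classification performance with the deterioration of data; as a generic stand-in (not the paper's metric), one can inject increasing label noise and correlate the corruption level with the resulting test accuracy:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

noise_levels = np.linspace(0.0, 0.5, 6)   # fraction of training labels flipped
accuracies = []
for p in noise_levels:
    y_noisy = y_tr.copy()
    flip = rng.random(len(y_noisy)) < p
    y_noisy[flip] = 1 - y_noisy[flip]     # binary labels: flip the corrupted ones
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy)
    accuracies.append(clf.score(X_te, y_te))

# How strongly performance co-varies with data deterioration (negative = degrades).
print(np.corrcoef(noise_levels, accuracies)[0, 1])
```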
arXiv Detail & Related papers (2023-12-13T11:20:09Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDM) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
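As a rough, generic picture of what "identifying underperforming groups of datapoints" can mean (not Edisa's actual algorithm), one can bucket evaluation examples by a metadata feature and flag slices whose accuracy falls well below the overall average:

```python
from collections import defaultdict

# Hypothetical evaluation records: (slice feature, prediction correct?).
records = [
    ("short_text", True), ("short_text", True), ("short_text", False),
    ("long_text", True), ("long_text", False), ("long_text", False),
    ("has_numbers", False), ("has_numbers", False), ("has_numbers", True),
]

overall = sum(ok for _, ok in records) / len(records)

per_slice = defaultdict(list)
for feature, ok in records:
    per_slice[feature].append(ok)

# Flag slices that underperform the overall accuracy by a margin.
margin = 0.10
for feature, oks in per_slice.items():
    acc = sum(oks) / len(oks)
    if acc < overall - margin:
        print(f"underperforming slice: {feature} (acc={acc:.2f}, overall={overall:.2f})")
```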
arXiv Detail & Related papers (2022-11-08T19:00:00Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
- Evaluating Pre-Trained Models for User Feedback Analysis in Software Engineering: A Study on Classification of App-Reviews [2.66512000865131]
We study the accuracy and time efficiency of pre-trained neural language models (PTMs) for app review classification.
We set up different studies to evaluate PTMs in multiple settings.
In all cases, micro- and macro-averaged Precision, Recall, and F1-scores are used.
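For reference, micro- and macro-averaged precision, recall, and F1 can be computed directly with scikit-learn; the review labels below are made up:

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical gold and predicted review categories.
y_true = ["bug", "feature", "bug", "rating", "feature", "bug"]
y_pred = ["bug", "bug",     "bug", "rating", "feature", "rating"]

for average in ("micro", "macro"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=average, zero_division=0
    )
    print(f"{average}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```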
arXiv Detail & Related papers (2021-04-12T23:23:45Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model under test using a Bayesian neural network (BNN).
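The snippet leaves open how the BNN-based estimate is computed; the sketch below gives a simplified illustration of the underlying idea, using made-up posterior label samples from a surrogate model to estimate the accuracy of the model under test along with its uncertainty:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior samples from a surrogate Bayesian model: for each of
# 200 unlabeled test points, 50 sampled "true label" draws (binary task).
posterior_label_samples = rng.integers(0, 2, size=(50, 200))

# Hard predictions of the model under test on the same 200 points.
model_predictions = rng.integers(0, 2, size=200)

# Each posterior sample induces one plausible accuracy value; the spread of
# these values quantifies uncertainty about the true test accuracy.
accuracy_samples = (posterior_label_samples == model_predictions).mean(axis=1)
print(f"estimated accuracy: {accuracy_samples.mean():.3f} "
      f"+/- {accuracy_samples.std():.3f}")
```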
arXiv Detail & Related papers (2021-04-11T12:14:04Z)