ANLS* -- A Universal Document Processing Metric for Generative Large Language Models
- URL: http://arxiv.org/abs/2402.03848v7
- Date: Tue, 27 Aug 2024 08:33:29 GMT
- Title: ANLS* -- A Universal Document Processing Metric for Generative Large Language Models
- Authors: David Peer, Philemon Schöpf, Volckmar Nebendahl, Alexander Rietzler, Sebastian Stabinger
- Abstract summary: This paper introduces a new metric for evaluating generative models called ANLS*.
The ANLS* metric extends existing ANLS metrics as a drop-in replacement and is still compatible with previously reported ANLS scores.
We also benchmark a novel approach to generating prompts for documents, called SFT, against other prompting techniques such as LATIN.
- Score: 40.94659575657584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditionally, discriminative models have been the predominant choice for tasks like document classification and information extraction. These models make predictions that fall into a limited number of predefined classes, facilitating a binary true or false evaluation and enabling the direct calculation of metrics such as the F1 score. However, recent advancements in generative large language models (GLLMs) have prompted a shift in the field due to their enhanced zero-shot capabilities, which eliminate the need for a downstream dataset and computationally expensive fine-tuning. Evaluating GLLMs presents a challenge, as the binary true or false evaluation used for discriminative models is not applicable to their free-form predictions. This paper introduces a new metric for generative models called ANLS* for evaluating a wide variety of tasks, including information extraction and classification. The ANLS* metric extends existing ANLS metrics as a drop-in replacement and remains compatible with previously reported ANLS scores. An evaluation across 7 datasets, more than 10 GLLMs, and 3 prompting methods using the ANLS* metric is also provided, demonstrating the importance of the proposed metric. We also benchmark a novel approach to generating prompts for documents, called SFT, against other prompting techniques such as LATIN. In almost all cases, SFT outperforms the other techniques and improves the state of the art, sometimes by as much as 10 percentage points. Sources are available at https://github.com/deepopinion/anls_star_metric
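For orientation, the sketch below shows the core idea behind ANLS-style scoring: a normalized Levenshtein similarity that is zeroed below a threshold, extended recursively to dictionaries so that structured extraction outputs can be scored. It is a minimal illustration assuming the standard 0.5 threshold; list matching and the other refinements of ANLS* are simplified here, and the reference implementation lives in the linked repository.

```python
# Minimal sketch of ANLS-style scoring. The 0.5 threshold follows the
# classical ANLS definition; the recursive dict handling only hints at
# ANLS*, whose reference implementation is in the linked repository.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def nls(pred: str, gold: str, tau: float = 0.5) -> float:
    """Normalized Levenshtein similarity, zeroed below the threshold tau."""
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if not pred and not gold:
        return 1.0
    sim = 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))
    return sim if sim >= tau else 0.0

def anls_star(pred, gold, tau: float = 0.5) -> float:
    """Simplified recursive score: plain NLS on strings, key-wise mean on dicts."""
    if isinstance(gold, dict):
        p = pred if isinstance(pred, dict) else {}
        keys = set(gold) | set(p)
        return sum(anls_star(p.get(k), gold.get(k), tau) for k in keys) / max(len(keys), 1)
    if gold is None:
        return 1.0 if pred in (None, "") else 0.0
    return nls(str(pred if pred is not None else ""), str(gold), tau)

print(nls("Tuesday", "tuesday"))  # 1.0 (case-insensitive exact match)
print(anls_star({"date": "2024-08-27"},
                {"date": "2024-08-27", "total": "5"}))  # 0.5 (missing key scores 0)
```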
Related papers
- LML-DAP: Language Model Learning a Dataset for Data-Augmented Prediction [0.0]
This paper introduces a new approach to using Large Language Models (LLMs) for classification tasks in an explainable way.
The proposed method includes the phrase "Act as an Explainable Machine Learning Model" in the prompt to enhance the interpretability of the predictions.
In some test cases, the system scored above 90% accuracy, demonstrating its effectiveness.
arXiv Detail & Related papers (2024-09-27T17:58:50Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - Interpretable Cross-Examination Technique (ICE-T): Using highly informative features to boost LLM performance [1.1961645395911131]
In domains where interpretability is crucial, such as medicine and law, standard models often fall short due to their "black-box" nature.
ICE-T addresses these limitations by using a series of generated prompts that allow an LLM to approach the problem from multiple directions.
We demonstrate the effectiveness of ICE-T across a diverse set of data sources, including medical records and legal documents.
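The summary above suggests a simple pattern worth sketching: several targeted prompts probe the same document, and their binary answers become a feature vector for a small, interpretable classifier. The probe texts, the ask_llm() helper, and the classifier choice below are all assumptions for illustration, not the paper's exact setup.

```python
# Hedged sketch of the ICE-T idea: yes/no probes -> binary features ->
# a transparent classifier. All names here are illustrative assumptions.
from sklearn.tree import DecisionTreeClassifier

PROBES = [
    "Does the note mention a fever? Answer yes or no.",
    "Is an antibiotic prescribed? Answer yes or no.",
    "Is a follow-up visit scheduled? Answer yes or no.",
]

def features(document: str, ask_llm) -> list[int]:
    # ask_llm(prompt) is a hypothetical callable returning the LLM's text
    # answer; each probe contributes one binary feature.
    return [1 if ask_llm(f"{p}\n\n{document}").strip().lower().startswith("yes") else 0
            for p in PROBES]

# clf = DecisionTreeClassifier().fit([features(d, ask_llm) for d in docs], labels)
```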
arXiv Detail & Related papers (2024-05-08T19:20:34Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is preferred by human annotators over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - Token Prediction as Implicit Classification to Identify LLM-Generated Text [37.89852204279844]
This paper introduces a novel approach for identifying which large language models (LLMs) were involved in generating a given text.
Instead of adding an additional classification layer to a base LM, we reframe the classification task as a next-token prediction task.
We utilize the Text-to-Text Transfer Transformer (T5) model as the backbone for our experiments.
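The mechanism lends itself to a compact sketch: the language-modeling head doubles as the classifier, so predicting a class means generating its label tokens. The checkpoint, prompt prefix, and label handling below are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch of classification as next-token (sequence) prediction
# with T5. In the paper's setting the model is fine-tuned so that the
# decoder emits the label text; no extra classification layer is added.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def classify(text: str) -> str:
    # The decoder generates the label (e.g. a model name) as ordinary tokens.
    inputs = tokenizer("classify source: " + text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```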
arXiv Detail & Related papers (2023-11-15T06:33:52Z) - LLM-augmented Preference Learning from Natural Language [19.700169351688768]
Large Language Models (LLMs) are equipped to deal with larger context lengths.
LLMs can consistently outperform the state of the art (SotA) when the target text is large.
Few-shot learning yields better performance than zero-shot learning.
arXiv Detail & Related papers (2023-10-12T17:17:27Z) - Let's Predict Who Will Move to a New Job [0.0]
We discuss how machine learning is used to predict who will move to a new job.
Data is pre-processed into a suitable format for ML models.
Models are assessed using decision support metrics such as precision, recall, F1-Score, and accuracy.
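For reference, the decision-support metrics named above can be computed directly with scikit-learn; the toy labels below are illustrative only.

```python
# Hedged illustration of the listed metrics on a toy prediction set.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = will move to a new job (toy data)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```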
arXiv Detail & Related papers (2023-09-15T11:43:09Z) - Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference while retaining comparable performance.
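As context, a minimal sketch of the kNN-LM interpolation such models build on is given below; the datastore pruning and quantization that produce the reported speed-ups are not shown, and the interpolation weight is an assumed value.

```python
# Hedged sketch of kNN-LM interpolation: blend the parametric LM's
# next-token distribution with a distribution induced by retrieved
# neighbors from an external datastore. lam=0.25 is an assumption.
import numpy as np

def knn_lm_probs(lm_probs, knn_dists, knn_token_ids, vocab_size, lam=0.25):
    # Softmax over negative neighbor distances -> one weight per neighbor.
    weights = np.exp(-np.asarray(knn_dists, dtype=float))
    weights /= weights.sum()
    # Aggregate neighbor weights onto their recorded next tokens.
    knn_probs = np.zeros(vocab_size)
    for token_id, w in zip(knn_token_ids, weights):
        knn_probs[token_id] += w
    # Interpolate the parametric and non-parametric estimates.
    return (1.0 - lam) * np.asarray(lm_probs, dtype=float) + lam * knn_probs
```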
arXiv Detail & Related papers (2021-09-09T12:32:28Z) - Meta-Generating Deep Attentive Metric for Few-shot Classification [53.07108067253006]
We present a novel deep metric meta-generation method to generate a specific metric for a new few-shot learning task.
In this study, we structure the metric using a three-layer deep attentive network that is flexible enough to produce a discriminative metric for each task.
We obtain clear performance improvements over state-of-the-art competitors, especially in challenging cases.
arXiv Detail & Related papers (2020-12-03T02:07:43Z) - Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
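A compact sketch of the underlying transductive update is shown below, assuming precomputed embeddings; the confidence here is a fixed softmax over distances, whereas the paper meta-learns it.

```python
# Hedged sketch of confidence-weighted transductive prototype updates.
# The softmax-over-distances confidence is a simplifying assumption;
# the paper replaces it with a meta-learned confidence function.
import torch
import torch.nn.functional as F

def refine_prototypes(prototypes, queries, temperature=10.0, steps=3):
    # prototypes: [C, D] class means from the support set; queries: [Q, D].
    for _ in range(steps):
        # Confidence of assigning each query to each class prototype.
        conf = F.softmax(-temperature * torch.cdist(queries, prototypes), dim=1)  # [Q, C]
        # Fold confidence-weighted queries into each prototype.
        prototypes = (prototypes + conf.t() @ queries) / (1.0 + conf.sum(dim=0, keepdim=True).t())
    return prototypes
```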
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.