A Step Towards Mixture of Grader: Statistical Analysis of Existing Automatic Evaluation Metrics
- URL: http://arxiv.org/abs/2410.10030v1
- Date: Sun, 13 Oct 2024 22:10:42 GMT
- Title: A Step Towards Mixture of Grader: Statistical Analysis of Existing Automatic Evaluation Metrics
- Authors: Yun Joon Soh, Jishen Zhao
- Abstract summary: We study the statistics of existing evaluation metrics to better understand their limitations.
As a potential solution, we discuss how a Mixture of Grader could improve the quality of automatic QA evaluators.
- Score: 6.571049277167304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The explosion of open-sourced models and Question-Answering (QA) datasets emphasizes the importance of automated QA evaluation. We study the statistics of existing evaluation metrics to better understand their limitations. By measuring each evaluation metric's correlation coefficient with a human-like evaluation score, we observe the following: (1) existing metrics correlate highly with one another within each question type (e.g., single word, single phrase, etc.), and (2) no single metric can adequately estimate the human-like evaluation. As a potential solution, we discuss how a Mixture of Grader could improve the quality of automatic QA evaluators.
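To make the analysis concrete, below is a minimal sketch of the kind of per-question-type correlation study the abstract describes, followed by a naive per-type routing rule in the spirit of a Mixture of Grader. This is illustrative only: the column names, the metric list, and the routing heuristic are assumptions for the example, not the authors' actual setup.

```python
# Illustrative sketch only: column names ("question_type", "human_score") and the
# metric list are assumptions, not the paper's actual configuration.
import pandas as pd
from scipy.stats import pearsonr, spearmanr

METRICS = ["exact_match", "f1", "rouge_l", "bert_score"]  # hypothetical metric columns

def correlation_report(df: pd.DataFrame) -> pd.DataFrame:
    """Correlate each automatic metric with the human score, separately per question type."""
    rows = []
    for qtype, group in df.groupby("question_type"):
        for metric in METRICS:
            rows.append({
                "question_type": qtype,
                "metric": metric,
                "pearson": pearsonr(group[metric], group["human_score"])[0],
                "spearman": spearmanr(group[metric], group["human_score"])[0],
            })
    return pd.DataFrame(rows)

def pick_grader_per_type(report: pd.DataFrame) -> dict:
    """Naive routing rule: send each question type to the metric that correlates
    best with the human score (one simple reading of a 'mixture of graders')."""
    best = report.loc[report.groupby("question_type")["pearson"].idxmax()]
    return dict(zip(best["question_type"], best["metric"]))
```

A routing table like the one returned by `pick_grader_per_type` is only one simple way a mixture could be assembled; the paper itself discusses the idea at a higher level.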
Related papers
- Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge [51.93909886542317]
We show how *relying on a single aggregate correlation score* can obscure fundamental differences between human behavior and automatic evaluation methods.
We propose stratifying results by human label uncertainty to provide a more robust analysis of automatic evaluation performance.
arXiv Detail & Related papers (2024-10-03T03:08:29Z)
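For intuition only, here is a generic sketch of an uncertainty-stratified analysis like the one that entry describes; the data layout (columns "metric_score", "human_score", and a "human_uncertainty" measure such as annotator disagreement) is an assumption, not that paper's actual protocol.

```python
# Illustrative sketch, not code from the cited paper: bucket items by human-label
# uncertainty, then report the metric-human correlation within each bucket.
import pandas as pd
from scipy.stats import spearmanr

def stratified_correlation(df: pd.DataFrame, n_buckets: int = 3) -> pd.Series:
    # Assumed columns: "metric_score", "human_score", "human_uncertainty".
    buckets = pd.qcut(df["human_uncertainty"], q=n_buckets)  # low/mid/high uncertainty strata
    return df.groupby(buckets, observed=True).apply(
        lambda g: spearmanr(g["metric_score"], g["human_score"])[0]
    )
```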
- IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering [10.338962367542331]
In this work, we introduce an automatic evaluation framework IQA-EVAL to Interactive Question Answering Evaluation.
More specifically, we introduce LLM-based Evaluation Agent (LEA) that can: (1) simulate human behaviors to generate interactions with IQA models; (2) automatically evaluate the generated interactions.
We show that our evaluation framework with GPT-4 as the backbone model achieves a high correlation with human evaluations on the IQA task.
arXiv Detail & Related papers (2024-08-24T10:34:20Z)
- An Automatic Question Usability Evaluation Toolkit [1.2499537119440245]
Evaluating multiple-choice questions (MCQs) involves either labor-intensive human assessments or automated methods that prioritize readability.
We introduce SAQUET, an open-source tool that leverages the Item-Writing Flaws (IWF) rubric for a comprehensive and automated quality evaluation of MCQs.
With an accuracy rate of over 94%, our findings emphasize the limitations of existing evaluation methods and showcase potential in improving the quality of educational assessments.
arXiv Detail & Related papers (2024-05-30T23:04:53Z)
- SQuArE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- QAScore -- An Unsupervised Unreferenced Metric for the Question Generation Evaluation [6.697751970080859]
Question Generation (QG) aims to automate the task of composing questions for a passage with a set of chosen answers.
We propose a new reference-free evaluation metric that has the potential to provide a better mechanism for evaluating QG systems, called QAScore.
arXiv Detail & Related papers (2022-10-09T19:00:39Z)
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
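For background on the system-level correlation mentioned in that entry, the sketch below shows the standard textbook computation (not that paper's code): average a metric's scores and the human scores per system, then correlate the per-system averages. Kendall's tau is used here for illustration; Pearson or Spearman are also common choices, and the per-system dictionary layout is an assumption.

```python
# Standard system-level correlation, sketched for illustration only.
# metric_scores and human_scores map system name -> list of per-example scores (assumed layout).
from statistics import mean
from scipy.stats import kendalltau

def system_level_correlation(metric_scores: dict, human_scores: dict) -> float:
    systems = sorted(metric_scores)
    metric_avgs = [mean(metric_scores[s]) for s in systems]
    human_avgs = [mean(human_scores[s]) for s in systems]
    return kendalltau(metric_avgs, human_avgs)[0]  # tau between per-system averages
```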
- QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-16T00:38:35Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
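As generic background for the resampling idea in that entry, here is a plain bootstrap sketch for a confidence interval on a metric-human correlation. The percentile-bootstrap scheme and the Pearson choice are assumptions for illustration, not necessarily that paper's exact procedure.

```python
# Generic bootstrap confidence interval for a metric-human correlation (illustrative sketch).
import numpy as np
from scipy.stats import pearsonr

def bootstrap_correlation_ci(metric, human, n_boot=1000, alpha=0.05, seed=0):
    metric, human = np.asarray(metric), np.asarray(human)
    rng = np.random.default_rng(seed)
    n = len(metric)
    samples = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample examples with replacement
        samples.append(pearsonr(metric[idx], human[idx])[0])
    lo, hi = np.percentile(samples, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```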
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.