Evaluating Open-QA Evaluation
- URL: http://arxiv.org/abs/2305.12421v4
- Date: Mon, 23 Oct 2023 14:26:26 GMT
- Title: Evaluating Open-QA Evaluation
- Authors: Cunxiang Wang, Sirui Cheng, Qipeng Guo, Yuanhao Yue, Bowen Ding,
Zhikun Xu, Yidong Wang, Xiangkun Hu, Zheng Zhang, Yue Zhang
- Abstract summary: This study focuses on the evaluation of the Open Question Answering (Open-QA) task, which can directly estimate the factuality of large language models (LLMs).
We introduce a new task, Evaluating QA Evaluation (QA-Eval) and the corresponding dataset EVOUNA, designed to assess the accuracy of AI-generated answers in relation to standard answers within Open-QA.
- Score: 29.43815593419996
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study focuses on the evaluation of the Open Question Answering (Open-QA)
task, which can directly estimate the factuality of large language models
(LLMs). Current automatic evaluation methods have shown limitations, indicating
that human evaluation still remains the most reliable approach. We introduce a
new task, Evaluating QA Evaluation (QA-Eval) and the corresponding dataset
EVOUNA, designed to assess the accuracy of AI-generated answers in relation to
standard answers within Open-QA. Our evaluation of these methods utilizes
human-annotated results to measure their performance. Specifically, the work
investigates methods that show high correlation with human evaluations, deeming
them more reliable. We also discuss the pitfalls of current methods and ways
to improve LLM-based evaluators. We believe this new QA-Eval task and
corresponding dataset EVOUNA will facilitate the development of more effective
automatic evaluation tools and prove valuable for future research in this area.
All resources are available at https://github.com/wangcunxiang/QA-Eval under
the Apache-2.0 License.
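As a rough illustration of the QA-Eval setup, the sketch below measures how often a simple automatic judge agrees with human correctness labels. The record format, the exact_match_judge baseline, and the toy examples are assumptions made for illustration; they are not taken from EVOUNA or the released code.

```python
# Minimal sketch of the QA-Eval idea: an automatic evaluator labels each
# model answer as correct/incorrect, and we measure how often those labels
# agree with human annotations. The record format here is hypothetical.

from typing import Callable, Dict, List


def exact_match_judge(question: str, model_answer: str, gold_answers: List[str]) -> bool:
    """A deliberately simple automatic evaluator: lexical containment of a gold answer."""
    normalized = model_answer.lower()
    return any(gold.lower() in normalized for gold in gold_answers)


def agreement_with_humans(records: List[Dict], judge: Callable[[str, str, List[str]], bool]) -> float:
    """Fraction of examples where the automatic judge matches the human label."""
    matches = 0
    for r in records:
        auto_label = judge(r["question"], r["model_answer"], r["gold_answers"])
        matches += int(auto_label == r["human_label"])
    return matches / len(records)


if __name__ == "__main__":
    # Toy, hand-made examples; a real QA-Eval run would use human-annotated records.
    data = [
        {"question": "Who wrote Hamlet?", "model_answer": "It was written by William Shakespeare.",
         "gold_answers": ["William Shakespeare"], "human_label": True},
        {"question": "What is the capital of Australia?", "model_answer": "Sydney",
         "gold_answers": ["Canberra"], "human_label": False},
    ]
    print(f"agreement with human labels: {agreement_with_humans(data, exact_match_judge):.2f}")
```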
Related papers
- IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering [10.338962367542331]
In this work, we introduce IQA-EVAL, an automatic evaluation framework for Interactive Question Answering (IQA).
More specifically, we introduce LLM-based Evaluation Agent (LEA) that can: (1) simulate human behaviors to generate interactions with IQA models; (2) automatically evaluate the generated interactions.
We show that our evaluation framework with GPT-4 as the backbone model achieves a high correlation with human evaluations on the IQA task.
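A minimal sketch of the two LEA stages summarized above, assuming a generic chat callable that stands in for any LLM backbone; it is not the paper's implementation.

```python
# Sketch of the two LEA stages: (1) simulate a user interacting with an IQA
# system, (2) have an LLM-based agent rate the resulting transcript.
# `Chat` is a hypothetical stand-in for any chat-completion API.

from typing import Callable, List, Tuple

Chat = Callable[[str], str]  # prompt -> model reply


def simulate_interaction(agent: Chat, iqa_system: Chat, question: str, turns: int = 3) -> List[Tuple[str, str]]:
    """Stage 1: the evaluation agent plays the user for a few turns."""
    transcript = []
    user_msg = question
    for _ in range(turns):
        answer = iqa_system(user_msg)
        transcript.append((user_msg, answer))
        user_msg = agent(
            "You are simulating a user in an interactive QA session.\n"
            f"Last system answer: {answer}\nAsk a short follow-up question."
        )
    return transcript


def evaluate_interaction(agent: Chat, transcript: List[Tuple[str, str]]) -> str:
    """Stage 2: ask the evaluation agent to rate the whole interaction."""
    rendered = "\n".join(f"User: {u}\nSystem: {a}" for u, a in transcript)
    return agent(f"Rate this interactive QA session from 1 to 5 and explain briefly:\n{rendered}")
```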
arXiv Detail & Related papers (2024-08-24T10:34:20Z)
- Accurate and Nuanced Open-QA Evaluation Through Textual Entailment [4.762213968673381]
We propose to study the entailment relations of answers to identify more informative and more general system answers.
The entailment-based evaluation we propose allows the assignment of bonus or partial marks by quantifying the inference gap between answers.
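To make the entailment idea concrete, here is a sketch built on an off-the-shelf NLI model; the bidirectional crediting rule is an assumption for illustration, not the exact scheme proposed in that paper.

```python
# Rough sketch of entailment-based answer scoring with an off-the-shelf NLI
# model. The crediting rule (full credit for mutual entailment, partial credit
# for one direction) is an illustrative assumption.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)


def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that `premise` entails `hypothesis` under the NLI model."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    return probs[2].item()  # label order for roberta-large-mnli: contradiction, neutral, entailment


def graded_score(system_answer: str, gold_answer: str, threshold: float = 0.5) -> float:
    """1.0 for mutual entailment, 0.5 for one-directional entailment, else 0.0."""
    forward = entailment_prob(system_answer, gold_answer) > threshold
    backward = entailment_prob(gold_answer, system_answer) > threshold
    return 1.0 if (forward and backward) else (0.5 if (forward or backward) else 0.0)
```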
arXiv Detail & Related papers (2024-05-26T21:33:27Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is preferred by human annotators over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate [17.77014177096838]
This paper explores the assumption that Large Language Models (LLMs) skilled in generation tasks are equally adept as evaluators.
We assess the performance of three LLMs and one open-source LM in Question-Answering (QA) and evaluation tasks using the TriviaQA dataset.
arXiv Detail & Related papers (2024-02-09T06:16:08Z)
- PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-form text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
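A minimal sketch of the proxy-question scoring loop described above, assuming a hypothetical answer_with_context evaluator call and a simple string-containment check in place of the paper's answer matching.

```python
# Sketch of a ProxyQA-style scoring loop: feed the generated long-form text to
# an evaluator, ask each pre-annotated proxy-question, and report accuracy.
# `answer_with_context` is a hypothetical stand-in for an LLM evaluator call,
# and the containment check is a simplification.

from typing import Callable, Dict, List


def proxy_score(generated_text: str,
                proxy_questions: List[Dict[str, str]],
                answer_with_context: Callable[[str, str], str]) -> float:
    """Fraction of proxy-questions the evaluator answers correctly from the generated text."""
    correct = 0
    for item in proxy_questions:
        prediction = answer_with_context(generated_text, item["question"])
        correct += int(item["answer"].lower() in prediction.lower())
    return correct / len(proxy_questions)
```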
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
- F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark for fundamental abilities, including expression, commonsense, and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z)
- SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
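One plausible way to combine multiple positive and negative references, sketched with sentence embeddings; the max-similarity difference used here is an assumption for illustration and may differ from the actual SQuArE formulation.

```python
# Sketch of scoring a candidate answer against multiple positive and negative
# references with sentence embeddings. The "max positive minus max negative
# similarity" rule is an illustrative assumption, not the SQuArE formula.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")


def reference_score(candidate: str, positives: list, negatives: list) -> float:
    """Higher when the candidate is closer to correct references than to known-wrong ones."""
    cand = model.encode(candidate, convert_to_tensor=True)
    pos = model.encode(positives, convert_to_tensor=True)
    neg = model.encode(negatives, convert_to_tensor=True)
    best_pos = util.cos_sim(cand, pos).max().item()
    best_neg = util.cos_sim(cand, neg).max().item()
    return best_pos - best_neg
```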
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
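The psychometric machinery can be made concrete with a standard two-parameter logistic (2PL) IRT model; the generic sketch below illustrates adaptive item selection by Fisher information and is not the specific procedure analyzed in that paper.

```python
# Generic illustration of adaptive testing with a two-parameter logistic IRT
# model: estimate the subject's ability, then pick the unseen item with the
# highest Fisher information at that ability. Item parameters are made up.

import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def next_item(theta: float, items: dict) -> str:
    """Choose the unadministered item that is most informative at the current ability estimate."""
    return max(items, key=lambda name: item_information(theta, *items[name]))


if __name__ == "__main__":
    # name -> (discrimination a, difficulty b); values are illustrative only
    bank = {"easy": (1.2, -1.0), "medium": (1.5, 0.0), "hard": (1.0, 1.5)}
    print(next_item(theta=0.2, items=bank))  # picks the item best matched to ability ~0.2
```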
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- The Meta-Evaluation Problem in Explainable AI: Identifying Reliable Estimators with MetaQuantus [10.135749005469686]
One of the unsolved challenges in the field of Explainable AI (XAI) is determining how to most reliably estimate the quality of an explanation method.
We address this issue through a meta-evaluation of different quality estimators in XAI.
Our novel framework, MetaQuantus, analyses two complementary performance characteristics of a quality estimator.
arXiv Detail & Related papers (2023-02-14T18:59:02Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.