Accurate and Nuanced Open-QA Evaluation Through Textual Entailment
- URL: http://arxiv.org/abs/2405.16702v1
- Date: Sun, 26 May 2024 21:33:27 GMT
- Title: Accurate and Nuanced Open-QA Evaluation Through Textual Entailment
- Authors: Peiran Yao, Denilson Barbosa
- Abstract summary: We propose to study the entailment relations of answers to identify more informative and more general system answers.
The entailment-based evaluation we propose allows the assignment of bonus or partial marks by quantifying the inference gap between answers.
- Score: 4.762213968673381
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Open-domain question answering (Open-QA) is a common task for evaluating large language models (LLMs). However, current Open-QA evaluations are criticized for the ambiguity in questions and the lack of semantic understanding in evaluators. Complex evaluators, powered by foundation models or LLMs and targeting semantic equivalence, still deviate from human judgments by a large margin. We propose to study the entailment relations of answers to identify more informative and more general system answers, offering a much closer evaluation to human judgment on both NaturalQuestions and TriviaQA while being learning-free. The entailment-based evaluation we propose allows the assignment of bonus or partial marks by quantifying the inference gap between answers, enabling a nuanced ranking of answer correctness that has higher AUC than current methods.
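The core idea can be sketched with an off-the-shelf NLI model: check whether the system answer entails the gold answer (and vice versa), and turn the inference gap into full, bonus, or partial credit. The sketch below is not the authors' implementation; the model choice (roberta-large-mnli), the question-conditioning of premises, the 0.5 thresholds, and the credit formula in `grade` are all illustrative assumptions.

```python
# Minimal sketch of entailment-based Open-QA grading (illustrative, not the paper's code).
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

MODEL = "roberta-large-mnli"  # any off-the-shelf NLI model could be substituted
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that `premise` entails `hypothesis` under the NLI model."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1)[0]
    # roberta-large-mnli label order: contradiction, neutral, entailment
    return probs[2].item()

def grade(question: str, gold: str, system: str) -> float:
    """Assign credit from two-way entailment; thresholds and weights are assumptions."""
    sys_entails_gold = entailment_prob(f"{question} {system}", f"{question} {gold}")
    gold_entails_sys = entailment_prob(f"{question} {gold}", f"{question} {system}")
    if sys_entails_gold > 0.5 and gold_entails_sys > 0.5:
        return 1.0                              # equivalent answers: full credit
    if sys_entails_gold > 0.5:
        return 0.5 + 0.5 * gold_entails_sys     # more informative answer: bonus credit
    if gold_entails_sys > 0.5:
        return 0.5 * sys_entails_gold           # more general answer: partial credit
    return 0.0

print(grade("Where was Barack Obama born?", "Hawaii", "Honolulu, Hawaii"))
```

Because the grader is a frozen NLI model applied at inference time, this style of evaluation is learning-free in the sense used in the abstract: no evaluator is trained on answer-equivalence data.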
Related papers
- LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs [61.57691505683534]
Non-Factoid (NF) Question Answering (QA) is challenging to evaluate due to diverse potential answers and the lack of an objective criterion.
Large Language Models (LLMs) have been adopted for NFQA evaluation due to their compelling performance on various NLP tasks.
We propose a novel listwise NFQA evaluation approach that utilizes LLMs to rank candidate answers within a list of reference answers sorted by descending quality.
arXiv Detail & Related papers (2024-09-23T06:42:21Z) - CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering [14.366087533102656]
Question answering (QA) can only make progress if we know whether an answer is correct.
Current evaluation metrics to determine answer equivalence (AE) often do not align with human judgments.
arXiv Detail & Related papers (2024-01-24T01:30:25Z) - QUDEVAL: The Evaluation of Questions Under Discussion Discourse Parsing [87.20804165014387]
Questions Under Discussion (QUD) is a versatile linguistic framework in which discourse progresses by continuously raising questions and answering them.
This work introduces the first framework for the automatic evaluation of QUD parsing.
We present QUDeval, a dataset of fine-grained evaluation of 2,190 QUD questions generated from both fine-tuned systems and LLMs.
arXiv Detail & Related papers (2023-10-23T03:03:58Z) - SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation)
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z) - A Critical Evaluation of Evaluations for Long-form Question Answering [48.51361567469683]
Long-form question answering (LFQA) enables answering a wide range of questions, but its flexibility poses enormous challenges for evaluation.
We perform the first targeted study of the evaluation of long-form answers, covering both human and automatic evaluation practices.
arXiv Detail & Related papers (2023-05-29T16:54:24Z) - Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response [56.25966921370483]
There are challenges in using reference-free evaluators based on large language models.
Reference-free evaluators are more suitable for open-ended examples with semantically diverse responses.
There are risks in using reference-free evaluators based on LLMs to evaluate the quality of dialogue responses.
arXiv Detail & Related papers (2023-05-24T02:52:48Z) - Evaluating Open-QA Evaluation [29.43815593419996]
This study focuses on the evaluation of the Open Question Answering (Open-QA) task, which can directly estimate the factuality of large language models (LLMs).
We introduce a new task, Evaluating QA Evaluation (QA-Eval) and the corresponding dataset EVOUNA, designed to assess the accuracy of AI-generated answers in relation to standard answers within Open-QA.
arXiv Detail & Related papers (2023-05-21T10:40:55Z) - Multifaceted Improvements for Conversational Open-Domain Question Answering [54.913313912927045]
We propose a framework with Multifaceted Improvements for Conversational open-domain Question Answering (MICQA).
First, the proposed KL-divergence-based regularization leads to better question understanding for retrieval and answer reading.
Second, the added post-ranker module pushes more relevant passages to the top positions so that they are selected for the reader under a two-aspect constraint.
Third, the well-designed curriculum learning strategy effectively narrows the gap between the golden-passage settings of training and inference, encouraging the reader to find the true answer without golden-passage assistance.
arXiv Detail & Related papers (2022-04-01T07:54:27Z)