KPQA: A Metric for Generative Question Answering Using Keyphrase Weights
- URL: http://arxiv.org/abs/2005.00192v3
- Date: Thu, 15 Apr 2021 10:09:41 GMT
- Title: KPQA: A Metric for Generative Question Answering Using Keyphrase Weights
- Authors: Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung
Bui, Joongbo Shin and Kyomin Jung
- Abstract summary: KPQA-metric is a new metric for evaluating the correctness of generative question answering systems.
Our new metric assigns different weights to each token via keyphrase prediction.
We show that our proposed metric has a significantly higher correlation with human judgments than existing metrics.
- Score: 64.54593491919248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the automatic evaluation of generative question answering (GenQA) systems,
it is difficult to assess the correctness of generated answers because of their
free-form nature. In particular, widely used n-gram similarity metrics often fail
to discriminate incorrect answers because they weight all tokens
equally. To alleviate this problem, we propose KPQA-metric, a new
metric for evaluating the correctness of GenQA. Specifically, our new metric
assigns different weights to each token via keyphrase prediction, thereby
judging whether a generated answer sentence captures the key meaning of the
reference answer. To evaluate our metric, we create high-quality human
judgments of correctness on two GenQA datasets. Using our human-evaluation
datasets, we show that our proposed metric has a significantly higher
correlation with human judgments than existing metrics. The code is available
at https://github.com/hwanheelee1993/KPQA.
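To make the keyphrase-weighting idea concrete, the following is a minimal sketch of a
keyphrase-weighted token-overlap F1, written under the assumption that a keyphrase
predictor supplies a per-token importance weight. The `keyphrase_weight` callable and
the toy weight table are hypothetical stand-ins for illustration, not the released
KPQA implementation (see the repository above for that).

```python
# Minimal sketch: keyphrase-weighted token-overlap F1.
# `keyphrase_weight` stands in for a trained keyphrase predictor (an
# assumption made for illustration); the official code lives in the KPQA repo.
from collections import Counter
from typing import Callable, List


def weighted_f1(
    generated: List[str],
    reference: List[str],
    keyphrase_weight: Callable[[str], float],
) -> float:
    """Token-overlap F1 where each token counts by its keyphrase weight."""
    gen_counts, ref_counts = Counter(generated), Counter(reference)
    overlap = gen_counts & ref_counts  # per-token minimum counts

    def mass(counts: Counter) -> float:
        return sum(keyphrase_weight(tok) * n for tok, n in counts.items())

    overlap_mass = mass(overlap)
    if overlap_mass == 0.0:
        return 0.0
    precision = overlap_mass / max(mass(gen_counts), 1e-8)
    recall = overlap_mass / max(mass(ref_counts), 1e-8)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy weights: the answer-bearing keyphrase matters most, stopwords barely count.
    weights = {"paris": 1.0, "capital": 0.6, "france": 0.6}
    weight_fn = lambda tok: weights.get(tok, 0.1)
    reference = "the capital of france is paris".split()
    correct = "paris is the capital".split()
    incorrect = "the capital of france is lyon".split()
    print(weighted_f1(correct, reference, weight_fn))    # ~0.84
    print(weighted_f1(incorrect, reference, weight_fn))  # ~0.73
```

In this toy example, plain unweighted token-overlap F1 would score the incorrect answer
higher than the correct one (0.83 vs. 0.80) because it treats every token equally;
weighting the keyphrase "paris" reverses that ordering, which is the failure mode of
n-gram metrics that the abstract describes.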
Related papers
- RAG-ConfusionQA: A Benchmark for Evaluating LLMs on Confusing Questions [52.33835101586687]
Conversational AI agents use Retrieval Augmented Generation (RAG) to provide verifiable document-grounded responses to user inquiries.
This paper presents a novel synthetic data generation method to efficiently create a diverse set of context-grounded confusing questions from a given document corpus.
arXiv Detail & Related papers (2024-10-18T16:11:29Z)
- LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs [61.57691505683534]
Non-Factoid (NF) Question Answering (QA) is challenging to evaluate because answers are diverse and there is no objective criterion.
Large Language Models (LLMs) have been adopted for NFQA evaluation due to their compelling performance on various NLP tasks.
We propose a novel listwise NFQA evaluation approach that uses LLMs to rank candidate answers against a list of reference answers sorted in descending order of quality.
arXiv Detail & Related papers (2024-09-23T06:42:21Z)
- QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation [9.001613702628253]
Human evaluation is widely used in the field of question generation (QG) and serves as the gold standard for automatic metrics.
However, there is a lack of unified human evaluation criteria, which hampers consistent evaluation of both QG models and automatic metrics.
We propose QGEval, a multi-dimensional evaluation benchmark for question generation that evaluates both generated questions and existing automatic metrics across 7 dimensions.
arXiv Detail & Related papers (2024-06-09T09:51:55Z)
- Reference-based Metrics Disprove Themselves in Question Generation [17.83616985138126]
We find that using human-written references cannot guarantee the effectiveness of reference-based metrics.
A good metric is expected to grade a human-validated question no worse than generated questions.
We propose a reference-free metric consisting of multi-dimensional criteria such as naturalness, answerability, and complexity.
arXiv Detail & Related papers (2024-03-18T20:47:10Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study whether there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results show that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question [29.18544401904503]
We propose a new metric, RQUGE, based on the answerability of the candidate question given the context.
We demonstrate that RQUGE has a higher correlation with human judgment without relying on the reference question.
arXiv Detail & Related papers (2022-11-02T21:10:09Z)
- QAScore -- An Unsupervised Unreferenced Metric for the Question Generation Evaluation [6.697751970080859]
Question Generation (QG) aims to automate the task of composing questions for a passage with a set of chosen answers.
We propose QAScore, a new reference-free evaluation metric that has the potential to provide a better mechanism for evaluating QG systems.
arXiv Detail & Related papers (2022-10-09T19:00:39Z)
- Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics [74.28810048824519]
Question answering-based summarization evaluation metrics must automatically determine whether the QA model's prediction is correct or not.
We benchmark the lexical answer verification methods that have been used by current QA-based metrics, as well as two more sophisticated text comparison methods.
arXiv Detail & Related papers (2022-04-21T15:43:45Z)