Related papers: KPQA: A Metric for Generative Question Answering Using Keyphrase Weights

KPQA: A Metric for Generative Question Answering Using Keyphrase Weights

URL: http://arxiv.org/abs/2005.00192v3
Date: Thu, 15 Apr 2021 10:09:41 GMT
Title: KPQA: A Metric for Generative Question Answering Using Keyphrase Weights
Authors: Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Joongbo Shin and Kyomin Jung
Abstract summary: KPQA-metric is a new metric for evaluating correctness of generative question answering systems. Our new metric assigns different weights to each token via keyphrase prediction. We show that our proposed metric has a significantly higher correlation with human judgments than existing metrics.
Score: 64.54593491919248
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In the automatic evaluation of generative question answering (GenQA) systems, it is difficult to assess the correctness of generated answers due to the free-form of the answer. Especially, widely used n-gram similarity metrics often fail to discriminate the incorrect answers since they equally consider all of the tokens. To alleviate this problem, we propose KPQA-metric, a new metric for evaluating the correctness of GenQA. Specifically, our new metric assigns different weights to each token via keyphrase prediction, thereby judging whether a generated answer sentence captures the key meaning of the reference answer. To evaluate our metric, we create high-quality human judgments of correctness on two GenQA datasets. Using our human-evaluation datasets, we show that our proposed metric has a significantly higher correlation with human judgments than existing metrics. The code is available at https://github.com/hwanheelee1993/KPQA.

Related papers

What should an AI assessor optimise for? [57.96463917842822]
An AI assessor is an external, ideally indepen-dent system that predicts an indicator, e.g., a loss value, of another AI system.<n>Here we address the question: is it always optimal to train the assessor for the target metric?<n>We experimentally explore this question for, respectively, regression losses and classification scores with monotonic and non-monotonic mappings.
arXiv Detail & Related papers (2025-02-01T08:41:57Z)
RAG-ConfusionQA: A Benchmark for Evaluating LLMs on Confusing Questions [52.33835101586687]
Conversational AI agents use Retrieval Augmented Generation (RAG) to provide verifiable document-grounded responses to user inquiries. This paper presents a novel synthetic data generation method to efficiently create a diverse set of context-grounded confusing questions from a given document corpus.
arXiv Detail & Related papers (2024-10-18T16:11:29Z)
LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs [61.57691505683534]
Non-Factoid (NF) Question Answering (QA) is challenging to evaluate due to diverse potential answers and no objective criterion. Large Language Models (LLMs) have been resorted to for NFQA evaluation due to their compelling performance on various NLP tasks. We propose a novel listwise NFQA evaluation approach, that utilizes LLMs to rank candidate answers in a list of reference answers sorted by descending quality.
arXiv Detail & Related papers (2024-09-23T06:42:21Z)
QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation [9.001613702628253]
Human evaluation is widely used in the field of question generation (QG) and serves as the gold standard for automatic metrics. There is a lack of unified human evaluation criteria, which hampers consistent evaluations of both QG models and automatic metrics. We propose QGEval, a multi-dimensional Evaluation benchmark for Question Generation, which evaluates both generated questions and existing automatic metrics across 7 dimensions.
arXiv Detail & Related papers (2024-06-09T09:51:55Z)
Query Performance Prediction using Relevance Judgments Generated by Large Language Models [53.97064615557883]
We propose a QPP framework using automatically generated relevance judgments (QPP-GenRE) QPP-GenRE decomposes QPP into independent subtasks of predicting relevance of each item in a ranked list to a given query. This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels.
arXiv Detail & Related papers (2024-04-01T09:33:05Z)
Reference-based Metrics Disprove Themselves in Question Generation [17.83616985138126]
We find that using human-written references cannot guarantee the effectiveness of reference-based metrics. A good metric is expected to grade a human-validated question no worse than generated questions. We propose a reference-free metric consisted of multi-dimensional criteria such as naturalness, answerability, and complexity.
arXiv Detail & Related papers (2024-03-18T20:47:10Z)
Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged. In this paper, we study if there are any deficiencies in reference-free metrics. We employ GPT-4V as an evaluative tool to assess generated sentences and the result reveals that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation) We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question [29.18544401904503]
We propose a new metric, RQUGE, based on the answerability of the candidate question given the context. We demonstrate that RQUGE has a higher correlation with human judgment without relying on the reference question.
arXiv Detail & Related papers (2022-11-02T21:10:09Z)
QAScore -- An Unsupervised Unreferenced Metric for the Question Generation Evaluation [6.697751970080859]
Question Generation (QG) aims to automate the task of composing questions for a passage with a set of chosen answers. We propose a new reference-free evaluation metric that has the potential to provide a better mechanism for evaluating QG systems, called QAScore.
arXiv Detail & Related papers (2022-10-09T19:00:39Z)
Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics [74.28810048824519]
Question answering-based summarization evaluation metrics must automatically determine whether the QA model's prediction is correct or not. We benchmark the lexical answer verification methods which have been used by current QA-based metrics as well as two more sophisticated text comparison methods.
arXiv Detail & Related papers (2022-04-21T15:43:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.