QAScore -- An Unsupervised Unreferenced Metric for the Question
Generation Evaluation
- URL: http://arxiv.org/abs/2210.04320v1
- Date: Sun, 9 Oct 2022 19:00:39 GMT
- Title: QAScore -- An Unsupervised Unreferenced Metric for the Question
Generation Evaluation
- Authors: Tianbo Ji, Chenyang Lyu, Gareth Jones, Liting Zhou, Yvette Graham
- Abstract summary: Question Generation (QG) aims to automate the task of composing questions for a passage with a set of chosen answers.
We propose QAScore, a new reference-free evaluation metric that has the potential to provide a better mechanism for evaluating QG systems.
- Score: 6.697751970080859
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Question Generation (QG) aims to automate the task of composing questions for
a passage with a set of chosen answers found within the passage. In recent
years, the introduction of neural generation models has resulted in substantial
improvements of automatically generated questions in terms of quality,
especially compared to traditional approaches that employ manually crafted
heuristics. However, the metrics commonly applied in QG evaluations have been
criticized for their low agreement with human judgement. We therefore propose
QAScore, a new reference-free evaluation metric that has the potential to
provide a better mechanism for evaluating QG systems. Instead of fine-tuning a
language model to maximize its correlation with human judgements, QAScore
evaluates a question by computing the cross entropy according to the
probability that the language model can correctly generate the masked words in
the answer to that question. Furthermore, we conduct a new crowd-sourcing human
evaluation experiment for the QG evaluation to investigate how QAScore and
other metrics can correlate with human judgements. Experiments show that
QAScore obtains a stronger correlation with the results of our proposed human
evaluation method compared to existing traditional word-overlap-based metrics
such as BLEU and ROUGE, as well as the existing pretrained-model-based metric
BERTScore.
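To make the mechanism described in the abstract concrete, the sketch below shows a QAScore-style computation: it masks each answer token in turn and accumulates the log-probability that a pretrained masked language model assigns to the gold token given the passage, the question, and the rest of the answer. This is an illustrative reconstruction, not the authors' released implementation; the `roberta-base` checkpoint, the simple passage/question concatenation, and the `qascore` helper are assumptions made for this sketch, and the Hugging Face `transformers` library is assumed to be available.

```python
# Illustrative sketch only: mask answer tokens one at a time and sum the masked-LM
# log-probabilities of the gold tokens (model choice and formatting are assumptions).
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")
model.eval()


def qascore(passage: str, question: str, answer: str) -> float:
    """Sum of log-probabilities of the answer tokens, each predicted with that
    token masked while the passage, question, and remaining answer stay visible.
    (Truncation to the model's 512-token limit is omitted for brevity.)"""
    context_ids = tokenizer.encode(passage + " " + question, add_special_tokens=False)
    answer_ids = tokenizer.encode(answer, add_special_tokens=False)
    total_log_prob = 0.0
    for i, gold_id in enumerate(answer_ids):
        masked = list(answer_ids)
        masked[i] = tokenizer.mask_token_id  # hide the i-th answer token
        input_ids = torch.tensor(
            [[tokenizer.cls_token_id] + context_ids + masked + [tokenizer.sep_token_id]]
        )
        with torch.no_grad():
            logits = model(input_ids).logits
        mask_pos = 1 + len(context_ids) + i  # +1 accounts for the leading <s> token
        log_probs = torch.log_softmax(logits[0, mask_pos], dim=-1)
        total_log_prob += log_probs[gold_id].item()
    return total_log_prob  # closer to 0 means the answer is more predictable from the question


if __name__ == "__main__":
    print(qascore(
        passage="The Eiffel Tower was completed in 1889 and stands in Paris.",
        question="When was the Eiffel Tower completed?",
        answer="1889",
    ))
```

With per-question scores aggregated per system, the correlation with human judgements reported above would then typically be measured with Pearson or Spearman coefficients (e.g. `scipy.stats.pearsonr`).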
Related papers
- MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation [0.4857223913212445]
We propose a novel system, MIRROR, to automate the evaluation process for questions generated by automated question generation systems.
We observed that the scores of human evaluation metrics, namely relevance, appropriateness, novelty, complexity, and grammaticality, improved when using the feedback-based approach called MIRROR.
arXiv Detail & Related papers (2024-10-16T12:24:42Z) - LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs [61.57691505683534]
Non-Factoid (NF) Question Answering (QA) is challenging to evaluate due to the diversity of potential answers and the lack of an objective criterion.
Large Language Models (LLMs) have been adopted for NFQA evaluation owing to their compelling performance on various NLP tasks.
We propose a novel listwise NFQA evaluation approach that utilizes LLMs to rank candidate answers within a list of reference answers sorted by descending quality.
arXiv Detail & Related papers (2024-09-23T06:42:21Z) - Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model [75.66013048128302]
In this work, we investigate the potential of employing the QE model as the reward model to predict human preferences for feedback training.
We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines.
To address this problem, we adopt a simple yet effective method that uses rules to detect incorrect translations and assigns a penalty term to their reward scores.
arXiv Detail & Related papers (2024-01-23T16:07:43Z) - Automatic Answerability Evaluation for Question Generation [32.1067137848404]
This work proposes PMAN, a novel automatic evaluation metric to assess whether the generated questions are answerable by the reference answers.
Our implementation of a GPT-based QG model achieves state-of-the-art performance in generating answerable questions.
arXiv Detail & Related papers (2023-09-22T00:13:07Z) - SQUARE: Automatic Question Answering Evaluation using Multiple Positive
and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z) - Evaluation of Question Generation Needs More References [7.876222232341623]
We propose to paraphrase the reference question for a more robust QG evaluation.
Using large language models such as GPT-3, we created semantically and syntactically diverse questions.
arXiv Detail & Related papers (2023-05-26T04:40:56Z) - Learning Answer Generation using Supervision from Automatic Question
Answering Evaluators [98.9267570170737]
We propose a novel training paradigm for GenQA using supervision from automatic QA evaluation models (GAVA).
We evaluate our proposed methods on two academic and one industrial dataset, obtaining a significant improvement in answering accuracy over the previous state of the art.
arXiv Detail & Related papers (2023-05-24T16:57:04Z) - Improving Visual Question Answering Models through Robustness Analysis
and In-Context Learning with a Chain of Basic Questions [70.70725223310401]
This work proposes a new method that utilizes semantically related questions, referred to as basic questions, acting as noise to evaluate the robustness of VQA models.
The experimental results demonstrate that the proposed evaluation method effectively analyzes the robustness of VQA models.
arXiv Detail & Related papers (2023-04-06T15:32:35Z) - KPQA: A Metric for Generative Question Answering Using Keyphrase Weights [64.54593491919248]
KPQA-metric is a new metric for evaluating the correctness of generative question answering systems.
Our new metric assigns different weights to each token via keyphrase prediction.
We show that our proposed metric has a significantly higher correlation with human judgments than existing metrics.
arXiv Detail & Related papers (2020-05-01T03:24:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.