Related papers: QAScore -- An Unsupervised Unreferenced Metric for the Question Generation Evaluation

QAScore -- An Unsupervised Unreferenced Metric for the Question Generation Evaluation

URL: http://arxiv.org/abs/2210.04320v1
Date: Sun, 9 Oct 2022 19:00:39 GMT
Title: QAScore -- An Unsupervised Unreferenced Metric for the Question Generation Evaluation
Authors: Tianbo Ji, Chenyang Lyu, Gareth Jones, Liting Zhou, Yvette Graham
Abstract summary: Question Generation (QG) aims to automate the task of composing questions for a passage with a set of chosen answers. We propose a new reference-free evaluation metric that has the potential to provide a better mechanism for evaluating QG systems, called QAScore.
Score: 6.697751970080859
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Question Generation (QG) aims to automate the task of composing questions for a passage with a set of chosen answers found within the passage. In recent years, the introduction of neural generation models has resulted in substantial improvements of automatically generated questions in terms of quality, especially compared to traditional approaches that employ manually crafted heuristics. However, the metrics commonly applied in QG evaluations have been criticized for their low agreement with human judgement. We therefore propose a new reference-free evaluation metric that has the potential to provide a better mechanism for evaluating QG systems, called QAScore. Instead of fine-tuning a language model to maximize its correlation with human judgements, QAScore evaluates a question by computing the cross entropy according to the probability that the language model can correctly generate the masked words in the answer to that question. Furthermore, we conduct a new crowd-sourcing human evaluation experiment for the QG evaluation to investigate how QAScore and other metrics can correlate with human judgements. Experiments show that QAScore obtains a stronger correlation with the results of our proposed human evaluation method compared to existing traditional word-overlap-based metrics such as BLEU and ROUGE, as well as the existing pretrained-model-based metric BERTScore.

Related papers

MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation [0.4857223913212445]
We propose a novel system, MIRROR, to automate the evaluation process for questions generated by automated question generation systems. We observed that the scores of human evaluation metrics, namely relevance, appropriateness, novelty, complexity, and grammaticality, improved when using the feedback-based approach called MIRROR.
arXiv Detail & Related papers (2024-10-16T12:24:42Z)
LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs [61.57691505683534]
Non-Factoid (NF) Question Answering (QA) is challenging to evaluate due to diverse potential answers and no objective criterion. Large Language Models (LLMs) have been resorted to for NFQA evaluation due to their compelling performance on various NLP tasks. We propose a novel listwise NFQA evaluation approach, that utilizes LLMs to rank candidate answers in a list of reference answers sorted by descending quality.
arXiv Detail & Related papers (2024-09-23T06:42:21Z)
Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model [75.66013048128302]
In this work, we investigate the potential of employing the QE model as the reward model to predict human preferences for feedback training. We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines. To address the problem, we adopt a simple yet effective method that uses rules to detect the incorrect translations and assigns a penalty term to the reward scores of them.
arXiv Detail & Related papers (2024-01-23T16:07:43Z)
Automatic Answerability Evaluation for Question Generation [32.1067137848404]
This work proposes PMAN, a novel automatic evaluation metric to assess whether the generated questions are answerable by the reference answers. Our implementation of a GPT-based QG model achieves state-of-the-art performance in generating answerable questions.
arXiv Detail & Related papers (2023-09-22T00:13:07Z)
SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation) We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
Learning Evaluation Models from Large Language Models for Sequence Generation [61.8421748792555]
We propose a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development. Experimental results on the SummEval benchmark demonstrate that CSEM can effectively train an evaluation model without human-labeled data.
arXiv Detail & Related papers (2023-08-08T16:41:16Z)
Evaluation of Question Generation Needs More References [7.876222232341623]
We propose to paraphrase the reference question for a more robust QG evaluation. Using large language models such as GPT-3, we created semantically and syntactically diverse questions.
arXiv Detail & Related papers (2023-05-26T04:40:56Z)
Learning Answer Generation using Supervision from Automatic Question Answering Evaluators [98.9267570170737]
We propose a novel training paradigm for GenQA using supervision from automatic QA evaluation models (GAVA) We evaluate our proposed methods on two academic and one industrial dataset, obtaining a significant improvement in answering accuracy over the previous state of the art.
arXiv Detail & Related papers (2023-05-24T16:57:04Z)
Improving Visual Question Answering Models through Robustness Analysis and In-Context Learning with a Chain of Basic Questions [70.70725223310401]
This work proposes a new method that utilizes semantically related questions, referred to as basic questions, acting as noise to evaluate the robustness of VQA models. The experimental results demonstrate that the proposed evaluation method effectively analyzes the robustness of VQA models.
arXiv Detail & Related papers (2023-04-06T15:32:35Z)
KPQA: A Metric for Generative Question Answering Using Keyphrase Weights [64.54593491919248]
KPQA-metric is a new metric for evaluating correctness of generative question answering systems. Our new metric assigns different weights to each token via keyphrase prediction. We show that our proposed metric has a significantly higher correlation with human judgments than existing metrics.
arXiv Detail & Related papers (2020-05-01T03:24:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.