Evaluation of Question Generation Needs More References
- URL: http://arxiv.org/abs/2305.16626v1
- Date: Fri, 26 May 2023 04:40:56 GMT
- Title: Evaluation of Question Generation Needs More References
- Authors: Shinhyeok Oh, Hyojun Go, Hyeongdon Moon, Yunsung Lee, Myeongho Jeong,
Hyun Seung Lee and Seungtaek Choi
- Abstract summary: We propose to paraphrase the reference question for a more robust QG evaluation.
Using large language models such as GPT-3, we created semantically and syntactically diverse questions.
- Score: 7.876222232341623
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Question generation (QG) is the task of generating a valid and fluent
question based on a given context and the target answer. Depending on their
purpose, instructors can ask questions about different concepts even for the
same context, and even the same concept can be phrased in different ways.
However, QG evaluation usually depends on single-reference similarity metrics,
such as n-gram-based or learned metrics, which are not sufficient to fully
evaluate the potential of QG methods. To this end, we propose to paraphrase the
reference question for a more robust QG evaluation. Using large language models
such as GPT-3, we create semantically and syntactically diverse questions and
then adopt a simple aggregation of popular evaluation metrics as the final
score. Through our experiments, we found that using multiple (pseudo)
references is more effective for QG evaluation, showing a higher correlation
with human evaluations than evaluation with a single reference.
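A minimal sketch of the multi-reference scoring described in the abstract, assuming NLTK is installed. Each generated question is scored against the human reference plus LLM-generated paraphrases, and the best per-reference score is kept; the paper aggregates several popular metrics and does not prescribe this exact aggregation, only BLEU is shown here, and `paraphrase_with_llm` is a hypothetical helper standing in for the GPT-3 paraphrasing step.

```python
# Sketch: multi-(pseudo-)reference scoring for QG evaluation.
# Only sentence-level BLEU is shown; the paper aggregates several
# popular metrics over the same reference set.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def multi_reference_bleu(candidate: str, references: list[str]) -> float:
    """Score a generated question against every (pseudo-)reference and
    keep the best match, so an acceptable paraphrase still earns credit."""
    cand_tokens = candidate.lower().split()
    scores = [
        sentence_bleu([ref.lower().split()], cand_tokens, smoothing_function=smooth)
        for ref in references
    ]
    return max(scores)

# Usage (paraphrase_with_llm is a hypothetical GPT-3 paraphrasing helper):
# references = [human_reference] + paraphrase_with_llm(human_reference, n=5)
# score = multi_reference_bleu(generated_question, references)
```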
Related papers
- LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs [61.57691505683534]
Non-Factoid (NF) Question Answering (QA) is challenging to evaluate because it admits diverse potential answers and lacks an objective criterion.
Large Language Models (LLMs) have therefore been adopted for NFQA evaluation, given their compelling performance on various NLP tasks.
We propose a novel listwise NFQA evaluation approach that utilizes LLMs to rank candidate answers within a list of reference answers sorted by descending quality.
arXiv Detail & Related papers (2024-09-23T06:42:21Z)
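A rough sketch of the listwise idea summarized in the LINKAGE entry above, assuming the `openai` Python client and an API key are available; the prompt wording, the model name `gpt-4o-mini`, and the single-number reply format are illustrative placeholders, not the paper's protocol.

```python
# Sketch: ask an LLM where a candidate answer falls within a reference
# list sorted from best to worst quality. Prompt and model are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def listwise_rank(question: str, candidate: str, refs_best_to_worst: list[str]) -> int:
    ref_lines = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(refs_best_to_worst))
    prompt = (
        f"Question: {question}\n"
        f"Reference answers, best to worst:\n{ref_lines}\n"
        f"Candidate answer: {candidate}\n"
        "Return only the number of references the candidate is worse than "
        "(0 if it is better than all of them)."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return int(resp.choices[0].message.content.strip())
```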
- Accurate and Nuanced Open-QA Evaluation Through Textual Entailment [4.762213968673381]
We propose to study the entailment relations of answers to identify more informative and more general system answers.
The entailment-based evaluation we propose allows the assignment of bonus or partial marks by quantifying the inference gap between answers.
arXiv Detail & Related papers (2024-05-26T21:33:27Z)
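A minimal sketch of the entailment-based comparison described in the entry above, using an off-the-shelf NLI checkpoint from Hugging Face `transformers`; the bidirectional partial-credit rule below is an illustration of "quantifying the inference gap", not the paper's exact marking scheme.

```python
# Sketch: entailment-based comparison of a system answer with a reference.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NLI = "roberta-large-mnli"  # any NLI checkpoint works; label order is read from its config
tok = AutoTokenizer.from_pretrained(NLI)
model = AutoModelForSequenceClassification.from_pretrained(NLI)

def entail_prob(premise: str, hypothesis: str) -> float:
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(-1)[0]
    ent_idx = next(i for i, lab in model.config.id2label.items()
                   if "entail" in lab.lower())
    return probs[ent_idx].item()

def answer_score(candidate: str, reference: str) -> float:
    forward = entail_prob(candidate, reference)   # candidate covers the reference
    backward = entail_prob(reference, candidate)  # candidate adds nothing beyond it
    # full marks if entailment holds both ways, partial marks otherwise (illustrative rule)
    return 0.5 * (forward + backward)
```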
- Reference-based Metrics Disprove Themselves in Question Generation [17.83616985138126]
We find that using human-written references cannot guarantee the effectiveness of reference-based metrics.
A good metric is expected to grade a human-validated question no worse than generated questions.
We propose a reference-free metric consisting of multi-dimensional criteria such as naturalness, answerability, and complexity.
arXiv Detail & Related papers (2024-03-18T20:47:10Z)
- Automatic Answerability Evaluation for Question Generation [32.1067137848404]
This work proposes PMAN, a novel automatic evaluation metric to assess whether the generated questions are answerable by the reference answers.
Our implementation of a GPT-based QG model achieves state-of-the-art performance in generating answerable questions.
arXiv Detail & Related papers (2023-09-22T00:13:07Z)
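The entry above proposes PMAN for checking whether a generated question is answerable by the reference answer. The sketch below is a simpler proxy for that check under stated assumptions, not the PMAN procedure itself: answer the generated question over the context with an extractive QA pipeline from `transformers` and compare the extracted span with the reference answer.

```python
# Sketch: a rough answerability check for generated questions.
# This is an extractive-QA proxy, not the metric proposed in the paper.
from transformers import pipeline

qa = pipeline("question-answering")  # default extractive QA checkpoint

def answerable(question: str, context: str, reference_answer: str,
               thresh: float = 0.5) -> bool:
    pred = qa(question=question, context=context)
    pred_tokens = set(pred["answer"].lower().split())
    ref_tokens = set(reference_answer.lower().split())
    common = pred_tokens & ref_tokens
    if not common:
        return False
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    # the question counts as answerable if the QA span matches the reference well enough
    return f1 >= thresh
```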
- SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- Towards a Unified Multi-Dimensional Evaluator for Text Generation [101.47008809623202]
We propose UniEval, a unified multi-dimensional evaluator for Natural Language Generation (NLG).
We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions.
Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics.
arXiv Detail & Related papers (2022-10-13T17:17:03Z)
- QAScore -- An Unsupervised Unreferenced Metric for the Question Generation Evaluation [6.697751970080859]
Question Generation (QG) aims to automate the task of composing questions for a passage with a set of chosen answers.
We propose a new reference-free evaluation metric that has the potential to provide a better mechanism for evaluating QG systems, called QAScore.
arXiv Detail & Related papers (2022-10-09T19:00:39Z)
- Generative Language Models for Paragraph-Level Question Generation [79.31199020420827]
Powerful generative models have led to recent progress in question generation (QG).
It is difficult to measure advances in QG research since there are no standardized resources that allow a uniform comparison among approaches.
We introduce QG-Bench, a benchmark for QG that unifies existing question answering datasets by converting them to a standard QG setting.
arXiv Detail & Related papers (2022-10-08T10:24:39Z)
- On the Evaluation of Answer-Agnostic Paragraph-level Multi-Question Generation [57.630606799713526]
We study the task of predicting a set of salient questions from a given paragraph without any prior knowledge of the precise answer.
First, we propose a new method to evaluate a set of predicted questions against the set of references by using the Hungarian algorithm to assign predicted questions to references before scoring the assigned pairs.
Second, we compare different strategies to utilize a pre-trained seq2seq model to generate and select a set of questions related to a given paragraph.
arXiv Detail & Related papers (2022-03-09T00:55:54Z)
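A minimal sketch of the set-level evaluation described in the entry above: build a similarity matrix between predicted and reference questions, assign pairs with the Hungarian algorithm via `scipy.optimize.linear_sum_assignment`, and average the matched scores. The unigram-F1 similarity is a stand-in assumption, not necessarily the pairwise metric used in the paper.

```python
# Sketch: assign predicted questions to reference questions with the
# Hungarian algorithm, then score only the matched pairs.
import numpy as np
from scipy.optimize import linear_sum_assignment

def unigram_f1(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    common = len(ta & tb)
    if common == 0:
        return 0.0
    p, r = common / len(ta), common / len(tb)
    return 2 * p * r / (p + r)

def question_set_score(predicted: list[str], references: list[str]) -> float:
    sim = np.array([[unigram_f1(p, r) for r in references] for p in predicted])
    rows, cols = linear_sum_assignment(-sim)  # scipy minimizes cost, so negate similarity
    return float(sim[rows, cols].mean())
```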
- KPQA: A Metric for Generative Question Answering Using Keyphrase Weights [64.54593491919248]
KPQA-metric is a new metric for evaluating the correctness of generative question answering systems.
Our new metric assigns different weights to each token via keyphrase prediction.
We show that our proposed metric has a significantly higher correlation with human judgments than existing metrics.
arXiv Detail & Related papers (2020-05-01T03:24:36Z)
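A small sketch of the keyphrase-weighting idea behind the KPQA entry above: token-level precision and recall where each token's contribution is scaled by an importance weight. The actual metric derives weights via keyphrase prediction; the `weights` dictionary here is a hypothetical stand-in for that model's output.

```python
# Sketch: keyphrase-weighted token F1 between a generated answer and a reference.
# The `weights` dict stands in for a trained keyphrase predictor's per-token scores.
def weighted_f1(candidate: str, reference: str, weights: dict[str, float],
                default_weight: float = 0.1) -> float:
    cand = candidate.lower().split()
    ref = reference.lower().split()
    w = lambda tok: weights.get(tok, default_weight)  # low default weight for non-keyphrase tokens
    common = set(cand) & set(ref)
    precision = sum(w(t) for t in cand if t in common) / max(sum(w(t) for t in cand), 1e-8)
    recall = sum(w(t) for t in ref if t in common) / max(sum(w(t) for t in ref), 1e-8)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# Example: content tokens get high weight, everything else the default.
# weighted_f1("the capital is paris", "paris is the capital of france",
#             weights={"paris": 1.0, "capital": 0.8})
```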