Automatic Answerability Evaluation for Question Generation
- URL: http://arxiv.org/abs/2309.12546v2
- Date: Mon, 26 Feb 2024 04:39:08 GMT
- Title: Automatic Answerability Evaluation for Question Generation
- Authors: Zifan Wang, Kotaro Funakoshi, Manabu Okumura
- Abstract summary: This work proposes PMAN, a novel automatic evaluation metric to assess whether the generated questions are answerable by the reference answers.
Our implementation of a GPT-based QG model achieves state-of-the-art performance in generating answerable questions.
- Score: 32.1067137848404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventional automatic evaluation metrics, such as BLEU and ROUGE, developed
for natural language generation (NLG) tasks, are based on measuring the n-gram
overlap between the generated and reference text. These simple metrics may be
insufficient for more complex tasks, such as question generation (QG), which
requires generating questions that are answerable by the reference answers.
Developing a more sophisticated automatic evaluation metric, thus, remains an
urgent problem in QG research. This work proposes PMAN (Prompting-based Metric
on ANswerability), a novel automatic evaluation metric to assess whether the
generated questions are answerable by the reference answers for the QG tasks.
Extensive experiments demonstrate that its evaluation results are reliable and
align with human evaluations. We further apply our metric to evaluate the
performance of QG models, which shows that our metric complements conventional
metrics. Our implementation of a GPT-based QG model achieves state-of-the-art
performance in generating answerable questions.
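To make the abstract's idea concrete, here is a minimal sketch of what a prompting-based answerability check of this kind could look like. The prompt wording, the PASS/FAIL convention, the model name, the use of the OpenAI chat API, and the corpus-level averaging are all illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of a PMAN-style answerability check.
# Assumptions: prompt wording, PASS/FAIL convention, model name, and the
# OpenAI client usage are illustrative, not the paper's exact implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {answer}\n"
    "Can the question be answered by the reference answer? "
    "Reply with exactly one word: PASS or FAIL."
)


def answerability_score(question: str, reference_answer: str, model: str = "gpt-4") -> int:
    """Return 1 if the model judges the question answerable by the reference answer, else 0."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(question=question, answer=reference_answer),
        }],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return 1 if verdict.startswith("PASS") else 0


def answerability_rate(pairs):
    """Corpus-level score: fraction of (generated question, reference answer) pairs judged answerable."""
    scores = [answerability_score(q, a) for q, a in pairs]
    return sum(scores) / len(scores)
```

Unlike such a check, an n-gram metric like BLEU or ROUGE rewards a generated question merely for sharing words with the reference question, even when the reference answer does not actually answer it.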
Related papers
- QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation [9.001613702628253]
Human evaluation is widely used in the field of question generation (QG) and serves as the gold standard for automatic metrics.
There is a lack of unified human evaluation criteria, which hampers consistent evaluations of both QG models and automatic metrics.
We propose QGEval, a multi-dimensional Evaluation benchmark for Question Generation, which evaluates both generated questions and existing automatic metrics across 7 dimensions.
arXiv Detail & Related papers (2024-06-09T09:51:55Z)
- PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-form text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
- SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose the devised instruction-style question about the quality of generated texts into subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
- Evaluation of Question Generation Needs More References [7.876222232341623]
We propose to paraphrase the reference question for a more robust QG evaluation.
Using large language models such as GPT-3, we created semantically and syntactically diverse questions.
arXiv Detail & Related papers (2023-05-26T04:40:56Z)
- QAScore -- An Unsupervised Unreferenced Metric for the Question Generation Evaluation [6.697751970080859]
Question Generation (QG) aims to automate the task of composing questions for a passage with a set of chosen answers.
We propose a new reference-free evaluation metric that has the potential to provide a better mechanism for evaluating QG systems, called QAScore.
arXiv Detail & Related papers (2022-10-09T19:00:39Z)
- Quiz Design Task: Helping Teachers Create Quizzes with Automated Question Generation [87.34509878569916]
This paper focuses on the use case of helping teachers automate the generation of reading comprehension quizzes.
In our study, teachers building a quiz receive question suggestions, which they can either accept or refuse with a reason.
arXiv Detail & Related papers (2022-05-03T18:59:03Z)
- KPQA: A Metric for Generative Question Answering Using Keyphrase Weights [64.54593491919248]
KPQA-metric is a new metric for evaluating the correctness of generative question answering systems.
Our new metric assigns different weights to each token via keyphrase prediction.
We show that our proposed metric has a significantly higher correlation with human judgments than existing metrics.
arXiv Detail & Related papers (2020-05-01T03:24:36Z)
- Towards Automatic Generation of Questions from Long Answers [11.198653485869935]
We propose a novel evaluation benchmark to assess the performance of existing AQG systems for long-text answers.
We empirically demonstrate that the performance of existing AQG methods significantly degrades as the length of the answer increases.
Transformer-based methods outperform other existing AQG methods on long answers in terms of automatic as well as human evaluation.
arXiv Detail & Related papers (2020-04-10T16:45:08Z)
- Asking and Answering Questions to Evaluate the Factual Consistency of Summaries [80.65186293015135]
We propose an automatic evaluation protocol called QAGS (pronounced "kags") to identify factual inconsistencies in a generated summary.
QAGS is based on the intuition that if we ask questions about a summary and its source, we will receive similar answers if the summary is factually consistent with the source.
We believe QAGS is a promising tool in automatically generating usable and factually consistent text.
arXiv Detail & Related papers (2020-04-08T20:01:09Z)
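The QAGS summary above describes a concrete protocol (ask the same questions of the summary and the source, then compare the answers), so a minimal sketch may help. It assumes an off-the-shelf Hugging Face `transformers` question-answering pipeline, the `distilbert-base-cased-distilled-squad` checkpoint, and token-level F1 as the answer-similarity measure; all of these are illustrative stand-ins, not the original QAGS components.

```python
# QAGS-style sketch: ask the same questions of the source and the summary,
# then measure how much the answers agree. Models and the token-F1 similarity
# are illustrative stand-ins, not the original QAGS implementation.
from collections import Counter

from transformers import pipeline

# Off-the-shelf extractive QA model used as a stand-in for the QA component.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")


def token_f1(a: str, b: str) -> float:
    """Token-level F1 between two answer strings (illustrative similarity measure)."""
    a_toks, b_toks = a.lower().split(), b.lower().split()
    overlap = sum((Counter(a_toks) & Counter(b_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a_toks), overlap / len(b_toks)
    return 2 * precision * recall / (precision + recall)


def qags_like_score(source: str, summary: str, questions: list[str]) -> float:
    """Ask the same questions of the source and the summary; return average answer agreement.

    In QAGS the questions are generated automatically from the summary; here they
    are passed in explicitly to keep the sketch short.
    """
    agreements = []
    for q in questions:
        answer_from_source = qa(question=q, context=source)["answer"]
        answer_from_summary = qa(question=q, context=summary)["answer"]
        agreements.append(token_f1(answer_from_source, answer_from_summary))
    return sum(agreements) / len(agreements)
```

A high score indicates the summary and the source give similar answers to the same questions, which is the intuition behind treating answer agreement as a proxy for factual consistency.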
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.