RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question
- URL: http://arxiv.org/abs/2211.01482v3
- Date: Fri, 26 May 2023 14:28:20 GMT
- Title: RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question
- Authors: Alireza Mohammadshahi, Thomas Scialom, Majid Yazdani, Pouya Yanki, Angela Fan, James Henderson, Marzieh Saeidi
- Abstract summary: We propose a new metric, RQUGE, based on the answerability of the candidate question given the context.
We demonstrate that RQUGE has a higher correlation with human judgment without relying on the reference question.
- Score: 29.18544401904503
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing metrics for evaluating the quality of automatically generated
questions, such as BLEU, ROUGE, BERTScore, and BLEURT, compare the reference and
predicted questions, assigning a high score when there is considerable
lexical overlap or semantic similarity between the candidate and the reference
questions. This approach has two major shortcomings. First, we need expensive
human-provided reference questions. Second, it penalises valid questions that
may not have high lexical or semantic similarity to the reference questions. In
this paper, we propose a new metric, RQUGE, based on the answerability of the
candidate question given the context. The metric consists of a
question-answering module and a span-scorer module, both built on pre-trained
models from the existing literature, so it can be used without further training. We
demonstrate that RQUGE has a higher correlation with human judgment without
relying on the reference question. Additionally, RQUGE is shown to be more
robust to several adversarial corruptions. Furthermore, we illustrate that we
can significantly improve the performance of QA models on out-of-domain
datasets by fine-tuning on synthetic data generated by a question generation
model and re-ranked by RQUGE.
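To make the scoring recipe concrete, below is a minimal sketch of an RQUGE-style pipeline built from off-the-shelf components: a QA model answers the candidate question given the context, and the predicted answer span is compared against the gold span. The checkpoints and the cosine-similarity span scorer are assumptions made for illustration; the paper's metric uses a learned span-scoring model rather than raw embedding similarity.

```python
# Illustrative RQUGE-style scorer: answer the candidate question with a QA
# model, then compare predicted and gold answer spans.
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

# Checkpoints chosen for illustration only (assumption, not the paper's models).
qa_model = pipeline("question-answering", model="deepset/roberta-base-squad2")
span_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def rquge_like_score(context: str, question: str, gold_answer: str) -> float:
    """Higher when a QA model recovers the gold answer from the question."""
    pred = qa_model(question=question, context=context)
    # Stand-in span scorer: cosine similarity between answer embeddings.
    # RQUGE itself uses a learned span-scoring model instead.
    emb = span_encoder.encode([pred["answer"], gold_answer], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

context = "The Eiffel Tower, completed in 1889, stands in Paris, France."
print(rquge_like_score(context, "In which city is the Eiffel Tower?", "Paris"))
```

The same score can serve as a re-ranking signal over synthetic questions before fine-tuning a QA model, in the spirit of the paper's out-of-domain experiments.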
Related papers
- RAG-ConfusionQA: A Benchmark for Evaluating LLMs on Confusing Questions [52.33835101586687]
Conversational AI agents use Retrieval Augmented Generation (RAG) to provide verifiable document-grounded responses to user inquiries.
This paper presents a novel synthetic data generation method to efficiently create a diverse set of context-grounded confusing questions from a given document corpus.
arXiv Detail & Related papers (2024-10-18T16:11:29Z)
- LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs [61.57691505683534]
Non-Factoid (NF) Question Answering (QA) is challenging to evaluate due to its diverse potential answers and the lack of an objective criterion.
Large Language Models (LLMs) have been adopted for NFQA evaluation owing to their strong performance on various NLP tasks.
We propose a novel listwise NFQA evaluation approach that uses LLMs to rank candidate answers within a list of reference answers sorted by descending quality.
arXiv Detail & Related papers (2024-09-23T06:42:21Z)
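A minimal sketch of the listwise idea, assuming a generic prompt-to-text LLM callable: the candidate answer is inserted into a quality-sorted reference list and the model reports its rank. The prompt wording and the `call_llm` stub are placeholders, not the paper's exact setup.

```python
# LINKAGE-style listwise prompt: the LLM places the candidate answer among
# reference answers sorted from best to worst.
def build_listwise_prompt(question: str, references_best_to_worst: list[str],
                          candidate: str) -> str:
    ref_lines = "\n".join(f"{i + 1}. {ref}"
                          for i, ref in enumerate(references_best_to_worst))
    return (
        f"Question: {question}\n"
        f"Reference answers, ordered from best (1) to worst:\n{ref_lines}\n"
        f"Candidate answer: {candidate}\n"
        "At which rank (1 = best) should the candidate be inserted? "
        "Reply with a single integer."
    )

def listwise_rank(call_llm, question, references, candidate) -> int:
    """Lower rank = higher quality; `call_llm` is any prompt -> text function."""
    reply = call_llm(build_listwise_prompt(question, references, candidate))
    return int(reply.strip().split()[0])
```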
- Reference-based Metrics Disprove Themselves in Question Generation [17.83616985138126]
We find that using human-written references cannot guarantee the effectiveness of reference-based metrics.
A good metric is expected to grade a human-validated question no worse than generated questions.
We propose a reference-free metric consisting of multi-dimensional criteria such as naturalness, answerability, and complexity.
arXiv Detail & Related papers (2024-03-18T20:47:10Z)
- SQuArE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- An Empirical Comparison of LM-based Question and Answer Generation Methods [79.31199020420827]
Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context.
In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) fine-tuning.
Experiments show that an end-to-end QAG model, which is computationally light at both training and inference times, is generally robust and outperforms other more convoluted approaches.
arXiv Detail & Related papers (2023-05-26T14:59:53Z)
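The end-to-end setup can be sketched as a single seq2seq call that maps a context directly to question-answer pairs. The checkpoint name (`lmqg/t5-base-squad-qag`, from the authors' released lmqg collection) and the flat output formatting are assumptions worth verifying on the Hub.

```python
# End-to-end QAG in one generation step: context in, question-answer pairs out.
from transformers import pipeline

# Checkpoint name assumed from the lmqg release; verify before use.
qag = pipeline("text2text-generation", model="lmqg/t5-base-squad-qag")

context = ("William Turner was an English painter who specialised in "
           "watercolour landscapes.")
# The model emits all pairs in one flat string; the exact separator format
# depends on how the checkpoint was trained.
print(qag(context, max_length=128)[0]["generated_text"])
```

One generation step per context keeps both training and inference cheap, which is why the paper finds this setup competitive with multi-stage pipelines.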
- QRelScore: Better Evaluating Generated Questions with Deeper Understanding of Context-aware Relevance [54.48031346496593]
We propose QRelScore, a context-aware Relevance evaluation metric for question generation.
Based on off-the-shelf language models such as BERT and GPT2, QRelScore employs both word-level hierarchical matching and sentence-level prompt-based generation.
Compared with existing metrics, our experiments demonstrate that QRelScore is able to achieve a higher correlation with human judgments while being much more robust to adversarial samples.
arXiv Detail & Related papers (2022-04-29T07:39:53Z)
- A Wrong Answer or a Wrong Question? An Intricate Relationship between Question Reformulation and Answer Selection in Conversational Question Answering [15.355557454305776]
We show that question rewriting (QR) of the conversational context allows us to shed more light on this phenomenon.
We present the results of this analysis on the TREC CAsT and QuAC (CANARD) datasets.
arXiv Detail & Related papers (2020-10-13T06:29:51Z)
- KPQA: A Metric for Generative Question Answering Using Keyphrase Weights [64.54593491919248]
KPQA is a new metric for evaluating the correctness of generative question answering systems.
Our new metric assigns different weights to each token via keyphrase prediction.
We show that our proposed metric has a significantly higher correlation with human judgments than existing metrics.
arXiv Detail & Related papers (2020-05-01T03:24:36Z)
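The keyphrase-weighting idea reduces to a weighted token F1. A minimal sketch, assuming a precomputed token-weight dictionary in place of KPQA's learned keyphrase predictor:

```python
# Token-level F1 where each token carries a weight, so salient (keyphrase)
# tokens dominate the score. The `weights` dict is a placeholder; KPQA
# derives these weights from a keyphrase-prediction model.
from collections import Counter

def weighted_token_f1(pred: str, gold: str, weights: dict[str, float]) -> float:
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    total = lambda toks: sum(weights.get(t, 1.0) for t in toks)
    overlap = sum(weights.get(t, 1.0) * c for t, c in common.items())
    if not overlap:
        return 0.0
    precision = overlap / total(pred_tokens)
    recall = overlap / total(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Up-weighting the keyphrase "paris" rewards matching the salient token.
print(weighted_token_f1("the capital is paris", "paris", {"paris": 3.0}))
```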