QRelScore: Better Evaluating Generated Questions with Deeper
Understanding of Context-aware Relevance
- URL: http://arxiv.org/abs/2204.13921v1
- Date: Fri, 29 Apr 2022 07:39:53 GMT
- Title: QRelScore: Better Evaluating Generated Questions with Deeper
Understanding of Context-aware Relevance
- Authors: Xiaoqiang Wang, Bang Liu, Siliang Tang, Lingfei Wu
- Abstract summary: We propose $\textbf{QRelScore}$, a context-aware $\underline{\textbf{Rel}}$evance evaluation metric for $\underline{\textbf{Q}}$uestion Generation.
Based on off-the-shelf language models such as BERT and GPT2, QRelScore employs both word-level hierarchical matching and sentence-level prompt-based generation.
Compared with existing metrics, our experiments demonstrate that QRelScore is able to achieve a higher correlation with human judgments while being much more robust to adversarial samples.
- Score: 54.48031346496593
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing metrics for assessing question generation not only require costly
human references but also fail to take into account the input context of
generation, and thus lack a deep understanding of the relevance between the
generated questions and input contexts. As a result, they may wrongly
penalize a legitimate and reasonable candidate question when it (i) involves
complicated reasoning with the context or (ii) can be grounded by multiple
pieces of evidence in the context. In this paper, we propose $\textbf{QRelScore}$, a
context-aware $\underline{\textbf{Rel}}$evance evaluation metric for
$\underline{\textbf{Q}}$uestion Generation. Based on off-the-shelf language
models such as BERT and GPT2, QRelScore employs both word-level hierarchical
matching and sentence-level prompt-based generation to cope with the
complicated reasoning and diverse generation from multiple pieces of evidence,
respectively. Compared with existing metrics, our experiments demonstrate that
QRelScore is able to achieve a higher correlation with human judgments while
being much more robust to adversarial samples.
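To make the two components named in the abstract concrete, the following is a minimal, illustrative sketch: (1) word-level matching of question tokens against context tokens using BERT hidden states, and (2) sentence-level scoring of the question's likelihood under GPT-2 when conditioned on the context through a prompt. The model checkpoints, the prompt template, and the way the two scores are reported side by side are assumptions made for illustration, not QRelScore's exact formulation.

```python
# Illustrative sketch only: not the paper's exact algorithm.
import torch
from transformers import AutoModel, AutoTokenizer, GPT2LMHeadModel, GPT2Tokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()
gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()


def word_level_match(context: str, question: str) -> float:
    """Greedy cosine matching: for each question token, take its best-matching
    context token and average the similarities (a BERTScore-style recall)."""
    with torch.no_grad():
        ctx = bert(**bert_tok(context, return_tensors="pt", truncation=True)).last_hidden_state[0]
        qst = bert(**bert_tok(question, return_tensors="pt", truncation=True)).last_hidden_state[0]
    ctx = torch.nn.functional.normalize(ctx, dim=-1)
    qst = torch.nn.functional.normalize(qst, dim=-1)
    sim = qst @ ctx.T                               # [q_len, c_len] cosine similarities
    return sim.max(dim=1).values.mean().item()


def prompt_based_score(context: str, question: str) -> float:
    """Average log-likelihood of the question tokens under GPT-2, conditioned
    on the context via a simple prompt (the template is an assumption)."""
    prompt_ids = gpt2_tok(f"{context} Question:", return_tensors="pt").input_ids
    question_ids = gpt2_tok(" " + question, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, question_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.size(1)] = -100          # score only the question tokens
    with torch.no_grad():
        loss = gpt2(input_ids, labels=labels).loss  # mean negative log-likelihood
    return -loss.item()


if __name__ == "__main__":
    context = "Marie Curie won the Nobel Prize in Physics in 1903 for her research on radiation."
    question = "Why did Marie Curie win the Nobel Prize in Physics?"
    print("word-level match:", word_level_match(context, question))
    print("prompt-based log-likelihood:", prompt_based_score(context, question))
```

A reference-free relevance metric of this kind needs no gold question: both scores are computed directly from the candidate question and its source context, which is what allows the metric to credit questions grounded by reasoning over, or multiple parts of, the context.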
Related papers
- RAG-ConfusionQA: A Benchmark for Evaluating LLMs on Confusing Questions [52.33835101586687]
Conversational AI agents use Retrieval Augmented Generation (RAG) to provide verifiable document-grounded responses to user inquiries.
This paper presents a novel synthetic data generation method to efficiently create a diverse set of context-grounded confusing questions from a given document corpus.
arXiv Detail & Related papers (2024-10-18T16:11:29Z)
- QUDSELECT: Selective Decoding for Questions Under Discussion Parsing [90.92351108691014]
Question Under Discussion (QUD) is a discourse framework that uses implicit questions to reveal discourse relationships between sentences.
We introduce QUDSELECT, a joint-training framework that selectively decodes the QUD dependency structures considering the QUD criteria.
Our method outperforms the state-of-the-art baseline models by 9% in human evaluation and 4% in automatic evaluation.
arXiv Detail & Related papers (2024-08-02T06:46:08Z)
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
- SQuArE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- SkillQG: Learning to Generate Question for Reading Comprehension Assessment [54.48031346496593]
We present a question generation framework with controllable comprehension types for assessing and improving machine reading comprehension models.
We first frame the comprehension type of questions based on a hierarchical skill-based schema, then formulate $\texttt{SkillQG}$ as a skill-conditioned question generator.
Empirical results demonstrate that $\texttt{SkillQG}$ outperforms baselines in terms of quality, relevance, and skill-controllability.
arXiv Detail & Related papers (2023-05-08T14:40:48Z)
- RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question [29.18544401904503]
We propose a new metric, RQUGE, based on the answerability of the candidate question given the context.
We demonstrate that RQUGE has a higher correlation with human judgment without relying on the reference question.
arXiv Detail & Related papers (2022-11-02T21:10:09Z)
- Revisiting the Evaluation Metrics of Paraphrase Generation [35.6803390044542]
Most existing paraphrase generation models use reference-based metrics to evaluate their generated paraphrases.
This paper proposes BBScore, a reference-free metric that can reflect the generated paraphrase's quality.
arXiv Detail & Related papers (2022-02-17T07:18:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.