Evaluating for Diversity in Question Generation over Text
- URL: http://arxiv.org/abs/2008.07291v1
- Date: Mon, 17 Aug 2020 13:16:12 GMT
- Title: Evaluating for Diversity in Question Generation over Text
- Authors: Michael Sejr Schlichtkrull, Weiwei Cheng
- Abstract summary: We argue that commonly-used evaluation metrics such as BLEU and METEOR are not suitable for this task due to the inherent diversity of reference questions.
We propose a variational encoder-decoder model for this task.
- Score: 5.369031521471668
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating diverse and relevant questions over text is a task with widespread
applications. We argue that commonly-used evaluation metrics such as BLEU and
METEOR are not suitable for this task due to the inherent diversity of
reference questions, and propose a scheme for extending conventional metrics to
reflect diversity. We furthermore propose a variational encoder-decoder model
for this task. We show through automatic and human evaluation that our
variational model improves diversity without loss of quality, and demonstrate
how our evaluation scheme reflects this improvement.
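The abstract does not spell out the extension scheme, but one common way to make a reference-based metric diversity-aware is to score a set of generated questions against a set of references with a one-to-one matching, so that repeating the same question cannot reuse a reference. The sketch below illustrates that idea with sentence-level BLEU and a greedy assignment; the matching rule and function names are illustrative assumptions, not the authors' actual scheme.

```python
# A minimal sketch of a diversity-aware extension of BLEU (illustration only;
# the one-to-one matching scheme is an assumption, not the paper's proposal).
from itertools import product
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def diversity_aware_bleu(generated, references):
    """Greedily match each generated question to a distinct reference and
    average the matched sentence-level BLEU scores. Because a reference can
    be used at most once, near-duplicate generations lower the score."""
    smooth = SmoothingFunction().method1
    pair_scores = sorted(
        ((sentence_bleu([ref.split()], gen.split(), smoothing_function=smooth), i, j)
         for (i, gen), (j, ref) in product(enumerate(generated), enumerate(references))),
        reverse=True,
    )
    used_gen, used_ref, matched = set(), set(), []
    for score, i, j in pair_scores:
        if i not in used_gen and j not in used_ref:
            used_gen.add(i)
            used_ref.add(j)
            matched.append(score)
    # Unmatched generations (e.g. duplicates beyond the reference count) score 0.
    return sum(matched) / max(len(generated), 1)

print(diversity_aware_bleu(
    ["what year was the bridge built ?", "who designed the bridge ?"],
    ["when was the bridge constructed ?", "who was the architect of the bridge ?"],
))
```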
Related papers
- OLMES: A Standard for Language Model Evaluations [64.85905119836818]
We propose OLMES, a practical, open standard for reproducible language model evaluations.
We identify and review the varying factors in evaluation practices adopted by the community.
OLMES supports meaningful comparisons involving smaller base models that require the unnatural "cloze" formulation of multiple-choice questions.
arXiv Detail & Related papers (2024-06-12T17:37:09Z) - Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
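The abstract does not detail the construction, but distribution-level precision and recall for generative models are typically estimated from sample embeddings: precision asks how many generated samples fall near the real-data manifold (quality), while recall asks how much of the real data is covered by generations (diversity). A hedged k-nearest-neighbour sketch in that spirit, with the radius construction and k=3 chosen purely for illustration:

```python
# Sketch of embedding-based precision/recall via k-NN support estimation;
# the k-NN radius construction and k=3 are illustrative assumptions.
import numpy as np

def knn_radii(x, k=3):
    """Distance from each point to its k-th nearest neighbour within x."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]  # column 0 is the point itself (distance 0)

def precision_recall(real_emb, gen_emb, k=3):
    real_r, gen_r = knn_radii(real_emb, k), knn_radii(gen_emb, k)
    d = np.linalg.norm(gen_emb[:, None, :] - real_emb[None, :, :], axis=-1)
    # Precision: fraction of generated points inside some real point's k-NN ball.
    precision = np.mean((d <= real_r[None, :]).any(axis=1))
    # Recall: fraction of real points inside some generated point's k-NN ball.
    recall = np.mean((d.T <= gen_r[None, :]).any(axis=1))
    return precision, recall

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 16))           # stand-in for embeddings of human text
gen = rng.normal(loc=0.5, size=(200, 16))   # stand-in for embeddings of model text
print(precision_recall(real, gen))
```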
arXiv Detail & Related papers (2024-02-16T13:53:26Z) - Improving Diversity of Demographic Representation in Large Language Models via Collective-Critiques and Self-Voting [19.79214899011072]
This paper formalizes diversity of representation in generative large language models.
We present evaluation datasets and propose metrics to measure diversity in generated responses along people and culture axes.
We find that LLMs understand the notion of diversity, and that they can reason and critique their own responses for that goal.
arXiv Detail & Related papers (2023-10-25T10:17:17Z) - Diversify Question Generation with Retrieval-Augmented Style Transfer [68.00794669873196]
We propose RAST, a framework for Retrieval-Augmented Style Transfer.
The objective is to utilize the style of diverse templates for question generation.
We develop a novel Reinforcement Learning (RL) based approach that maximizes a weighted combination of diversity reward and consistency reward.
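The abstract only names a weighted combination of a diversity reward and a consistency reward; the sketch below shows how such a scalar reward might be composed. The component rewards and the weight are placeholders, not RAST's actual definitions.

```python
# Illustrative reward composition for RL-based diverse question generation.
# Both component rewards and the weight `beta` are placeholder assumptions,
# not the reward functions actually used in RAST.

def distinct_ngrams(text, n=2):
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def consistency(question, answer_span, qa_confidence):
    """Stand-in: e.g. the confidence of an external QA model recovering the answer span."""
    return qa_confidence

def reward(question, answer_span, qa_confidence, beta=0.5):
    # Weighted combination of a diversity term and a consistency term.
    return beta * distinct_ngrams(question) + (1 - beta) * consistency(
        question, answer_span, qa_confidence)

print(reward("who designed the golden gate bridge ?", "Joseph Strauss", 0.87))
```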
arXiv Detail & Related papers (2023-10-23T02:27:31Z) - Diversifying Question Generation over Knowledge Base via External Natural Questions [18.382095354733842]
We argue that diverse texts should convey the same semantics through varied expressions.
Current metrics inadequately assess the above diversity since they calculate the ratio of unique n-grams in the generated question itself.
We devise a new diversity evaluation metric, which measures the diversity among top-k generated questions for each instance.
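A metric of this kind can be sketched as an average pairwise dissimilarity over the top-k questions generated for a single instance; the self-BLEU-style measure below is an illustrative stand-in, not the metric actually proposed in the paper.

```python
# Sketch: diversity among the top-k questions generated for one instance,
# computed as one minus the average pairwise BLEU (an illustrative stand-in).
from itertools import combinations
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def topk_diversity(questions):
    smooth = SmoothingFunction().method1
    pairs = list(combinations(questions, 2))
    if not pairs:
        return 0.0
    sims = [sentence_bleu([a.split()], b.split(), smoothing_function=smooth)
            for a, b in pairs]
    return 1.0 - sum(sims) / len(sims)  # higher = more diverse

print(topk_diversity([
    "what river flows through paris ?",
    "which river runs through paris ?",
    "who founded the city of paris ?",
]))
```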
arXiv Detail & Related papers (2023-09-23T10:37:57Z) - Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
Human annotators can assess a document summary's quality along several criteria: objective ones such as grammar and correctness, and subjective ones such as informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture these dimensions.
We propose a new framework based on LLMs, which provides a comprehensive evaluation by comparing generated text and reference text from both objective and subjective aspects.
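The abstract frames this as LLM-based comparison of a candidate summary against a reference along objective and subjective dimensions; the prompt template and dimension names below are illustrative assumptions about how such a judge might be queried, not the paper's actual prompts.

```python
# Sketch of an LLM-as-judge query for summary evaluation. The dimensions,
# scale, and prompt wording are illustrative assumptions, not the paper's setup.
DIMENSIONS = ["grammar", "correctness", "informativeness", "succinctness"]

def build_judge_prompt(source, reference, candidate):
    rubric = "\n".join(f"- {d}: score 1-5 with one sentence of justification"
                       for d in DIMENSIONS)
    return (
        "You are evaluating a document summary.\n\n"
        f"Document:\n{source}\n\n"
        f"Reference summary:\n{reference}\n\n"
        f"Candidate summary:\n{candidate}\n\n"
        "Compare the candidate against the reference and rate it on:\n"
        f"{rubric}\n"
        "Return a JSON object mapping each dimension to {\"score\": int, \"reason\": str}."
    )

prompt = build_judge_prompt("<document text>", "<reference summary>", "<candidate summary>")
# `prompt` would then be sent to an LLM API of choice; response parsing is omitted.
```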
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - Measuring and Improving Semantic Diversity of Dialogue Generation [21.59385143783728]
We introduce a new automatic evaluation metric to measure the semantic diversity of generated responses.
We show that our proposed metric captures human judgments on response diversity better than existing lexical-level diversity metrics.
We also propose a simple yet effective learning method that improves the semantic diversity of generated responses.
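Lexical metrics such as distinct-n only see surface n-grams; a semantic diversity score instead compares responses in an embedding space. A minimal sketch using sentence-transformers follows; the model name and the mean-pairwise-cosine-distance formulation are illustrative choices, not the paper's metric.

```python
# Sketch of a semantic diversity score: mean pairwise cosine distance between
# sentence embeddings of generated responses. The embedding model is an assumption.
from itertools import combinations
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_diversity(responses):
    emb = model.encode(responses, normalize_embeddings=True)
    dists = [1.0 - float(np.dot(emb[i], emb[j]))
             for i, j in combinations(range(len(responses)), 2)]
    return sum(dists) / len(dists) if dists else 0.0

print(semantic_diversity([
    "i love hiking on weekends .",
    "hiking is my favourite weekend activity .",
    "what did you have for dinner ?",
]))
```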
arXiv Detail & Related papers (2022-10-11T18:36:54Z) - Learning to Diversify for Product Question Generation [68.69526529887607]
We show how the T5 pre-trained Transformer encoder-decoder model can be fine-tuned for the task.
We propose a novel learning-to-diversify (LTD) fine-tuning approach that enriches the language learned by the underlying Transformer model.
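The snippet below only sketches the plain T5 fine-tuning step with Hugging Face Transformers; the "generate question:" input prefix is an illustrative convention, and the LTD objective from the paper is not shown.

```python
# Minimal sketch of fine-tuning T5 for question generation with Hugging Face
# Transformers. The learning-to-diversify (LTD) loss is not included here.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

context = "generate question: This stainless steel kettle boils 1.7 litres in under 3 minutes."
target = "how long does the kettle take to boil a full load ?"

inputs = tokenizer(context, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

model.train()
loss = model(**inputs, labels=labels).loss  # standard cross-entropy fine-tuning loss
loss.backward()
optimizer.step()

# At inference time, diverse questions could be sampled, e.g. with nucleus sampling:
model.eval()
outputs = model.generate(**inputs, do_sample=True, top_p=0.9, num_return_sequences=3)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```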
arXiv Detail & Related papers (2022-07-06T09:26:41Z) - On the Relation between Quality-Diversity Evaluation and Distribution-Fitting Goal in Text Generation [86.11292297348622]
We show that a linear combination of quality and diversity constitutes a divergence metric between the generated distribution and the real distribution.
We propose CR/NRR as a substitute for quality/diversity metric pair.
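Schematically, the claim can be written as follows, where P_r is the real (human) text distribution, P_g the model's distribution, Q a quality metric, and D a diversity metric; the deficit form and the weight lambda are only one illustrative reading, and the exact formulation (including CR/NRR) is given in the paper.

```latex
% Schematic statement (symbols illustrative): a weighted sum of a quality
% deficit and a diversity deficit behaves as a divergence between P_g and P_r,
% i.e. it is non-negative and vanishes when the two distributions coincide.
\[
  \mathcal{D}\bigl(P_r \,\|\, P_g\bigr)
  \;=\;
  \lambda \,\bigl(Q_{\max} - Q(P_g)\bigr)
  \;+\;
  (1-\lambda)\,\bigl(D_{\max} - D(P_g)\bigr),
  \qquad 0 < \lambda < 1 .
\]
```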
arXiv Detail & Related papers (2020-07-03T04:06:59Z) - Evaluating the Evaluation of Diversity in Natural Language Generation [43.05127848086264]
We propose a framework for evaluating diversity metrics in natural language generation systems.
Our framework can advance the understanding of different diversity metrics, an essential step on the road towards better NLG systems.
arXiv Detail & Related papers (2020-04-06T20:44:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.