Evaluating for Diversity in Question Generation over Text
- URL: http://arxiv.org/abs/2008.07291v1
- Date: Mon, 17 Aug 2020 13:16:12 GMT
- Title: Evaluating for Diversity in Question Generation over Text
- Authors: Michael Sejr Schlichtkrull, Weiwei Cheng
- Abstract summary: We argue that commonly-used evaluation metrics such as BLEU and METEOR are not suitable for this task due to the inherent diversity of reference questions.
We propose a variational encoder-decoder model for this task.
- Score: 5.369031521471668
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating diverse and relevant questions over text is a task with widespread
applications. We argue that commonly-used evaluation metrics such as BLEU and
METEOR are not suitable for this task due to the inherent diversity of
reference questions, and propose a scheme for extending conventional metrics to
reflect diversity. We furthermore propose a variational encoder-decoder model
for this task. We show through automatic and human evaluation that our
variational model improves diversity without loss of quality, and demonstrate
how our evaluation scheme reflects this improvement.
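The paper's concrete scheme for extending BLEU and METEOR is not reproduced here; the sketch below is only a minimal illustration of the general idea, assuming a coverage-style extension in which each reference question is credited with its best-matching generated question. The unigram_f1 stand-in and the example questions are hypothetical, not the authors' metric.

```python
from collections import Counter

def unigram_f1(hyp: str, ref: str) -> float:
    """Token-overlap F1, a cheap stand-in for BLEU/METEOR in this sketch."""
    h, r = Counter(hyp.lower().split()), Counter(ref.lower().split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)

def diversity_aware_score(generated: list[str], references: list[str]) -> float:
    """Credit each reference question with its best-matching generated
    question, then average: k near-duplicate generations can cover at
    most one reference well, so diverse sets score higher."""
    return sum(max(unigram_f1(g, r) for g in generated)
               for r in references) / len(references)

# Hypothetical example: two references, a repetitive set vs. a diverse set.
refs = ["who wrote the novel", "when was the novel first published"]
dup = ["who wrote the book", "who wrote the book"]
div = ["who wrote the book", "when was the book published"]
print(diversity_aware_score(dup, refs))  # lower: only one reference covered
print(diversity_aware_score(div, refs))  # higher: both references covered
```

With BLEU or METEOR substituted for the toy similarity, the same matching step yields a multi-reference score that rewards sets of questions covering distinct references rather than paraphrasing a single one.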
Related papers
- DiverseAgentEntropy: Quantifying Black-Box LLM Uncertainty through Diverse Perspectives and Multi-Agent Interaction [53.803276766404494]
Existing methods, which gauge a model's uncertainty through evaluating self-consistency in responses to the original query, do not always capture true uncertainty.
We propose a novel method, DiverseAgentEntropy, for evaluating a model's uncertainty using multi-agent interaction.
Our method offers a more accurate prediction of the model's reliability and further detects hallucinations, outperforming other self-consistency-based methods.
arXiv Detail & Related papers (2024-12-12T18:52:40Z)
- OLMES: A Standard for Language Model Evaluations [64.85905119836818]
OLMES is a documented, practical, open standard for reproducible language model evaluations.
It supports meaningful comparisons between smaller base models, which require the unnatural "cloze" formulation of multiple-choice questions, and larger models that can use the original formulation.
OLMES includes well-considered, documented recommendations guided by results from existing literature as well as new experiments resolving open questions.
arXiv Detail & Related papers (2024-06-12T17:37:09Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- Improving Diversity of Demographic Representation in Large Language Models via Collective-Critiques and Self-Voting [19.79214899011072]
This paper formalizes diversity of representation in generative large language models.
We present evaluation datasets and propose metrics to measure diversity in generated responses along people and culture axes.
We find that LLMs understand the notion of diversity and can reason about and critique their own responses with respect to that goal.
arXiv Detail & Related papers (2023-10-25T10:17:17Z)
- Diversify Question Generation with Retrieval-Augmented Style Transfer [68.00794669873196]
We propose RAST, a framework for Retrieval-Augmented Style Transfer.
The objective is to utilize the style of diverse templates for question generation.
We develop a novel Reinforcement Learning (RL) based approach that maximizes a weighted combination of diversity reward and consistency reward.
arXiv Detail & Related papers (2023-10-23T02:27:31Z)
- Diversifying Question Generation over Knowledge Base via External Natural Questions [18.382095354733842]
We argue that diverse texts should convey the same semantics through varied expressions.
Current metrics assess this diversity inadequately, since they only compute the ratio of unique n-grams within each generated question.
We devise a new diversity evaluation metric that instead measures the diversity among the top-k questions generated for each instance (a sketch contrasting the two kinds of metric appears after this list).
arXiv Detail & Related papers (2023-09-23T10:37:57Z)
- Measuring and Improving Semantic Diversity of Dialogue Generation [21.59385143783728]
We introduce a new automatic evaluation metric to measure the semantic diversity of generated responses.
We show that our proposed metric captures human judgments on response diversity better than existing lexical-level diversity metrics.
We also propose a simple yet effective learning method that improves the semantic diversity of generated responses.
arXiv Detail & Related papers (2022-10-11T18:36:54Z)
- Learning to Diversify for Product Question Generation [68.69526529887607]
We show how the T5 pre-trained Transformer encoder-decoder model can be fine-tuned for the task.
We propose a novel learning-to-diversify (LTD) fine-tuning approach that enriches the language learned by the underlying Transformer model.
arXiv Detail & Related papers (2022-07-06T09:26:41Z)
- On the Relation between Quality-Diversity Evaluation and Distribution-Fitting Goal in Text Generation [86.11292297348622]
We show that a linear combination of quality and diversity constitutes a divergence metric between the generated distribution and the real distribution.
We propose CR/NRR as a substitute for the quality/diversity metric pair.
arXiv Detail & Related papers (2020-07-03T04:06:59Z)
- Evaluating the Evaluation of Diversity in Natural Language Generation [43.05127848086264]
We propose a framework for evaluating diversity metrics in natural language generation systems.
Our framework can advance the understanding of different diversity metrics, an essential step on the road towards better NLG systems.
arXiv Detail & Related papers (2020-04-06T20:44:10Z)
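For the contrast drawn in the knowledge-base question generation entry above (within-question n-gram ratios versus diversity among the top-k questions per instance), the sketch below gives one hypothetical rendering of the two kinds of metric; it illustrates the distinction only and is not the metric proposed in that paper.

```python
def _ngrams(text: str, n: int) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def distinct_n(question: str, n: int = 2) -> float:
    """Within-question metric: ratio of unique n-grams inside a single
    generated question (the kind criticized in the entry above)."""
    toks = question.lower().split()
    grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(grams)) / max(len(grams), 1)

def topk_diversity(questions: list[str], n: int = 2) -> float:
    """Across-question metric: 1 minus the average n-gram overlap over all
    pairs of the top-k questions generated for the same instance."""
    sets = [_ngrams(q, n) for q in questions]
    overlaps = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            union = sets[i] | sets[j]
            overlaps.append(len(sets[i] & sets[j]) / max(len(union), 1))
    return 1.0 - sum(overlaps) / max(len(overlaps), 1)

# Hypothetical top-3 questions for one instance: each looks fine in isolation
# (high distinct-n), but two of them are near-paraphrases of each other.
top_k = ["who founded the company",
         "who founded the firm",
         "when was the company founded"]
print([round(distinct_n(q), 2) for q in top_k])  # all near 1.0
print(round(topk_diversity(top_k), 2))           # penalizes the paraphrase pair
```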
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.