Evaluating the Evaluation of Diversity in Commonsense Generation
- URL: http://arxiv.org/abs/2506.00514v1
- Date: Sat, 31 May 2025 11:18:26 GMT
- Title: Evaluating the Evaluation of Diversity in Commonsense Generation
- Authors: Tianhui Zhang, Bei Peng, Danushka Bollegala
- Abstract summary: We conduct a systematic meta-evaluation of diversity metrics for commonsense generation. We find that form-based diversity metrics tend to consistently overestimate the diversity in sentence sets. We show that content-based diversity evaluation metrics consistently outperform the form-based counterparts.
- Score: 28.654890118684957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In commonsense generation, given a set of input concepts, a model must generate a response that is not only commonsense-bearing but also captures multiple diverse viewpoints. Numerous evaluation metrics based on form- and content-level overlap have been proposed in prior work for evaluating the diversity of a commonsense generation model. However, it remains unclear which metrics are best suited for evaluating diversity in commonsense generation. To address this gap, we conduct a systematic meta-evaluation of diversity metrics for commonsense generation. We find that form-based diversity metrics tend to consistently overestimate the diversity in sentence sets, where even randomly generated sentences are assigned overly high diversity scores. We then use a Large Language Model (LLM) to create a novel dataset annotated for the diversity of sentences generated for a commonsense generation task, and use it to conduct a meta-evaluation of the existing diversity evaluation metrics. Our experimental results show that content-based diversity evaluation metrics consistently outperform the form-based counterparts, showing high correlations with the LLM-based ratings. We recommend that future work on commonsense generation should use content-based metrics for evaluating the diversity of their outputs.
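As a concrete illustration of the form- vs. content-based distinction drawn in the abstract, the sketch below computes a simple form-based diversity score (distinct token n-grams) and a simple content-based one (mean pairwise cosine distance between sentence embeddings). The specific metrics, the sentence-transformers model name, and the aggregation are illustrative assumptions, not the exact metrics studied in the paper.

```python
# Illustrative only: one form-based and one content-based diversity score.
# Assumes `pip install sentence-transformers`; the model name is just an example.
from itertools import combinations

import numpy as np
from sentence_transformers import SentenceTransformer


def distinct_n(sentences, n=2):
    """Form-based: fraction of unique token n-grams across the sentence set."""
    ngrams = []
    for s in sentences:
        tokens = s.lower().split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)


def embedding_diversity(sentences, model_name="all-MiniLM-L6-v2"):
    """Content-based: mean pairwise cosine distance between sentence embeddings."""
    model = SentenceTransformer(model_name)
    emb = model.encode(sentences, normalize_embeddings=True)
    dists = [1.0 - float(np.dot(emb[i], emb[j]))
             for i, j in combinations(range(len(sentences)), 2)]
    return float(np.mean(dists))


sentences = [
    "A dog catches a frisbee in the park.",
    "In the park, a dog leaps to catch a frisbee.",
    "The chef seasons the soup before serving it.",
]
print("distinct-2 (form):", round(distinct_n(sentences), 3))
print("embedding diversity (content):", round(embedding_diversity(sentences), 3))
```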
Related papers
- Improving Diversity of Demographic Representation in Large Language Models via Collective-Critiques and Self-Voting [19.79214899011072]
This paper formalizes diversity of representation in generative large language models.
We present evaluation datasets and propose metrics to measure diversity in generated responses along people and culture axes.
We find that LLMs understand the notion of diversity, and that they can reason and critique their own responses for that goal.
arXiv Detail & Related papers (2023-10-25T10:17:17Z) - Diversify Question Generation with Retrieval-Augmented Style Transfer [68.00794669873196]
We propose RAST, a framework for Retrieval-Augmented Style Transfer.
The objective is to utilize the style of diverse templates for question generation.
We develop a novel Reinforcement Learning (RL) based approach that maximizes a weighted combination of diversity reward and consistency reward.
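A minimal sketch of the kind of reward weighting described above; the weight lam, the reward names, and their ranges are assumptions made for illustration rather than RAST's actual formulation.

```python
def combined_reward(diversity_reward: float, consistency_reward: float,
                    lam: float = 0.5) -> float:
    """Weighted combination used as an RL training signal (illustrative);
    lam trades off diversity against consistency with the input."""
    return lam * diversity_reward + (1.0 - lam) * consistency_reward


# e.g. a candidate question that is stylistically novel but slightly off-topic
print(combined_reward(diversity_reward=0.9, consistency_reward=0.6, lam=0.5))  # 0.75
```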
arXiv Detail & Related papers (2023-10-23T02:27:31Z) - Exploring Diversity in Back Translation for Low-Resource Machine
Translation [85.03257601325183]
Back translation is one of the most widely used methods for improving the performance of neural machine translation systems.
Recent research has sought to enhance the effectiveness of this method by increasing the 'diversity' of the generated translations.
This work puts forward a more nuanced framework for understanding diversity in training data, splitting it into lexical diversity and syntactic diversity.
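To make the lexical/syntactic split concrete, here is a rough sketch that proxies lexical diversity with distinct unigrams and syntactic diversity with distinct POS-tag trigrams. The choice of proxies and the use of NLTK's tagger are illustrative assumptions, not the measures defined in the paper.

```python
# Illustrative proxies only. Requires: pip install nltk, plus
# nltk.download("punkt") and nltk.download("averaged_perceptron_tagger").
import nltk


def lexical_diversity(sentences):
    """Distinct unigram ratio over the whole sentence set."""
    tokens = [t.lower() for s in sentences for t in nltk.word_tokenize(s)]
    return len(set(tokens)) / max(len(tokens), 1)


def syntactic_diversity(sentences, n=3):
    """Distinct POS-tag n-gram ratio, a crude proxy for structural variety."""
    ngrams = []
    for s in sentences:
        tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(s))]
        ngrams += [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)


back_translations = [
    "The cat sat on the mat.",
    "On the mat sat the cat.",
    "A cat was sitting on a mat.",
]
print(lexical_diversity(back_translations), syntactic_diversity(back_translations))
```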
arXiv Detail & Related papers (2022-06-01T15:21:16Z) - Semantic Diversity in Dialogue with Natural Language Inference [19.74618235525502]
This paper makes two substantial contributions to improving diversity in dialogue generation.
First, we propose a novel metric which uses Natural Language Inference (NLI) to measure the semantic diversity of a set of model responses for a conversation.
Second, we demonstrate how to iteratively improve the semantic diversity of a sampled set of responses via a new generation procedure called Diversity Threshold Generation.
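A rough sketch of how an NLI model could be used to score the semantic diversity of a response set, treating high pairwise entailment probability as low diversity. The checkpoint name, the pairing direction, and the aggregation are assumptions for illustration; the paper's metric may be defined differently.

```python
# Illustrative sketch; requires: pip install transformers torch.
from itertools import combinations

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # example NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
# Look up the entailment class index from the config instead of hard-coding it.
entail_idx = {v.lower(): k for k, v in model.config.id2label.items()}["entailment"]


def nli_diversity(responses):
    """1 - mean pairwise entailment probability over both directions of each pair."""
    probs = []
    for a, b in combinations(responses, 2):
        for prem, hyp in ((a, b), (b, a)):
            inputs = tokenizer(prem, hyp, return_tensors="pt", truncation=True)
            with torch.no_grad():
                logits = model(**inputs).logits
            probs.append(torch.softmax(logits, dim=-1)[0, entail_idx].item())
    return 1.0 - sum(probs) / len(probs)


print(nli_diversity([
    "You should see a doctor about that cough.",
    "Seeing a doctor about the cough would be wise.",
    "Maybe it will just go away on its own.",
]))
```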
arXiv Detail & Related papers (2022-05-03T13:56:32Z) - Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand [117.62186420147563]
We propose a generalization of leaderboards, bidimensional leaderboards (Billboards).
Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries.
We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation.
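A minimal sketch of the "linear ensemble of metrics" idea: fit a linear model that maps several automatic metric scores to human judgments and use the fitted combination as a single metric. The toy numbers and the use of scikit-learn's LinearRegression are assumptions for illustration; Billboards' actual fitting procedure may differ.

```python
# Illustrative only; requires: pip install scikit-learn numpy.
import numpy as np
from sklearn.linear_model import LinearRegression

# Rows = systems, columns = scores from individual metrics (toy values).
metric_scores = np.array([
    [0.31, 0.62, 0.44],   # e.g. an overlap metric, an embedding metric, a diversity metric
    [0.28, 0.71, 0.52],
    [0.35, 0.58, 0.40],
    [0.22, 0.80, 0.61],
])
human_ratings = np.array([3.1, 3.9, 2.8, 4.3])  # toy human judgments

ensemble = LinearRegression().fit(metric_scores, human_ratings)
print("learned weights:", ensemble.coef_, "intercept:", ensemble.intercept_)
print("ensemble score for a new system:", ensemble.predict([[0.30, 0.66, 0.48]])[0])
```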
arXiv Detail & Related papers (2021-12-08T06:34:58Z) - Random Network Distillation as a Diversity Metric for Both Image and
Text Generation [62.13444904851029]
We develop a new diversity metric that can be applied to data, both synthetic and natural, of any type.
We validate and deploy this metric on both images and text.
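A minimal PyTorch sketch of the random network distillation idea used as a diversity measure: a predictor is trained to imitate a fixed, randomly initialized target network on part of the sample set, and the prediction error on held-out samples serves as the diversity proxy (more diverse sets are harder to distill). The architectures, split, feature vectors, and training budget here are arbitrary illustrative choices, not the paper's protocol.

```python
# Illustrative sketch; requires: pip install torch.
import torch
import torch.nn as nn

torch.manual_seed(0)


def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_out))


def rnd_diversity(features: torch.Tensor, steps: int = 200) -> float:
    """features: (num_samples, dim) representations of the generated set,
    e.g. sentence embeddings. Returns held-out prediction error."""
    d = features.shape[1]
    target = mlp(d, 64)
    for p in target.parameters():          # target stays fixed and random
        p.requires_grad_(False)
    predictor = mlp(d, 64)
    opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

    n_train = features.shape[0] // 2
    train, held_out = features[:n_train], features[n_train:]
    for _ in range(steps):                  # distill the target on half the set
        loss = nn.functional.mse_loss(predictor(train), target(train))
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():                   # error on the other half = diversity proxy
        return nn.functional.mse_loss(predictor(held_out), target(held_out)).item()


# Toy features standing in for embeddings of generated sentences.
print(rnd_diversity(torch.randn(64, 32)))
```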
arXiv Detail & Related papers (2020-10-13T22:03:52Z) - Mark-Evaluate: Assessing Language Generation using Population Estimation
Methods [6.307450687141434]
We propose a family of metrics to assess language generation derived from population estimation methods widely used in ecology.
In synthetic experiments, our family of methods is sensitive to drops in quality and diversity.
Our methods show a higher correlation to human evaluation than existing metrics on several challenging tasks.
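For a sense of what "population estimation" means here, below is a sketch of the classic Lincoln-Petersen capture-recapture estimator, the simplest method in this family; whether and exactly how Mark-Evaluate adapts it to generated text is not spelled out above, so treat this purely as background.

```python
def lincoln_petersen(n_marked: int, n_recaptured: int, n_marked_in_recapture: int) -> float:
    """Classic capture-recapture population estimate:
    N_hat = (n_marked * n_recaptured) / n_marked_in_recapture."""
    if n_marked_in_recapture == 0:
        raise ValueError("No marked items in the second sample; estimate undefined.")
    return n_marked * n_recaptured / n_marked_in_recapture


# Toy example: mark 50 items, later draw 40 items, 10 of which were marked.
print(lincoln_petersen(50, 40, 10))  # -> 200.0 estimated population size
```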
arXiv Detail & Related papers (2020-10-09T14:31:53Z) - Evaluating for Diversity in Question Generation over Text [5.369031521471668]
We argue that commonly-used evaluation metrics such as BLEU and METEOR are not suitable for this task due to the inherent diversity of reference questions.
We propose a variational encoder-decoder model for this task.
arXiv Detail & Related papers (2020-08-17T13:16:12Z) - On the Relation between Quality-Diversity Evaluation and
Distribution-Fitting Goal in Text Generation [86.11292297348622]
We show that a linear combination of quality and diversity constitutes a divergence metric between the generated distribution and the real distribution.
We propose CR/NRR as a substitute for quality/diversity metric pair.
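One concrete way such a relation can arise, shown as a short worked derivation under the assumption that quality is measured as the expected real-data log-likelihood of generated samples and diversity as the entropy of the generation distribution; the paper's exact definitions may differ.

```latex
% Assume a generator distribution P_G and the real data distribution P_R,
% with quality Q and diversity D defined as below (an illustrative choice,
% not necessarily the paper's exact definitions).
\begin{align}
Q &= \mathbb{E}_{x \sim P_G}\big[\log P_R(x)\big], &
D &= -\,\mathbb{E}_{x \sim P_G}\big[\log P_G(x)\big] = H(P_G),\\
Q + D &= -\,\mathbb{E}_{x \sim P_G}\!\left[\log \frac{P_G(x)}{P_R(x)}\right]
       = -\,\mathrm{KL}\big(P_G \,\|\, P_R\big).
\end{align}
```

Under these definitions, an equally weighted sum of quality and diversity is exactly the negative KL divergence between the generated and real distributions, so optimizing the pair coincides with distribution fitting.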
arXiv Detail & Related papers (2020-07-03T04:06:59Z) - Informed Sampling for Diversity in Concept-to-Text NLG [8.883733362171034]
We propose an Imitation Learning approach to explore the level of diversity that a language generation model can reliably produce.
Specifically, we augment the decoding process with a meta-classifier trained to distinguish which words at any given timestep will lead to high-quality output.
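A toy sketch of the general idea of filtering next-token candidates with an auxiliary classifier during decoding; the threshold, the dummy scorer, and the interface are invented for illustration and are not the paper's meta-classifier or its imitation-learning setup.

```python
import random

random.seed(0)


def classifier_filtered_step(next_token_probs: dict, quality_scorer, threshold: float = 0.5):
    """Drop candidate tokens the classifier deems unlikely to lead to good output,
    renormalize the remainder, and sample one token."""
    kept = {tok: p for tok, p in next_token_probs.items() if quality_scorer(tok) >= threshold}
    if not kept:                      # fall back to the unfiltered distribution
        kept = dict(next_token_probs)
    total = sum(kept.values())
    tokens, weights = zip(*((t, p / total) for t, p in kept.items()))
    return random.choices(tokens, weights=weights, k=1)[0]


# Toy example: a dummy "meta-classifier" that penalizes the generic word "thing".
probs = {"guitar": 0.4, "thing": 0.35, "melody": 0.25}
print(classifier_filtered_step(probs, lambda tok: 0.1 if tok == "thing" else 0.9))
```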
arXiv Detail & Related papers (2020-04-29T17:43:24Z) - Evaluating the Evaluation of Diversity in Natural Language Generation [43.05127848086264]
We propose a framework for evaluating diversity metrics in natural language generation systems.
Our framework can advance the understanding of different diversity metrics, an essential step on the road towards better NLG systems.
arXiv Detail & Related papers (2020-04-06T20:44:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.