On the Relation between Quality-Diversity Evaluation and
Distribution-Fitting Goal in Text Generation
- URL: http://arxiv.org/abs/2007.01488v2
- Date: Wed, 19 Aug 2020 03:37:59 GMT
- Title: On the Relation between Quality-Diversity Evaluation and
Distribution-Fitting Goal in Text Generation
- Authors: Jianing Li, Yanyan Lan, Jiafeng Guo, Xueqi Cheng
- Abstract summary: We show that a linear combination of quality and diversity constitutes a divergence metric between the generated distribution and the real distribution.
We propose CR/NRR as a substitute quality/diversity metric pair.
- Score: 86.11292297348622
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of text generation models is to fit the underlying real probability
distribution of text. For performance evaluation, quality and diversity metrics
are usually applied. However, it is still not clear to what extent the
quality-diversity evaluation can reflect the distribution-fitting goal. In this
paper, we try to reveal this relation through a theoretical analysis. We prove that
under certain conditions, a linear combination of quality and diversity
constitutes a divergence metric between the generated distribution and the real
distribution. We also show that the commonly used BLEU/Self-BLEU metric pair
fails to match any divergence metric, and thus propose CR/NRR as a substitute
quality/diversity metric pair.
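Illustrative note (not part of the paper): one concrete instantiation of the "linear combination" result is to measure quality as the expected log real-probability of generated samples and diversity as the entropy of the generated distribution; their sum then equals the negative KL divergence between the generated and real distributions. The short Python sketch below checks this identity numerically on toy discrete distributions. The specific quality/diversity definitions here are assumptions chosen for illustration, not necessarily the paper's CR/NRR construction.

    # Illustration only: with quality(Q) = E_{x~Q}[log P(x)] and diversity(Q) = H(Q),
    # quality + diversity = -KL(Q || P), i.e. an equally weighted combination of the
    # two scores is (the negative of) a divergence between generated Q and real P.
    import numpy as np

    def quality(q, p):
        # expected log real-probability of samples drawn from the generated distribution q
        return float(np.sum(q * np.log(p)))

    def diversity(q):
        # Shannon entropy of the generated distribution q
        return float(-np.sum(q * np.log(q)))

    def kl(q, p):
        # KL divergence KL(q || p)
        return float(np.sum(q * np.log(q / p)))

    rng = np.random.default_rng(0)
    p = rng.dirichlet(np.ones(5))  # toy "real" distribution
    q = rng.dirichlet(np.ones(5))  # toy "generated" distribution

    assert np.isclose(quality(q, p) + diversity(q), -kl(q, p))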
Related papers
- A Unifying Information-theoretic Perspective on Evaluating Generative Models [5.524685807042777]
Several recent approaches utilize "precision" and "recall," borrowed from the classification domain, to individually quantify the output fidelity (realism) and output diversity (representation of the real data variation)
We unify a class of kth-nearest-neighbors (kNN)-based metrics under an information-theoretic lens using approaches from kNN density estimation.
We propose a tri-dimensional metric composed of Precision Cross-Entropy (PCE), Recall Cross-Entropy (RCE), and Recall Entropy (RE)
arXiv Detail & Related papers (2024-12-18T21:17:02Z) - Theoretical Aspects of Bias and Diversity in Minimum Bayes Risk Decoding [32.02732402635305]
Minimum Bayes Risk (MBR) decoding can mitigate this problem by utilizing automatic evaluation metrics and model-generated pseudo-references.
We decompose errors in the estimated quality of generated hypotheses into two key factors: bias, which reflects the closeness between utility functions and human evaluations, and diversity, which represents the variation in the estimated quality of utility functions.
arXiv Detail & Related papers (2024-10-19T07:32:10Z) - A Uniform Concentration Inequality for Kernel-Based Two-Sample Statistics [4.757470449749877]
We show that these metrics can be unified under a general framework of kernel-based two-sample statistics.
This paper establishes a novel uniform concentration inequality for the aforementioned kernel-based statistics.
As illustrative applications, we demonstrate how these bounds facilitate the derivation of error bounds for procedures such as distance covariance-based dimension reduction.
arXiv Detail & Related papers (2024-05-22T22:41:56Z) - Probabilistic Precision and Recall Towards Reliable Evaluation of
Generative Models [7.770029179741429]
We propose P-precision and P-recall (PP&PR), based on a probabilistic approach that addresses these problems.
We show that our PP&PR provide more reliable estimates for comparing fidelity and diversity than the existing metrics.
arXiv Detail & Related papers (2023-09-04T13:19:17Z) - Towards Multiple References Era -- Addressing Data Leakage and Limited
Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations; a minimal BLEU/Self-BLEU computation sketch follows this list.
arXiv Detail & Related papers (2023-08-06T14:49:26Z) - On the Efficacy of Sampling Adapters [82.5941326570812]
We propose a unified framework for understanding sampling adapters.
We argue that the shift they enforce can be viewed as a trade-off between precision and recall.
We find that several precision-emphasizing measures indeed indicate that sampling adapters can lead to probability distributions more aligned with the true distribution.
arXiv Detail & Related papers (2023-07-07T17:59:12Z) - Tailoring Language Generation Models under Total Variation Distance [55.89964205594829]
The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as the optimizing method.
We develop practical bounds to apply the total variation distance (TVD) to language generation.
We introduce the TaiLr objective that balances the tradeoff of estimating TVD.
arXiv Detail & Related papers (2023-02-26T16:32:52Z) - Measuring Fairness of Text Classifiers via Prediction Sensitivity [63.56554964580627]
ACCUMULATED PREDICTION SENSITIVITY measures fairness in machine learning models based on the model's prediction sensitivity to perturbations in input features.
We show that the metric can be theoretically linked with a specific notion of group fairness (statistical parity) and individual fairness.
arXiv Detail & Related papers (2022-03-16T15:00:33Z) - A Unified Framework for Multi-distribution Density Ratio Estimation [101.67420298343512]
Binary density ratio estimation (DRE) provides the foundation for many state-of-the-art machine learning algorithms.
We develop a general framework from the perspective of Bregman divergence minimization.
We show that our framework leads to methods that strictly generalize their counterparts in binary DRE.
arXiv Detail & Related papers (2021-12-07T01:23:20Z) - Distributional Random Forests: Heterogeneity Adjustment and Multivariate
Distributional Regression [0.8574682463936005]
We propose a novel forest construction for multivariate responses based on their joint conditional distribution.
The code is available as the drf package for both Python and R.
arXiv Detail & Related papers (2020-05-29T09:05:00Z)
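Illustrative note (not from any of the papers above): the BLEU/Self-BLEU quality/diversity pair discussed in the abstract, and the n-gram matching metrics mentioned in the related work, are commonly computed along the following lines. This is a minimal sketch assuming nltk's sentence_bleu; the tokenization, corpus handling, and smoothing choices are simplifications, not the exact protocol of any listed paper.

    # Quality: average BLEU of each generated sample against a shared reference set.
    # Diversity proxy: 1 - Self-BLEU, where each generated sample is scored against
    # the other generated samples used as references (lower Self-BLEU = more diverse).
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    _smooth = SmoothingFunction().method1

    def bleu_quality(generated, references):
        # average similarity of generated samples to the reference corpus
        scores = [sentence_bleu(references, hyp, smoothing_function=_smooth)
                  for hyp in generated]
        return sum(scores) / len(scores)

    def self_bleu_diversity(generated):
        # 1 - Self-BLEU: each sample is scored against the remaining samples
        scores = [sentence_bleu(generated[:i] + generated[i + 1:], hyp,
                                smoothing_function=_smooth)
                  for i, hyp in enumerate(generated)]
        return 1.0 - sum(scores) / len(scores)

    references = [["the", "cat", "sat", "on", "the", "mat"],
                  ["a", "cat", "was", "sitting", "on", "the", "mat"]]
    generated = [["the", "cat", "sat", "on", "a", "mat"],
                 ["the", "dog", "sat", "on", "the", "mat"],
                 ["a", "cat", "sat", "on", "the", "mat"]]
    print(bleu_quality(generated, references), self_bleu_diversity(generated))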