Operationalizing Specifications, In Addition to Test Sets for Evaluating
Constrained Generative Models
- URL: http://arxiv.org/abs/2212.00006v1
- Date: Sat, 19 Nov 2022 06:39:43 GMT
- Title: Operationalizing Specifications, In Addition to Test Sets for Evaluating
Constrained Generative Models
- Authors: Vikas Raunak, Matt Post and Arul Menezes
- Abstract summary: We argue that the scale of generative models could be exploited to raise the abstraction level at which evaluation itself is conducted.
Our recommendations are based on leveraging specifications as a powerful instrument to evaluate generation quality.
- Score: 17.914521288548844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we present recommendations on the evaluation of
state-of-the-art generative models for constrained generation tasks. Progress
on generative models has been rapid in recent years, and these large-scale
models have had three impacts. First, the fluency of generation in both the
language and vision modalities has rendered common average-case evaluation
metrics much less useful for diagnosing system errors. Second, the same
substrate models now form the basis of a number of applications, driven both
by the utility of their representations and by phenomena such as in-context
learning, which raise the abstraction level of interacting with such models.
Third, user expectations around these models and their feted public releases
have made the technical challenge of out-of-domain generalization much less
excusable in practice. Yet our evaluation methodologies have not adapted to
these changes: while the utility of generative models and the methods of
interacting with them have expanded, a similar expansion has not been observed
in their evaluation practices. In this paper, we argue that the scale of
generative models could be exploited to raise the abstraction level at which
evaluation itself is conducted, and we provide recommendations to that end.
Our recommendations are based on leveraging specifications as a powerful
instrument to evaluate generation quality and are readily applicable to a
variety of tasks.
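The abstract's central recommendation is to evaluate generations against specifications rather than solely against reference-based test sets. As a purely illustrative sketch of how such a check might be operationalized (the paper does not prescribe an implementation; the `Spec` and `evaluate_spec` names and the terminology-constraint example below are hypothetical), a specification can be treated as a predicate over input/output pairs and reported as a pass rate:

```python
# Purely illustrative sketch, not from the paper: treats a specification as a
# named predicate over (input, output) pairs and reports a pass rate for a
# constrained generation task. All names here are hypothetical.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Spec:
    """A specification: a human-readable name plus a predicate over (input, output)."""
    name: str
    holds: Callable[[str, str], bool]


def evaluate_spec(generate: Callable[[str], str],
                  inputs: Iterable[str],
                  spec: Spec) -> float:
    """Generate an output for each input and return the fraction satisfying the spec."""
    outcomes = [spec.holds(x, generate(x)) for x in inputs]
    return sum(outcomes) / max(len(outcomes), 1)


# Example: a terminology constraint requiring a mandated term in every output.
uses_required_term = Spec(
    name="output contains the required term 'neural network'",
    holds=lambda src, out: "neural network" in out.lower(),
)

if __name__ == "__main__":
    # Stand-in "model" so the sketch runs end to end; swap in a real generator.
    dummy_generate = lambda src: f"Paraphrase: {src} (a neural network was used)"
    prompts = ["Explain the classifier briefly.", "Summarize the training setup."]
    rate = evaluate_spec(dummy_generate, prompts, uses_required_term)
    print(f"{uses_required_term.name}: pass rate = {rate:.2f}")
```

Reporting per-specification pass rates, rather than a single average-case score, is in the spirit of the paper's argument for raising the abstraction level at which evaluation is conducted.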
Related papers
- Embedding-based statistical inference on generative models [10.948308354932639]
We extend results related to embedding-based representations of generative models to classical statistical inference settings.
We demonstrate that using the perspective space as the basis of a notion of "similar" is effective for multiple model-level inference tasks.
arXiv Detail & Related papers (2024-10-01T22:28:39Z)
- Eureka: Evaluating and Understanding Large Foundation Models [23.020996995362104]
We present Eureka, an open-source framework for standardizing evaluations of large foundation models beyond single-score reporting and rankings.
We conduct an analysis of 12 state-of-the-art models, providing in-depth insights into failure understanding and model comparison.
arXiv Detail & Related papers (2024-09-13T18:01:49Z)
- PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis [14.526536510805755]
We present a comprehensive framework for predicting the effects of perturbations in single cells, designed to standardize benchmarking in this rapidly evolving field.
Our framework, PerturBench, includes a user-friendly platform, diverse datasets, metrics for fair model comparison, and detailed performance analysis.
arXiv Detail & Related papers (2024-08-20T07:40:20Z)
- OLMES: A Standard for Language Model Evaluations [64.85905119836818]
We propose OLMES, a practical, open standard for reproducible language model evaluations.
We identify and review the varying factors in evaluation practices adopted by the community.
OLMES supports meaningful comparisons between smaller base models that require the unnatural "cloze" formulation of multiple-choice questions and larger models that can utilize the original formulation.
arXiv Detail & Related papers (2024-06-12T17:37:09Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves, for example, the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- GMValuator: Similarity-based Data Valuation for Generative Models [41.76259565672285]
We introduce Generative Model Valuator (GMValuator), the first training-free and model-agnostic approach to provide data valuation for generation tasks.
GMValuator is extensively evaluated on various datasets and generative architectures to demonstrate its effectiveness.
arXiv Detail & Related papers (2023-04-21T02:02:02Z)
- Are Neural Topic Models Broken? [81.15470302729638]
We study the relationship between automated and human evaluation of topic models.
We find that neural topic models fare worse in both respects compared to an established classical method.
arXiv Detail & Related papers (2022-10-28T14:38:50Z)
- Generalization Properties of Retrieval-based Models [50.35325326050263]
Retrieval-based machine learning methods have enjoyed success on a wide range of problems.
Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored.
We present a formal treatment of retrieval-based models to characterize their generalization ability.
arXiv Detail & Related papers (2022-10-06T00:33:01Z)
- Beyond Average Performance -- exploring regions of deviating performance for black box classification models [0.0]
We describe two approaches that can be used to provide interpretable descriptions of the expected performance of any black box classification model.
These approaches are of high practical relevance, as they provide interpretable means to uncover and describe situations in which a model's performance is expected to deviate significantly from its average behaviour.
arXiv Detail & Related papers (2021-09-16T20:46:52Z)
- On the model-based stochastic value gradient for continuous reinforcement learning [50.085645237597056]
We show that simple model-based agents can outperform state-of-the-art model-free agents in terms of both sample-efficiency and final reward.
Our findings suggest that model-based policy evaluation deserves closer attention.
arXiv Detail & Related papers (2020-08-28T17:58:29Z)
- Evaluating the Disentanglement of Deep Generative Models through Manifold Topology [66.06153115971732]
We present a method for quantifying disentanglement that only uses the generative model.
We empirically evaluate several state-of-the-art models across multiple datasets.
arXiv Detail & Related papers (2020-06-05T20:54:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.