Operationalizing Specifications, In Addition to Test Sets for Evaluating
Constrained Generative Models
- URL: http://arxiv.org/abs/2212.00006v1
- Date: Sat, 19 Nov 2022 06:39:43 GMT
- Title: Operationalizing Specifications, In Addition to Test Sets for Evaluating
Constrained Generative Models
- Authors: Vikas Raunak, Matt Post and Arul Menezes
- Abstract summary: We argue that the scale of generative models could be exploited to raise the abstraction level at which evaluation itself is conducted.
Our recommendations are based on leveraging specifications as a powerful instrument to evaluate generation quality.
- Score: 17.914521288548844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we present some recommendations on the evaluation of
state-of-the-art generative models for constrained generation tasks. The
progress on generative models has been rapid in recent years. These large-scale
models have had three impacts: firstly, the fluency of generation in both
language and vision modalities has rendered common average-case evaluation
metrics much less useful in diagnosing system errors. Secondly, the same
substrate models now form the basis of a number of applications, driven both by
the utility of their representations as well as phenomena such as in-context
learning, which raise the abstraction level of interacting with such models.
Thirdly, the user expectations around these models and their feted public
releases have made the technical challenge of out-of-domain generalization much
less excusable in practice. Subsequently, our evaluation methodologies haven't
adapted to these changes. More concretely, while the associated utility and
methods of interacting with generative models have expanded, a similar
expansion has not been observed in their evaluation practices. In this paper,
we argue that the scale of generative models could be exploited to raise the
abstraction level at which evaluation itself is conducted and provide
recommendations for the same. Our recommendations are based on leveraging
specifications as a powerful instrument to evaluate generation quality and are
readily applicable to a variety of tasks.
Related papers
- OLMES: A Standard for Language Model Evaluations [64.85905119836818]
We propose OLMES, a practical, open standard for reproducible language model evaluations.
We identify and review the varying factors in evaluation practices adopted by the community.
OLMES supports meaningful comparisons between smaller base models that require the unnatural "cloze" formulation of multiple-choice questions.
(2024-06-12T17:37:09Z)
- When is an Embedding Model More Promising than Another? [33.540506562970776]
Embedders play a central role in machine learning, projecting any object into numerical representations that can be leveraged to perform various downstream tasks.
The evaluation of embedding models typically depends on domain-specific empirical approaches.
We present a unified approach to evaluate embedders, drawing upon the concepts of sufficiency and informativeness.
(2024-06-11T18:13:46Z)
- Has Your Pretrained Model Improved? A Multi-head Posterior Based Approach [25.927323251675386]
We leverage the meta-features associated with each entity as a source of worldly knowledge and employ entity representations from the models.
We propose using the consistency between these representations and the meta-features as a metric for evaluating pre-trained models.
Our method's effectiveness is demonstrated across various domains, including models with relational datasets, large language models and image models.
(2024-01-02T17:08:26Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves, for example, the absolute performance of the Llama 2 model by up to 15% points relative to baselines.
(2023-11-06T00:21:44Z)
- GMValuator: Similarity-based Data Valuation for Generative Models [41.76259565672285]
We introduce Generative Model Valuator (GMValuator), the first training-free and model-agnostic approach to provide data valuation for generation tasks.
GMValuator is extensively evaluated on various datasets and generative architectures to demonstrate its effectiveness.
(2023-04-21T02:02:02Z)
- Are Neural Topic Models Broken? [81.15470302729638]
We study the relationship between automated and human evaluation of topic models.
We find that neural topic models fare worse in both respects compared to an established classical method.
(2022-10-28T14:38:50Z)
- Generalization Properties of Retrieval-based Models [50.35325326050263]
Retrieval-based machine learning methods have enjoyed success on a wide range of problems.
Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored.
We present a formal treatment of retrieval-based models to characterize their generalization ability.
(2022-10-06T00:33:01Z)
- Beyond Average Performance -- exploring regions of deviating performance for black box classification models [0.0]
We describe two approaches that can be used to provide interpretable descriptions of the expected performance of any black box classification model.
These approaches are of high practical relevance as they provide means to uncover and describe in an interpretable way situations where the models are expected to have a performance that deviates significantly from their average behaviour.
(2021-09-16T20:46:52Z)
- On the model-based stochastic value gradient for continuous reinforcement learning [50.085645237597056]
We show that simple model-based agents can outperform state-of-the-art model-free agents in terms of both sample-efficiency and final reward.
Our findings suggest that model-based policy evaluation deserves closer attention.
(2020-08-28T17:58:29Z)
- Evaluating the Disentanglement of Deep Generative Models through Manifold Topology [66.06153115971732]
We present a method for quantifying disentanglement that only uses the generative model.
We empirically evaluate several state-of-the-art models across multiple datasets.
(2020-06-05T20:54:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.