A Study on the Evaluation of Generative Models
- URL: http://arxiv.org/abs/2206.10935v1
- Date: Wed, 22 Jun 2022 09:27:31 GMT
- Title: A Study on the Evaluation of Generative Models
- Authors: Eyal Betzalel, Coby Penso, Aviv Navon, Ethan Fetaya
- Abstract summary: Implicit generative models, which do not return likelihood values, have become prevalent in recent years.
In this work, we study the evaluation metrics of generative models by generating a high-quality synthetic dataset.
Our study shows that while FID and IS do correlate to several f-divergences, their ranking of close models can vary considerably.
- Score: 19.18642459565609
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Implicit generative models such as generative adversarial networks
and diffusion models, which do not return likelihood values, have become
prevalent in recent years. While these models have shown remarkable results,
evaluating their performance is challenging. This issue is of vital importance
for pushing research forward and distinguishing meaningful gains from random noise.
Currently, heuristic metrics such as the Inception score (IS) and Frechet
Inception Distance (FID) are the most common evaluation metrics, but what they
measure is not entirely clear. Additionally, there are questions regarding how
meaningful their score actually is. In this work, we study the evaluation
metrics of generative models by generating a high-quality synthetic dataset on
which we can estimate classical metrics for comparison. Our study shows that
while FID and IS do correlate to several f-divergences, their ranking of close
models can vary considerably, making them problematic when used for fine-grained
comparison. We further use this experimental setting to study which evaluation
metric best correlates with our probabilistic metrics. Lastly, we look into the
base features used for metrics such as FID.
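The abstract centers on the heuristic metrics IS and FID. For reference, the following is a minimal sketch of how these two metrics are commonly computed from the outputs of a fixed pretrained network; the function names, array shapes, and the use of NumPy/SciPy are illustrative assumptions and are not taken from the paper's code.

```python
# Minimal sketch of FID and IS, assuming features/probabilities have already
# been extracted from real and generated samples with a fixed network
# (e.g., an Inception model). Names and shapes are hypothetical.
import numpy as np
from scipy import linalg


def frechet_inception_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # Matrix square root of the covariance product; tiny imaginary parts
    # from numerical error are discarded.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))


def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """IS from class probabilities p(y|x): exp(E_x KL(p(y|x) || p(y)))."""
    p_y = probs.mean(axis=0, keepdims=True)          # marginal class distribution
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

In practice the feature matrices would be activations of a fixed feature extractor applied to real and generated samples; the choice of these "base features" is exactly what the paper revisits in its final analysis.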
Related papers
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- What is the Best Automated Metric for Text to Motion Generation? [19.71712698183703]
There is growing interest in generating skeleton-based human motions from natural language descriptions.
Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments.
This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better.
arXiv Detail & Related papers (2023-09-19T01:59:54Z)
- Scaling Laws Do Not Scale [54.72120385955072]
Recent work has argued that as the size of a dataset increases, the performance of a model trained on that dataset will increase.
We argue that this scaling law relationship depends on metrics used to measure performance that may not correspond with how different groups of people perceive the quality of models' output.
Different communities may also have values in tension with each other, leading to difficult, potentially irreconcilable choices about metrics used for model evaluations.
arXiv Detail & Related papers (2023-07-05T15:32:21Z)
- Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models [14.330863905963442]
We compare 17 modern metrics for evaluating the overall performance, fidelity, diversity, rarity, and memorization of generative models.
We find that the state-of-the-art perceptual realism of diffusion models as judged by humans is not reflected in commonly reported metrics such as FID.
Next, we investigate data memorization, and find that generative models do memorize training examples on simple, smaller datasets like CIFAR10, but not necessarily on more complex datasets like ImageNet.
arXiv Detail & Related papers (2023-06-07T18:00:00Z)
- Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models [3.9052860539161918]
We propose a simple method for measuring the scale of a model's reliance on any identified spurious feature.
We assess robustness towards a large set of known and newly found prediction biases for various pre-trained models and debiasing methods in Question Answering (QA).
We find that while existing debiasing methods can mitigate reliance on a chosen spurious feature, the OOD performance gains of these methods cannot be explained by mitigated reliance on biased features.
arXiv Detail & Related papers (2023-05-11T14:35:00Z)
- Feature Likelihood Divergence: Evaluating the Generalization of Generative Models Using Samples [25.657798631897908]
Feature Likelihood Divergence provides a comprehensive trichotomic evaluation of generative models.
We empirically demonstrate the ability of FLD to identify overfitting problem cases, even when previously proposed metrics fail.
arXiv Detail & Related papers (2023-02-09T04:57:27Z)
- Exploring validation metrics for offline model-based optimisation with diffusion models [50.404829846182764]
In model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black box function called the (ground truth) oracle.
While an approximation to the ground-truth oracle can be trained and used in its place during model validation to measure the mean reward over generated candidates, the evaluation is approximate and vulnerable to adversarial examples.
This is encapsulated under our proposed evaluation framework which is also designed to measure extrapolation.
arXiv Detail & Related papers (2022-11-19T16:57:37Z)
- Performance metrics for intervention-triggering prediction models do not reflect an expected reduction in outcomes from using the model [71.9860741092209]
Clinical researchers often select among and evaluate risk prediction models.
Standard metrics calculated from retrospective data are only related to model utility under certain assumptions.
When predictions are delivered repeatedly throughout time, the relationship between standard metrics and utility is further complicated.
arXiv Detail & Related papers (2020-06-02T16:26:49Z) - Evaluation Metrics for Conditional Image Generation [100.69766435176557]
We present two new metrics for evaluating generative models in the class-conditional image generation setting.
A theoretical analysis shows the motivation behind each proposed metric and links the novel metrics to their unconditional counterparts.
We provide an extensive empirical evaluation, comparing the metrics to their unconditional variants and to other metrics, and utilize them to analyze existing generative models.
arXiv Detail & Related papers (2020-04-26T12:15:16Z)