SelfEval: Leveraging the discriminative nature of generative models for evaluation
- URL: http://arxiv.org/abs/2311.10708v1
- Date: Fri, 17 Nov 2023 18:58:16 GMT
- Title: SelfEval: Leveraging the discriminative nature of generative models for evaluation
- Authors: Sai Saketh Rambhatla, Ishan Misra
- Abstract summary: We show that text-to-image generative models can be 'inverted' to assess their own text-image understanding capabilities.
Our method, called SelfEval, uses the generative model to compute the likelihood of real images given text prompts.
- Score: 35.7242199928684
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we show that text-to-image generative models can be 'inverted'
to assess their own text-image understanding capabilities in a completely
automated manner.
Our method, called SelfEval, uses the generative model to compute the
likelihood of real images given text prompts, making the generative model
directly applicable to discriminative tasks.
Using SelfEval, we repurpose standard datasets created for evaluating
multimodal text-image discriminative models to evaluate generative models in a
fine-grained manner: assessing their performance on attribute binding, color
recognition, counting, shape recognition, and spatial understanding.
To the best of our knowledge, SelfEval is the first automated metric to show a
high degree of agreement with gold-standard human evaluations of text
faithfulness across multiple models and benchmarks.
Moreover, SelfEval enables us to evaluate generative models on challenging
tasks such as the Winoground image-score task, where they demonstrate
performance competitive with discriminative models.
We also show severe drawbacks of standard automated metrics such as CLIP-score
for measuring text faithfulness on benchmarks such as DrawBench, and how
SelfEval sidesteps these issues.
We hope SelfEval enables easy and reliable automated evaluation for diffusion
models.
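A minimal sketch of the scoring idea described in the abstract, not the authors' implementation: the likelihood of a real image under each candidate prompt is approximated by the text-conditional diffusion model's average denoising error (an ELBO-style surrogate), and the lowest-error prompt is taken as the model's discriminative choice. The callables eps_model, encode_image, and encode_text, the linear noise schedule, and all hyperparameters below are assumed placeholders.

import torch

@torch.no_grad()
def selfeval_prompt_scores(eps_model, encode_image, encode_text, image, prompts,
                           num_timesteps=1000, num_samples=32, device="cuda"):
    """Return one score per prompt; higher means the image is more likely under that prompt."""
    x0 = encode_image(image).to(device)  # clean latents, shape (1, C, H, W)
    # Linear beta schedule -> cumulative alpha_bar, as in standard DDPM training (assumption).
    betas = torch.linspace(1e-4, 2e-2, num_timesteps, device=device)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    scores = []
    for prompt in prompts:
        text_emb = encode_text(prompt).to(device)
        total_err = 0.0
        for _ in range(num_samples):
            t = torch.randint(0, num_timesteps, (1,), device=device)
            noise = torch.randn_like(x0)
            a_bar = alphas_cumprod[t].view(1, 1, 1, 1)
            x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # forward diffusion q(x_t | x_0)
            eps_hat = eps_model(x_t, t, text_emb)                   # predicted noise
            total_err += torch.mean((eps_hat - noise) ** 2).item()
        scores.append(-total_err / num_samples)  # lower denoising error -> higher score
    return scores

On a discriminative benchmark, each real image is paired with candidate captions (for example, correct vs. attribute-swapped), and the model is counted as correct when the ground-truth caption receives the highest score.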
Related papers
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving [68.73885845181242]
We propose an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios.
We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
arXiv Detail & Related papers (2024-03-26T04:27:56Z)
- OMNIINPUT: A Model-centric Evaluation Framework through Output Distribution [31.00645110294068]
We propose a model-centric evaluation framework, OmniInput, to evaluate the quality of an AI/ML model's predictions on all possible inputs.
We employ an efficient sampler to obtain representative inputs and the output distribution of the trained model.
Our experiments demonstrate that OmniInput enables a more fine-grained comparison between models.
arXiv Detail & Related papers (2023-12-06T04:53:12Z)
- FERGI: Automatic Annotation of User Preferences for Text-to-Image Generation from Spontaneous Facial Expression Reaction [2.3691158404002066]
We develop and test a method to automatically score user preferences from their spontaneous facial expression reaction to the generated images.
We collect a dataset of Facial Expression Reaction to Generated Images (FERGI) and show that the activations of multiple facial action units (AUs) are highly correlated with user evaluations of the generated images.
We develop an FAU-Net (Facial Action Units Neural Network), which receives inputs from an AU estimation model, to automatically score user preferences for text-to-image generation.
arXiv Detail & Related papers (2023-12-05T23:33:49Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the performance of the Llama 2 model by up to 15% points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- GMValuator: Similarity-based Data Valuation for Generative Models [41.76259565672285]
We introduce Generative Model Valuator (GMValuator), the first training-free and model-agnostic approach to provide data valuation for generation tasks.
GMValuator is extensively evaluated on various datasets and generative architectures to demonstrate its effectiveness.
arXiv Detail & Related papers (2023-04-21T02:02:02Z)
- How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models [95.8037674226622]
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion.
Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
arXiv Detail & Related papers (2021-02-17T18:25:30Z)
- Residual Energy-Based Models for Text [46.22375671394882]
We show that the generations of auto-regressive language models can be reliably distinguished from real text by statistical discriminators.
This suggests that the auto-regressive models can be improved by incorporating the (globally normalized) discriminators into the generative process.
arXiv Detail & Related papers (2020-04-06T13:44:03Z)
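The residual energy-based modeling idea in the last entry can be illustrated with a small, hedged sketch (not the paper's exact training or sampling procedure): candidates are drawn from the base autoregressive language model and resampled with weights proportional to exp(-E(x)), where E is the energy assigned by a real-vs-generated discriminator. sample_from_lm and energy_fn are hypothetical placeholders.

import torch

def residual_ebm_sample(sample_from_lm, energy_fn, prompt, num_candidates=16):
    """Draw one continuation from p(x) proportional to p_LM(x | prompt) * exp(-E(prompt, x))."""
    candidates = [sample_from_lm(prompt) for _ in range(num_candidates)]  # proposals from the base LM
    energies = torch.tensor([energy_fn(prompt, c) for c in candidates])
    weights = torch.softmax(-energies, dim=0)  # lower energy -> higher resampling weight
    idx = torch.multinomial(weights, num_samples=1).item()
    return candidates[idx]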
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.