SelfEval: Leveraging the discriminative nature of generative models for evaluation
- URL: http://arxiv.org/abs/2311.10708v2
- Date: Wed, 27 Nov 2024 00:15:47 GMT
- Title: SelfEval: Leveraging the discriminative nature of generative models for evaluation
- Authors: Sai Saketh Rambhatla, Ishan Misra
- Abstract summary: We present an automated way to evaluate the text alignment of text-to-image generative diffusion models.
Our method, called SelfEval, uses the generative model to compute the likelihood of real images given text prompts.
- Score: 30.239717220862143
- License:
- Abstract: We present an automated way to evaluate the text alignment of text-to-image generative diffusion models using standard image-text recognition datasets. Our method, called SelfEval, uses the generative model to compute the likelihood of real images given text prompts, and this likelihood can then be used to perform recognition tasks with the generative model. We evaluate generative models on standard datasets created for multimodal text-image discriminative learning and assess fine-grained aspects of their performance: attribute binding, color recognition, counting, shape recognition, and spatial understanding. Existing automated metrics rely on an external pretrained model such as CLIP (a vision-language model) or an LLM, and are therefore sensitive to the exact pretrained model and its limitations. SelfEval sidesteps these issues and, to the best of our knowledge, is the first automated metric to show a high degree of agreement with gold-standard human evaluations of text faithfulness across multiple generative models, benchmarks and evaluation metrics. SelfEval also reveals that generative models achieve recognition performance competitive with discriminative models on challenging tasks such as the Winoground image-score task. We hope SelfEval enables easy and reliable automated evaluation of diffusion models.
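The core mechanism lends itself to a short sketch: treat each candidate caption as a class, use the diffusion model's denoising error on the real image under that caption as a proxy for the conditional likelihood, and pick the caption with the best score. The code below illustrates this idea with a diffusers-style UNet and noise scheduler; the function names, the simple Monte-Carlo denoising-error proxy, and the interfaces are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of likelihood-based recognition with a text-to-image diffusion
# model (the idea behind SelfEval). Assumes a diffusers-style latent UNet and
# noise scheduler; the Monte-Carlo denoising-error proxy below is a
# simplification, not the authors' exact likelihood estimator.
import torch
import torch.nn.functional as F


@torch.no_grad()
def caption_score(unet, scheduler, image_latents, text_emb, n_samples=32):
    """Proxy for log p(image | caption): negative average denoising error over
    randomly drawn timesteps and noise (lower error ~ higher likelihood)."""
    device = image_latents.device
    total_err = 0.0
    for _ in range(n_samples):
        t = torch.randint(0, scheduler.config.num_train_timesteps, (1,), device=device)
        noise = torch.randn_like(image_latents)
        noisy_latents = scheduler.add_noise(image_latents, noise, t)
        noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
        total_err += F.mse_loss(noise_pred, noise).item()
    return -total_err / n_samples


@torch.no_grad()
def classify_image(unet, scheduler, image_latents, candidate_text_embs):
    """Treat each candidate caption as a class and return the most likely one."""
    scores = [caption_score(unet, scheduler, image_latents, emb)
              for emb in candidate_text_embs]
    return int(torch.tensor(scores).argmax())
```

Text faithfulness can then be reported as classification accuracy on image-text benchmarks (for example, picking the correct caption among hard negatives), which puts generative and discriminative models on the same footing.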
Related papers
- Interactive Visual Assessment for Text-to-Image Generation Models [28.526897072724662]
We propose DyEval, a dynamic interactive visual assessment framework for generative models.
DyEval features an intuitive visual interface that enables users to interactively explore and analyze model behaviors.
Our framework provides valuable insights for improving generative models and has broad implications for advancing the reliability and capabilities of visual generation systems.
arXiv Detail & Related papers (2024-11-23T10:06:18Z)
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- Towards Reliable Verification of Unauthorized Data Usage in Personalized Text-to-Image Diffusion Models [23.09033991200197]
New personalization techniques have been proposed to customize the pre-trained base models for crafting images with specific themes or styles.
Such a lightweight solution raises a new concern about whether the personalized models are trained on unauthorized data.
We introduce SIREN, a novel methodology to proactively trace unauthorized data usage in black-box personalized text-to-image diffusion models.
arXiv Detail & Related papers (2024-10-14T12:29:23Z)
- Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images.
We identify model weaknesses by testing the model using the counterfactual image dataset.
We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
arXiv Detail & Related papers (2024-06-19T08:07:14Z)
- Class-Conditional self-reward mechanism for improved Text-to-Image models [1.8434042562191815]
We build upon the concept of self-rewarding models and introduce its vision equivalent for Text-to-Image generative AI models.
This approach works by fine-tuning a diffusion model on a self-generated, self-judged dataset.
The fine-tuned model is evaluated to be at least 60% better than existing commercial and research text-to-image models.
arXiv Detail & Related papers (2024-05-22T09:28:43Z)
- AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving [68.73885845181242]
We propose an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios.
We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
arXiv Detail & Related papers (2024-03-26T04:27:56Z)
- FERGI: Automatic Annotation of User Preferences for Text-to-Image Generation from Spontaneous Facial Expression Reaction [2.3691158404002066]
We develop and test a method to automatically score user preferences from their spontaneous facial expression reaction to the generated images.
We collect a dataset of Facial Expression Reaction to Generated Images (FERGI) and show that the activations of multiple facial action units (AUs) are highly correlated with user evaluations of the generated images.
We develop an FAU-Net (Facial Action Units Neural Network), which receives inputs from an AU estimation model, to automatically score user preferences for text-to-image generation.
arXiv Detail & Related papers (2023-12-05T23:33:49Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models [14.330863905963442]
We compare 17 modern metrics for evaluating the overall performance, fidelity, diversity, rarity, and memorization of generative models.
We find that the state-of-the-art perceptual realism of diffusion models as judged by humans is not reflected in commonly reported metrics such as FID.
Next, we investigate data memorization, and find that generative models do memorize training examples on simple, smaller datasets like CIFAR10, but not necessarily on more complex datasets like ImageNet.
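For reference, FID (mentioned above but not defined in this listing) is the standard Fréchet distance between Gaussian fits to the Inception-v3 features of real and generated images; the formula below is the common definition, independent of the listed paper:

```latex
% Fréchet Inception Distance between real (r) and generated (g) images,
% each set summarized by the mean and covariance of its Inception-v3 features.
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
             + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2} \right)
```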
arXiv Detail & Related papers (2023-06-07T18:00:00Z)
- Zero-shot Model Diagnosis [80.36063332820568]
A common approach to evaluating deep learning models is to build a labeled test set with attributes of interest and assess how well the model performs on it.
This paper argues the case that Zero-shot Model Diagnosis (ZOOM) is possible without the need for a test set nor labeling.
arXiv Detail & Related papers (2023-03-27T17:59:33Z)
- How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models [95.8037674226622]
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion.
Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
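To make the fidelity/diversity distinction concrete, a common nearest-neighbor formulation (in the spirit of improved precision/recall for generative models, and not this paper's exact three-dimensional metric) scores precision as the fraction of generated samples that fall inside the k-NN manifold of real features, and recall as the converse. A minimal sketch under those assumptions:

```python
# Illustrative nearest-neighbor precision/recall on feature embeddings.
# This is a simplified stand-in to make "fidelity vs. diversity" concrete,
# not the sample-level metric of the paper listed above.
import torch


def knn_radii(feats, k=3):
    """Distance from each point to its k-th nearest neighbor (self excluded)."""
    d = torch.cdist(feats, feats)
    d.fill_diagonal_(float("inf"))
    return d.topk(k, largest=False).values[:, -1]


def manifold_coverage(query, support, support_radii):
    """Fraction of query points lying inside the k-NN ball of any support point."""
    d = torch.cdist(query, support)           # (n_query, n_support)
    inside = d <= support_radii.unsqueeze(0)  # broadcast per-support radii
    return inside.any(dim=1).float().mean().item()


def precision_recall(real_feats, fake_feats, k=3):
    precision = manifold_coverage(fake_feats, real_feats, knn_radii(real_feats, k))  # fidelity
    recall = manifold_coverage(real_feats, fake_feats, knn_radii(fake_feats, k))     # diversity
    return precision, recall
```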
arXiv Detail & Related papers (2021-02-17T18:25:30Z)