SelfEval: Leveraging the discriminative nature of generative models for evaluation
- URL: http://arxiv.org/abs/2311.10708v1
- Date: Fri, 17 Nov 2023 18:58:16 GMT
- Title: SelfEval: Leveraging the discriminative nature of generative models for evaluation
- Authors: Sai Saketh Rambhatla, Ishan Misra
- Abstract summary: We show that text-to-image generative models can be 'inverted' to assess their own text-image understanding capabilities.
Our method, called SelfEval, uses the generative model to compute the likelihood of real images given text prompts.
- Score: 35.7242199928684
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we show that text-to-image generative models can be 'inverted'
to assess their own text-image understanding capabilities in a completely
automated manner.
Our method, called SelfEval, uses the generative model to compute the
likelihood of real images given text prompts, making the generative model
directly applicable to discriminative tasks.
Using SelfEval, we repurpose standard datasets created for evaluating
multimodal text-image discriminative models to evaluate generative models in a
fine-grained manner: assessing their performance on attribute binding, color
recognition, counting, shape recognition, and spatial understanding.
To the best of our knowledge, SelfEval is the first automated metric to show a
high degree of agreement with gold-standard human evaluations of text
faithfulness across multiple models and benchmarks.
Moreover, SelfEval enables us to evaluate generative models on challenging
tasks such as the Winoground image-score task, where they demonstrate
performance competitive with discriminative models.
We also show severe drawbacks of standard automated metrics such as CLIP-score
for measuring text faithfulness on benchmarks such as DrawBench, and how
SelfEval sidesteps these issues.
We hope SelfEval enables easy and reliable automated evaluation for diffusion
models.
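A minimal sketch of the scoring idea described in the abstract, not the authors' implementation: the likelihood of a real image under each candidate prompt is approximated by the text-conditional diffusion model's average denoising error (an ELBO-style surrogate), and the lowest-error prompt is taken as the model's discriminative choice. The callables eps_model, encode_image, and encode_text, the linear noise schedule, and all hyperparameters below are assumed placeholders.

import torch

@torch.no_grad()
def selfeval_prompt_scores(eps_model, encode_image, encode_text, image, prompts,
                           num_timesteps=1000, num_samples=32, device="cuda"):
    """Return one score per prompt; higher means the image is more likely under that prompt."""
    x0 = encode_image(image).to(device)  # clean latents, shape (1, C, H, W)
    # Linear beta schedule -> cumulative alpha_bar, as in standard DDPM training (assumption).
    betas = torch.linspace(1e-4, 2e-2, num_timesteps, device=device)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    scores = []
    for prompt in prompts:
        text_emb = encode_text(prompt).to(device)
        total_err = 0.0
        for _ in range(num_samples):
            t = torch.randint(0, num_timesteps, (1,), device=device)
            noise = torch.randn_like(x0)
            a_bar = alphas_cumprod[t].view(1, 1, 1, 1)
            x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # forward diffusion q(x_t | x_0)
            eps_hat = eps_model(x_t, t, text_emb)                   # predicted noise
            total_err += torch.mean((eps_hat - noise) ** 2).item()
        scores.append(-total_err / num_samples)  # lower denoising error -> higher score
    return scores

On a discriminative benchmark, each real image is paired with candidate captions (for example, correct vs. attribute-swapped), and the model is counted as correct when the ground-truth caption receives the highest score.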
Related papers
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving [68.73885845181242]
We propose an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios.
We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
arXiv Detail & Related papers (2024-03-26T04:27:56Z)
- OMNIINPUT: A Model-centric Evaluation Framework through Output Distribution [31.00645110294068]
We propose a model-centric evaluation framework, OmniInput, to evaluate the quality of an AI/ML model's predictions on all possible inputs.
We employ an efficient sampler to obtain representative inputs and the output distribution of the trained model.
Our experiments demonstrate that OmniInput enables a more fine-grained comparison between models.
arXiv Detail & Related papers (2023-12-06T04:53:12Z)
- FERGI: Automatic Annotation of User Preferences for Text-to-Image Generation from Spontaneous Facial Expression Reaction [2.3691158404002066]
We develop and test a method to automatically score user preferences from their spontaneous facial expression reaction to the generated images.
We collect a dataset of Facial Expression Reaction to Generated Images (FERGI) and show that the activations of multiple facial action units (AUs) are highly correlated with user evaluations of the generated images.
We develop an FAU-Net (Facial Action Units Neural Network), which receives inputs from an AU estimation model, to automatically score user preferences for text-to-image generation.
arXiv Detail & Related papers (2023-12-05T23:33:49Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the performance of the Llama 2 model by up to 15% points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- GMValuator: Similarity-based Data Valuation for Generative Models [41.76259565672285]
We introduce Generative Model Valuator (GMValuator), the first training-free and model-agnostic approach to provide data valuation for generation tasks.
GMValuator is extensively evaluated on various datasets and generative architectures to demonstrate its effectiveness.
arXiv Detail & Related papers (2023-04-21T02:02:02Z)
- How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models [95.8037674226622]
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion.
Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
arXiv Detail & Related papers (2021-02-17T18:25:30Z)
- Residual Energy-Based Models for Text [46.22375671394882]
We show that the generations of auto-regressive language models can be reliably distinguished from real text by statistical discriminators.
This suggests that the auto-regressive models can be improved by incorporating the (globally normalized) discriminators into the generative process.
arXiv Detail & Related papers (2020-04-06T13:44:03Z)
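The residual energy-based modeling idea in the last entry can be illustrated with a small, hedged sketch (not the paper's exact training or sampling procedure): candidates are drawn from the base autoregressive language model and resampled with weights proportional to exp(-E(x)), where E is the energy assigned by a real-vs-generated discriminator. sample_from_lm and energy_fn are hypothetical placeholders.

import torch

def residual_ebm_sample(sample_from_lm, energy_fn, prompt, num_candidates=16):
    """Draw one continuation from p(x) proportional to p_LM(x | prompt) * exp(-E(prompt, x))."""
    candidates = [sample_from_lm(prompt) for _ in range(num_candidates)]  # proposals from the base LM
    energies = torch.tensor([energy_fn(prompt, c) for c in candidates])
    weights = torch.softmax(-energies, dim=0)  # lower energy -> higher resampling weight
    idx = torch.multinomial(weights, num_samples=1).item()
    return candidates[idx]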
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.