Related papers: Finding the Subjective Truth: Collecting 2 Million Votes for Comprehensive Gen-AI Model Evaluation

Finding the Subjective Truth: Collecting 2 Million Votes for Comprehensive Gen-AI Model Evaluation

URL: http://arxiv.org/abs/2409.11904v2
Date: Tue, 15 Oct 2024 14:23:46 GMT
Title: Finding the Subjective Truth: Collecting 2 Million Votes for Comprehensive Gen-AI Model Evaluation
Authors: Dimitrios Christodoulou, Mads Kuhlmann-Jørgensen,
Abstract summary: Our study collected over 2 million annotations across 4,512 images, evaluating four prominent models on style preference, coherence, and text-to-image alignment. We show that our approach makes it feasible to comprehensively rank image generation models based on a vast pool of annotators.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Efficiently evaluating the performance of text-to-image models is difficult as it inherently requires subjective judgment and human preference, making it hard to compare different models and quantify the state of the art. Leveraging Rapidata's technology, we present an efficient annotation framework that sources human feedback from a diverse, global pool of annotators. Our study collected over 2 million annotations across 4,512 images, evaluating four prominent models (DALL-E 3, Flux.1, MidJourney, and Stable Diffusion) on style preference, coherence, and text-to-image alignment. We demonstrate that our approach makes it feasible to comprehensively rank image generation models based on a vast pool of annotators and show that the diverse annotator demographics reflect the world population, significantly decreasing the risk of biases.

Related papers

Towards Evaluating Robustness of Prompt Adherence in Text to Image Models [0.0]
This paper aims to establish a comprehensive evaluation framework for Text-to-Image models.<n>We created a novel dataset that aimed to assess the robustness of these models in generating images that conform to the specified factors of variation in the input text prompts.<n>Our results reveal that these models struggle to create simple binary images with only two factors of variation: a simple geometric shape and its location.
arXiv Detail & Related papers (2025-07-09T18:40:17Z)
Text-to-Image Diffusion Models Cannot Count, and Prompt Refinement Cannot Help [18.70937620674227]
We introduce T2ICountBench, a novel benchmark designed to rigorously evaluate the counting ability of state-of-the-art text-to-image diffusion models. Our evaluations reveal that all state-of-the-art diffusion models fail to generate the correct number of objects, with accuracy dropping significantly as the number of objects increases.
arXiv Detail & Related papers (2025-03-10T03:28:18Z)
Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models [51.067146460271466]
Evaluation of visual generative models can be time-consuming and computationally expensive. We propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations. It offers four key advantages: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools.
arXiv Detail & Related papers (2024-12-10T18:52:39Z)
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM) VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
Benchmark Granularity and Model Robustness for Image-Text Retrieval [44.045767657945895]
We show how dataset granularity and query perturbations affect retrieval performance and robustness.<n>We show that richer captions consistently enhance retrieval, especially in text-to-image tasks.<n>Our results highlight variation in model robustness and a dataset-dependent relationship between caption granularity and sensitivity perturbation.
arXiv Detail & Related papers (2024-07-21T18:08:44Z)
EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models [16.18275805302776]
We propose EvalAlign, a metric characterized by its accuracy, stability, and fine granularity. We develop evaluation protocols that focus on two key dimensions: image faithfulness and text-image alignment. EvalAlign aligns more closely with human preferences than existing metrics, confirming its effectiveness and utility in model assessment.
arXiv Detail & Related papers (2024-06-24T11:56:15Z)
Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects. We train a reward model VP-Score over VisionPrefer to guide the training of text-to-image generative models and the preference prediction accuracy of VP-Score is comparable to human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z)
Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models [85.96013373385057]
Fine-tuning text-to-image models with reward functions trained on human feedback data has proven effective for aligning model behavior with human intent. However, excessive optimization with such reward models, which serve as mere proxy objectives, can compromise the performance of fine-tuned models. We propose TextNorm, a method that enhances alignment based on a measure of reward model confidence estimated across a set of semantically contrastive text prompts.
arXiv Detail & Related papers (2024-04-02T11:40:38Z)
TIBET: Identifying and Evaluating Biases in Text-to-Image Generative Models [22.076898042211305]
We propose a general approach to study and quantify a broad spectrum of biases, for any TTI model and for any prompt. Our approach automatically identifies potential biases that might be relevant to the given prompt, and measures those biases. We show that our method is uniquely capable of explaining complex multi-dimensional biases through semantic concepts.
arXiv Detail & Related papers (2023-12-03T02:31:37Z)
Holistic Evaluation of Text-To-Image Models [153.47415461488097]
We introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM) We identify 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths.
arXiv Detail & Related papers (2023-11-07T19:00:56Z)
Exploring the Robustness of Human Parsers Towards Common Corruptions [99.89886010550836]
We construct three corruption robustness benchmarks, termed LIP-C, ATR-C, and Pascal-Person-Part-C, to assist us in evaluating the risk tolerance of human parsing models. Inspired by the data augmentation strategy, we propose a novel heterogeneous augmentation-enhanced mechanism to bolster robustness under commonly corrupted conditions.
arXiv Detail & Related papers (2023-09-02T13:32:14Z)
MiVOLO: Multi-input Transformer for Age and Gender Estimation [0.0]
We present MiVOLO, a straightforward approach for age and gender estimation using the latest vision transformer. Our method integrates both tasks into a unified dual input/output model. We compare our model's age recognition performance with human-level accuracy and demonstrate that it significantly outperforms humans across a majority of age ranges.
arXiv Detail & Related papers (2023-07-10T14:58:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.