GenAI Arena: An Open Evaluation Platform for Generative Models
- URL: http://arxiv.org/abs/2406.04485v4
- Date: Mon, 11 Nov 2024 06:32:24 GMT
- Title: GenAI Arena: An Open Evaluation Platform for Generative Models
- Authors: Dongfu Jiang, Max Ku, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, Wenhu Chen
- Abstract summary: This paper proposes GenAI-Arena, an open platform for evaluating image and video generative models.
GenAI-Arena aims to provide a more democratic and accurate measure of model performance.
It covers three tasks: text-to-image generation, text-to-video generation, and image editing.
- Score: 33.246432399321826
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative AI has made remarkable strides to revolutionize fields such as image and video generation. These advancements are driven by innovative algorithms, architectures, and data. However, the rapid proliferation of generative models has highlighted a critical gap: the absence of trustworthy evaluation metrics. Current automatic assessments such as FID, CLIP, and FVD often fail to capture the nuanced quality and user satisfaction associated with generative outputs. This paper proposes GenAI-Arena, an open platform for evaluating image and video generative models in which users actively participate in the evaluation. By leveraging collective user feedback and votes, GenAI-Arena aims to provide a more democratic and accurate measure of model performance. It covers three tasks: text-to-image generation, text-to-video generation, and image editing. Currently, we cover a total of 35 open-source generative models. GenAI-Arena has been operating for seven months, amassing over 9000 votes from the community. We describe our platform, analyze the data, and explain the statistical methods for ranking the models. To further promote research in building model-based evaluation metrics, we release a cleaned version of our preference data for the three tasks, namely GenAI-Bench. We prompt existing multimodal models like Gemini and GPT-4o to mimic human voting. We compute accuracy by comparing model votes with human votes to understand their judging abilities. Our results show that existing multimodal models still lag in assessing generated visual content; even the best model, GPT-4o, achieves only an average accuracy of 49.19 across the three generative tasks. Open-source MLLMs perform even worse due to their lack of instruction-following and reasoning ability in complex vision scenarios.
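The vote-based ranking described above can be illustrated with a simple Elo-style rating computed from pairwise votes, the general recipe behind arena-style leaderboards. This is a minimal sketch, not the platform's actual code; the model names, K-factor, and example votes are illustrative, and ties are omitted for brevity.

```python
# Minimal Elo-style rating from pairwise votes, as used by arena-style
# leaderboards. Illustrative sketch only, not GenAI-Arena's implementation.
from collections import defaultdict

def elo_ratings(votes, k=32.0, base=1500.0):
    """votes: iterable of (model_a, model_b, winner), winner in {'a', 'b'}."""
    ratings = defaultdict(lambda: base)
    for a, b, winner in votes:
        ra, rb = ratings[a], ratings[b]
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))  # predicted win rate of a
        score_a = 1.0 if winner == "a" else 0.0
        ratings[a] = ra + k * (score_a - expected_a)
        ratings[b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

# Hypothetical votes; real data comes from user comparisons on the platform.
votes = [("model_x", "model_y", "a"), ("model_y", "model_z", "b")]
print(sorted(elo_ratings(votes).items(), key=lambda kv: -kv[1]))
```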
Related papers
- 3DGen-Bench: Comprehensive Benchmark Suite for 3D Generative Models [94.48803082248872]
3D generation is experiencing rapid advancements, while the development of 3D evaluation has not kept pace.
We develop a large-scale human preference dataset 3DGen-Bench.
We then train a CLIP-based scoring model, 3DGen-Score, and an MLLM-based automatic evaluator, 3DGen-Eval.
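The core of a CLIP-based scorer like the one named above is typically the cosine similarity between image and text embeddings. The sketch below shows that generic recipe with the openai/clip-vit-base-patch32 checkpoint; it is not the authors' 3DGen-Score implementation.

```python
# Hedged sketch of a generic CLIP prompt-image score, the usual starting
# point for CLIP-based preference scorers. Not the 3DGen-Score model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity between the L2-normalized image and text embeddings.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```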
arXiv Detail & Related papers (2025-03-27T17:53:00Z)
- Exploring Bias in over 100 Text-to-Image Generative Models [49.60774626839712]
We investigate bias trends in text-to-image generative models over time, focusing on the increasing availability of models through open platforms like Hugging Face.
We assess bias across three key dimensions: (i) distribution bias, (ii) generative hallucination, and (iii) generative miss-rate.
Our findings indicate that artistic and style-transferred models exhibit significant bias, whereas foundation models, benefiting from broader training distributions, are becoming progressively less biased.
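The paper's exact metric definitions are not reproduced here; as a generic illustration only, distribution bias over some predicted attribute can be measured as the distance between the generated attribute distribution and a balanced one:

```python
# Generic illustration of a distribution-bias measure (not the paper's
# exact definition): total variation distance from a balanced distribution.
from collections import Counter

def distribution_bias(attributes):
    """attributes: one predicted attribute label per generated image."""
    counts = Counter(attributes)
    n, k = len(attributes), len(counts)
    return 0.5 * sum(abs(c / n - 1.0 / k) for c in counts.values())

print(distribution_bias(["male"] * 80 + ["female"] * 20))  # 0.3
```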
arXiv Detail & Related papers (2025-03-11T03:40:44Z)
- GenAI vs. Human Fact-Checkers: Accurate Ratings, Flawed Rationales [2.3475022003300055]
GPT-4o, one of the most widely used AI models in consumer applications, outperforms other models, but all models exhibit only moderate agreement with human coders.
We also assess the effectiveness of summarized versus full content inputs, finding that summarized content holds promise for improving efficiency without sacrificing accuracy.
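Agreement between model ratings and human coders is commonly quantified with Cohen's kappa; the sketch below shows that standard computation (the paper's own statistic may differ), with made-up labels.

```python
# Cohen's kappa: chance-corrected agreement between two raters.
# Labels here are invented for illustration.
from sklearn.metrics import cohen_kappa_score

human = ["true", "false", "mixed", "true", "false"]
model = ["true", "false", "true", "true", "mixed"]
print(cohen_kappa_score(human, model))  # 1.0 = perfect, 0.0 = chance-level
```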
arXiv Detail & Related papers (2025-02-20T17:47:40Z)
- Benchmarking Generative AI Models for Deep Learning Test Input Generation [6.674615464230326]
Test Input Generators (TIGs) are crucial to assess the ability of Deep Learning (DL) image classifiers to provide correct predictions for inputs beyond their training and test sets.
Recent advancements in Generative AI (GenAI) models have made them a powerful tool for creating and manipulating synthetic images.
We benchmark and combine different GenAI models with TIGs, assessing their effectiveness, efficiency, and quality of the generated test images.
arXiv Detail & Related papers (2024-12-23T15:30:42Z)
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and automatically generates visual question-answer pairs from them.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models [146.18107944503436]
Molmo is a new family of VLMs that are state-of-the-art in their class of openness.
Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators.
We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future.
arXiv Detail & Related papers (2024-09-25T17:59:51Z)
- Data Augmentation for Image Classification using Generative AI [8.74488498507946]
Data augmentation is a promising solution for expanding dataset size.
Recent approaches use generative AI models to improve dataset diversity.
We propose Automated Generative Data Augmentation (AGA).
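The general idea of generative data augmentation can be sketched with a text-to-image pipeline that synthesizes extra training images per class. This is a hedged illustration of the concept, not the AGA pipeline; the diffusers library, the stable-diffusion-v1-5 checkpoint, and the prompt template are assumptions.

```python
# Hedged sketch of generative data augmentation: synthesize additional
# training images for a class from a text prompt. Not the paper's method.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def augment(class_name: str, n: int):
    # Generate n synthetic images to append to the training set.
    prompts = [f"a photo of a {class_name}"] * n
    return pipe(prompts).images  # list of PIL images
```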
arXiv Detail & Related papers (2024-08-31T21:16:43Z)
- Enhanced Infield Agriculture with Interpretable Machine Learning Approaches for Crop Classification [0.49110747024865004]
This research evaluates four approaches to crop classification: traditional ML with handcrafted features (SIFT, ORB, and color histograms); a custom-designed CNN; an established DL architecture, AlexNet; and transfer learning on five models pre-trained on ImageNet.
Xception outperformed all of them in terms of generalization, achieving 98% accuracy on the test data, with a model size of 80.03 MB and a prediction time of 0.0633 seconds.
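The winning transfer-learning setup can be sketched as an ImageNet-pretrained Xception backbone with a new classification head, shown below in Keras. This is a minimal sketch; the class count, dropout rate, and training configuration are illustrative, not the paper's.

```python
# Hedged sketch of ImageNet transfer learning with Xception, the
# best-performing approach reported above. Hyperparameters are illustrative.
from tensorflow import keras

num_classes = 5  # illustrative number of crop classes

base = keras.applications.Xception(
    weights="imagenet", include_top=False, input_shape=(299, 299, 3), pooling="avg"
)
base.trainable = False  # freeze the pre-trained feature extractor

model = keras.Sequential([
    base,
    keras.layers.Dropout(0.2),
    keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```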
arXiv Detail & Related papers (2024-08-22T14:20:34Z)
- Deep Generative Models in Robotics: A Survey on Learning from Multimodal Demonstrations [52.11801730860999]
In recent years, the robot learning community has shown increasing interest in using deep generative models to capture the complexity of large datasets.
We present the different types of models that the community has explored, such as energy-based models, diffusion models, action value maps, or generative adversarial networks.
We also present the different types of applications in which deep generative models have been used, from grasp generation to trajectory generation or cost learning.
arXiv Detail & Related papers (2024-08-08T11:34:31Z)
- GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation [103.3465421081531]
VQAScore is a metric measuring the likelihood that a VQA model views an image as accurately depicting the prompt.
Ranking by VQAScore is 2x to 3x more effective than other scoring methods like PickScore, HPSv2, and ImageReward.
We release a new GenAI-Rank benchmark with over 40,000 human ratings to evaluate scoring metrics on ranking images generated from the same prompt.
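The VQAScore idea can be sketched as the probability that a VQA model answers "yes" when asked whether the image shows the prompt. The released VQAScore uses a CLIP-FlanT5 model; BLIP-2 stands in below as an illustrative substitute, so treat this as an approximation of the idea rather than the authors' code.

```python
# Hedged sketch of the VQAScore idea: P("yes") for a prompt-matching
# question. Uses BLIP-2 as a stand-in; not the official implementation.
import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForConditionalGeneration

processor = AutoProcessor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl")

def vqascore(image: Image.Image, prompt: str) -> float:
    question = f'Does this figure show "{prompt}"? Please answer yes or no.'
    inputs = processor(images=image, text=question, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=1,
                             output_scores=True, return_dict_in_generate=True)
    probs = out.scores[0].softmax(dim=-1)  # distribution over first answer token
    yes_id = processor.tokenizer("yes", add_special_tokens=False).input_ids[0]
    return float(probs[0, yes_id])
```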
arXiv Detail & Related papers (2024-06-19T18:00:07Z)
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that large conditional generative models are hard to judge with simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
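One common way to fuse many objective metrics into a single ranking score is to regress them onto human opinion scores; the sketch below shows that generic recipe with random placeholder data, and should not be read as EvalCrafter's exact procedure.

```python
# Hedged sketch: fuse several objective metrics into one score by fitting
# a linear model to human opinion scores. Data here is random placeholder.
import numpy as np
from sklearn.linear_model import LinearRegression

metric_scores = np.random.rand(100, 17)  # 100 videos x 17 objective metrics
human_scores = np.random.rand(100)       # aligned human opinion scores

fuser = LinearRegression().fit(metric_scores, human_scores)
final_score = fuser.predict(metric_scores)  # one fused score per video
```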
arXiv Detail & Related papers (2023-10-17T17:50:46Z)
- MiVOLO: Multi-input Transformer for Age and Gender Estimation [0.0]
We present MiVOLO, a straightforward approach for age and gender estimation using the latest vision transformer.
Our method integrates both tasks into a unified dual input/output model.
We compare our model's age recognition performance with human-level accuracy and demonstrate that it significantly outperforms humans across a majority of age ranges.
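The unified dual-output design can be sketched as a shared feature backbone feeding two task heads, one for age regression and one for gender classification. The PyTorch sketch below shows only that pattern; the real MiVOLO uses a VOLO backbone and also takes a body crop as a second input.

```python
# Hedged sketch of a dual-output head for joint age/gender prediction,
# the general pattern behind MiVOLO. Feature dimension is illustrative.
import torch
import torch.nn as nn

class AgeGenderHead(nn.Module):
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.age = nn.Linear(feat_dim, 1)     # age regression
        self.gender = nn.Linear(feat_dim, 2)  # gender classification

    def forward(self, features: torch.Tensor):
        return self.age(features).squeeze(-1), self.gender(features)

head = AgeGenderHead()
age, gender_logits = head(torch.randn(4, 768))  # batch of backbone features
```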
arXiv Detail & Related papers (2023-07-10T14:58:10Z)
- Composing Ensembles of Pre-trained Models via Iterative Consensus [95.10641301155232]
We propose a unified framework for composing ensembles of different pre-trained models.
We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization.
We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer.
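The closed-loop generator/scorer idea can be sketched as a loop in which a generator proposes candidates, an ensemble of scorers votes, and the best candidate seeds the next round. Below is a minimal sketch of that pattern; `generate` and the scorer functions are placeholders supplied by the caller, not APIs from the paper.

```python
# Hedged sketch of closed-loop iterative consensus between a generator and
# an ensemble of scorers. All callables are user-supplied placeholders.
def iterative_consensus(generate, scorers, prompt, rounds=3, n_candidates=8):
    best = None
    for _ in range(rounds):
        # Propose candidates, conditioning on the current best (None at first).
        candidates = [generate(prompt, seed=best) for _ in range(n_candidates)]
        # Consensus score: average the scores from all scorers.
        best = max(candidates, key=lambda c: sum(s(c) for s in scorers) / len(scorers))
    return best
```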
arXiv Detail & Related papers (2022-10-20T18:46:31Z)