GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?
- URL: http://arxiv.org/abs/2602.06013v1
- Date: Thu, 05 Feb 2026 18:52:48 GMT
- Title: GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?
- Authors: Ruihang Li, Leigang Qu, Jingxu Zhang, Dongnan Gui, Mengde Xu, Xiaosong Zhang, Han Hu, Wenjie Wang, Jiaqi Wang,
- Abstract summary: We introduce a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard.
- Score: 29.804627410258732
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding: simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
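To make the pairwise protocol concrete, below is a minimal sketch (not the authors' released code) of how a judge's pairwise verdicts could be aggregated into a model ranking and then compared against a reference leaderboard via Spearman correlation. The Bradley-Terry aggregation, the model names, the toy verdicts, and the reference scores are illustrative assumptions; GenArena's exact aggregation scheme may differ.

```python
# Sketch: aggregate pairwise judge verdicts into model strengths (Bradley-Terry),
# then measure Spearman correlation against a reference leaderboard.
from collections import defaultdict
from scipy.stats import spearmanr

# Hypothetical pairwise verdicts: (winner, loser) as decided by a VLM judge.
pairwise_wins = [
    ("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c"),
    ("model_a", "model_b"), ("model_c", "model_b"), ("model_a", "model_c"),
]

models = sorted({m for pair in pairwise_wins for m in pair})
wins = defaultdict(lambda: defaultdict(int))
for winner, loser in pairwise_wins:
    wins[winner][loser] += 1

# Bradley-Terry strengths via the standard MM update:
# p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j),
# where W_i is the total wins of i and n_ij the number of i-vs-j comparisons.
strength = {m: 1.0 for m in models}
for _ in range(200):
    new = {}
    for i in models:
        total_wins = sum(wins[i].values())
        denom = sum(
            (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
            for j in models if j != i
        )
        new[i] = total_wins / denom if denom > 0 else strength[i]
    # Normalize to keep strengths on a fixed scale across iterations.
    norm = sum(new.values())
    strength = {m: v / norm for m, v in new.items()}

# Hypothetical reference leaderboard scores (e.g., human-vote Elo).
reference_scores = {"model_a": 1250.0, "model_b": 1100.0, "model_c": 1180.0}

judge_scores = [strength[m] for m in models]
leaderboard_scores = [reference_scores[m] for m in models]
rho, _ = spearmanr(judge_scores, leaderboard_scores)
print(f"Spearman correlation with reference leaderboard: {rho:.2f}")
```

Aggregating relative judgments in this way avoids asking the judge for an absolute score, which is one plausible reason the pairwise protocol is more stable than pointwise scoring.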
Related papers
- MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation [48.84450712826316]
MSVBench is the first comprehensive benchmark featuring hierarchical scripts and reference images tailored for Multi-Shot Video generation.
We propose a hybrid evaluation framework that synergizes the high-level semantic reasoning of Large Multimodal Models with the fine-grained perceptual rigor of domain-specific expert models.
arXiv Detail & Related papers (2026-02-27T12:26:34Z)
- VIPER: Process-aware Evaluation for Generative Video Reasoning [64.86465792516658]
We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning.
Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome hacking.
arXiv Detail & Related papers (2025-12-31T16:31:59Z)
- Appreciate the View: A Task-Aware Evaluation Framework for Novel View Synthesis [15.922599086027098]
Novel View Synthesis (NVS) aims to generate realistic images of given content from unseen viewpoints.
Existing evaluation metrics struggle to assess whether a generated image is both realistic and faithful to the source view.
We introduce two complementary evaluation metrics: a reference-based score, $D_\text{PRISM}$, and a reference-free score, $\text{MMD}_\text{PRISM}$.
arXiv Detail & Related papers (2025-11-16T16:28:08Z)
- Gesture Generation (Still) Needs Improved Human Evaluation Practices: Insights from a Community-Driven State-of-the-Art Benchmark [55.41250396114216]
We review human evaluation practices in automated, speech-driven 3D gesture generation.
We introduce a detailed human evaluation protocol for the widely-used BEAT2 motion-capture dataset.
arXiv Detail & Related papers (2025-11-03T05:17:28Z)
- ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation [51.297873393639456]
ArtifactsBench is a framework for automated visual code generation evaluation.
Our framework renders each generated artifact and captures its dynamic behavior through temporal screenshots.
We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading Large Language Models.
arXiv Detail & Related papers (2025-07-07T12:53:00Z)
- Establishing a Unified Evaluation Framework for Human Motion Generation: A Comparative Analysis of Metrics [6.708543240320757]
This paper presents a detailed review of eight evaluation metrics for human motion generation.
We propose standardized practices through a unified evaluation setup to facilitate consistent model comparisons.
We introduce a novel metric that assesses diversity in temporal distortion by analyzing warping diversity.
arXiv Detail & Related papers (2024-05-13T12:10:57Z)
- ImagenHub: Standardizing the evaluation of conditional image generation models [48.51117156168]
This paper proposes ImagenHub, which is a one-stop library to standardize the inference and evaluation of all conditional image generation models.
We design two human evaluation scores, i.e. Semantic Consistency and Perceptual Quality, along with comprehensive guidelines to evaluate generated images.
Our human evaluation achieves high inter-worker agreement, with a Krippendorff's alpha above 0.4 on 76% of the evaluated models (see the agreement-check sketch after this list).
arXiv Detail & Related papers (2023-10-02T19:41:42Z)
- GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)
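As referenced in the ImagenHub entry above, here is a minimal sketch of the kind of inter-rater agreement check that entry describes: a per-model Krippendorff's alpha computed over a raters-by-images matrix of ordinal scores, with 0.4 as the reliability threshold. It assumes the third-party `krippendorff` Python package (`pip install krippendorff`) and hypothetical ratings; it is not ImagenHub's actual evaluation code.

```python
# Sketch: per-model inter-rater agreement via Krippendorff's alpha.
import numpy as np
import krippendorff  # third-party package, assumed available

# Hypothetical ratings: model -> matrix of shape (num_raters, num_images),
# scores on a 5-point ordinal scale; np.nan would mark missing ratings.
ratings_per_model = {
    "model_x": np.array([[5, 4, 4, 3], [5, 4, 3, 3], [4, 4, 4, 3]], dtype=float),
    "model_y": np.array([[2, 5, 1, 4], [5, 1, 4, 2], [1, 4, 2, 5]], dtype=float),
}

# Compute ordinal-level alpha for each model's rating matrix.
alphas = {
    name: krippendorff.alpha(reliability_data=scores, level_of_measurement="ordinal")
    for name, scores in ratings_per_model.items()
}

# Report how many models reach the 0.4 reliability threshold.
reliable = sum(a > 0.4 for a in alphas.values())
print(alphas)
print(f"{reliable}/{len(alphas)} models reach Krippendorff's alpha > 0.4")
```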