K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
- URL: http://arxiv.org/abs/2408.14468v1
- Date: Mon, 26 Aug 2024 17:58:20 GMT
- Title: K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
- Authors: Zhikai Li, Xuewen Liu, Dongrong Fu, Jianquan Li, Qingyi Gu, Kurt Keutzer, Zhen Dong
- Abstract summary: The Arena platform, which gathers user votes on model comparisons, can rank models according to human preferences.
We introduce K-Sort Arena, an efficient and reliable platform based on a key insight: images and videos possess higher perceptual intuitiveness than texts.
In our experiments, K-Sort Arena exhibits 16.3x faster convergence than the widely used Elo algorithm.
- Score: 30.744662265421788
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of visual generative models necessitates efficient and reliable evaluation methods. The Arena platform, which gathers user votes on model comparisons, can rank models according to human preferences. However, traditional Arena methods, while established, require an excessive number of comparisons for rankings to converge and are vulnerable to preference noise in voting, suggesting the need for better approaches tailored to contemporary evaluation challenges. In this paper, we introduce K-Sort Arena, an efficient and reliable platform based on a key insight: images and videos possess higher perceptual intuitiveness than texts, enabling rapid evaluation of multiple samples simultaneously. Consequently, K-Sort Arena employs K-wise comparisons, allowing K models to engage in free-for-all competitions, which yield much richer information than pairwise comparisons. To enhance the robustness of the system, we leverage probabilistic modeling and Bayesian updating techniques. We propose an exploration-exploitation-based matchmaking strategy to facilitate more informative comparisons. In our experiments, K-Sort Arena exhibits 16.3x faster convergence than the widely used Elo algorithm. To further validate its superiority and obtain a comprehensive leaderboard, we collect human feedback via crowdsourced evaluations of numerous cutting-edge text-to-image and text-to-video models. Thanks to its high efficiency, K-Sort Arena can continuously incorporate emerging models and update the leaderboard with minimal votes. Our project has undergone several months of internal testing and is now available at https://huggingface.co/spaces/ksort/K-Sort-Arena
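The abstract's core mechanics, a K-wise free-for-all result folded into Bayesian skill updates, can be illustrated with a small sketch. The code below decomposes one K-wise ranking into its implied pairwise outcomes and nudges Gaussian skill beliefs accordingly; the `BETA` noise scale, learning rate, and variance shrinkage are illustrative assumptions, not K-Sort Arena's actual update rule.

```python
import math
from itertools import combinations

BETA = 4.0  # assumed performance-noise scale (illustrative constant)

def phi(x):
    # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def kwise_update(skills, ranking, lr=0.1):
    """skills: {model: (mu, sigma)}; ranking: models ordered best-to-worst."""
    for winner, loser in combinations(ranking, 2):
        mu_w, sig_w = skills[winner]
        mu_l, sig_l = skills[loser]
        # predicted P(winner beats loser) under Gaussian skill beliefs
        denom = math.sqrt(2 * BETA ** 2 + sig_w ** 2 + sig_l ** 2)
        p_win = phi((mu_w - mu_l) / denom)
        # shift means by the surprise (1 - p_win); shrink uncertainty a bit
        delta = lr * (1.0 - p_win)
        skills[winner] = (mu_w + delta * sig_w ** 2 / denom, sig_w * 0.99)
        skills[loser] = (mu_l - delta * sig_l ** 2 / denom, sig_l * 0.99)
    return skills

skills = {m: (25.0, 8.0) for m in ["A", "B", "C", "D"]}
print(kwise_update(skills, ["C", "A", "D", "B"]))  # one free-for-all result
```

The point of the decomposition is visible in the counts: a single K-wise vote over K = 4 models yields 6 implied pairwise outcomes, versus 1 from a traditional pairwise vote.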
Related papers
- A Statistical Framework for Ranking LLM-Based Chatbots [57.59268154690763]
We propose a statistical framework that incorporates key advancements to address specific challenges in pairwise comparison analysis.
First, we introduce a factored tie model that enhances the ability to handle ties, an integral aspect of human-judged comparisons.
Second, we extend the framework to model covariance between competitors, enabling deeper insights into performance relationships and intuitive groupings into performance tiers.
Third, we resolve optimization challenges arising from parameter non-uniqueness by introducing novel constraints.
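For intuition, ties can enter a pairwise-comparison likelihood through a classical tie-aware Bradley-Terry variant. The sketch below fits a Rao-Kupper-style model by maximum likelihood; it is a stand-in for the paper's factored tie model, and the toy `data` and parameterization are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# (i, j, outcome): 1 if i wins, 0 if j wins, 0.5 for a tie (toy data)
data = [(0, 1, 1), (1, 2, 0.5), (0, 2, 1), (2, 1, 0)]

def nll(params, n_items=3):
    theta, log_nu = params[:n_items], params[n_items]
    nu = 1.0 + np.exp(log_nu)  # tie parameter, constrained > 1
    total = 0.0
    for i, j, y in data:
        pi, pj = np.exp(theta[i]), np.exp(theta[j])
        p_i = pi / (pi + nu * pj)          # P(i beats j)
        p_j = pj / (pj + nu * pi)          # P(j beats i)
        p_tie = 1.0 - p_i - p_j            # remaining mass goes to ties
        total -= np.log(p_i if y == 1 else p_j if y == 0 else p_tie)
    return total

res = minimize(nll, np.zeros(4), method="L-BFGS-B")
print("skills:", res.x[:3], "tie parameter:", 1 + np.exp(res.x[3]))
```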
arXiv Detail & Related papers (2024-12-24T12:54:19Z) - Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models [51.067146460271466]
Evaluation of visual generative models can be time-consuming and computationally expensive.
We propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations.
It offers four key advantages: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools.
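One way to picture such a multi-round evaluation is a loop that repeatedly probes whatever ability dimension is currently least certain and stops once estimates stabilize. The sketch below is a generic illustration of that idea, not the Evaluation Agent framework itself; `score_model` and the two dimensions are hypothetical stubs.

```python
import random
import statistics

def score_model(dimension):  # hypothetical stand-in for running the model
    return random.gauss({"fidelity": 0.7, "alignment": 0.5}[dimension], 0.1)

def evaluate(dimensions, rounds=20, tol=0.03):
    samples = {d: [score_model(d) for _ in range(3)] for d in dimensions}
    for _ in range(rounds):
        # probe the dimension with the widest spread (most uncertain)
        d = max(dimensions, key=lambda d: statistics.stdev(samples[d]))
        samples[d].append(score_model(d))
        sems = [statistics.stdev(samples[d]) / len(samples[d]) ** 0.5
                for d in dimensions]
        if max(sems) < tol:  # stop early once all estimates are stable
            break
    return {d: statistics.mean(v) for d, v in samples.items()}

print(evaluate(["fidelity", "alignment"]))
```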
arXiv Detail & Related papers (2024-12-10T18:52:39Z) - Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models [0.29687381456164]
VARCO Arena is a novel, cost-effective, and robust benchmarking approach for large language models.
Our results demonstrate that VARCO Arena not only produces reliable LLM rankings but also provides a scalable, adaptable solution for qualitative evaluation.
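A minimal sketch of tournament-style, reference-free ranking: run a bracket with a pairwise judge, award the winner the next rank slot, and rerun on the rest. VARCO Arena's actual bracket design and judging protocol may differ; the `judge` function and hidden `quality` scores here are hypothetical.

```python
import random

def judge(a, b, quality):  # hypothetical pairwise judge
    return a if quality[a] >= quality[b] else b

def tournament_rank(models, quality):
    ranking, remaining = [], list(models)
    while remaining:
        pool, losers = remaining[:], []
        random.shuffle(pool)
        while len(pool) > 1:  # run one bracket down to a single winner
            a, b = pool.pop(), pool.pop()
            winner = judge(a, b, quality)
            pool.insert(0, winner)
            losers.append(b if winner == a else a)
        ranking.append(pool[0])  # bracket winner takes the next rank slot
        remaining = losers       # rerun the bracket on the remaining models
    return ranking

quality = {"m1": 0.9, "m2": 0.4, "m3": 0.7, "m4": 0.6}
print(tournament_rank(list(quality), quality))  # best-to-worst
```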
arXiv Detail & Related papers (2024-11-02T15:23:28Z) - Curriculum Direct Preference Optimization for Diffusion and Consistency Models [110.08057135882356]
We propose a novel and enhanced version of DPO based on curriculum learning for text-to-image generation.
Our approach, Curriculum DPO, is compared against state-of-the-art fine-tuning approaches on three benchmarks.
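The curriculum idea can be sketched as ordering preference pairs from easy to hard before DPO-style training. Below, the reward gap serves as an assumed difficulty proxy, and `dpo_loss` shows the standard DPO objective on one pair; Curriculum DPO's actual difficulty measure and schedule are not reproduced here.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # standard DPO objective on a single (winner, loser) pair
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# pairs: (id, reward_winner, reward_loser); the gap is a difficulty proxy
pairs = [("p1", 0.9, 0.1), ("p2", 0.6, 0.5), ("p3", 0.8, 0.3)]
curriculum = sorted(pairs, key=lambda p: p[1] - p[2], reverse=True)
for pid, rw, rl in curriculum:  # train easy-to-hard (stub loop)
    print(pid, "difficulty gap:", round(rw - rl, 2))
print("example pair loss:", dpo_loss(-1.0, -2.0, -1.2, -1.8))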
arXiv Detail & Related papers (2024-05-22T13:36:48Z) - A State-Space Perspective on Modelling and Inference for Online Skill Rating [1.9253333342733674]
We introduce new approaches based on sequential Monte Carlo and discrete hidden Markov models.
We advocate for a state-space model perspective, wherein players' skills are represented as time-varying, and match results serve as observed quantities.
We examine the challenges of scaling up to numerous players and matches, highlighting the main approximations and reductions.
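A minimal state-space sketch: one player's skill follows a random walk, match results are Bernoulli observations, and a bootstrap particle filter tracks the posterior. The drift scale, logistic observation model, and known opponent skills are simplifying assumptions relative to the paper's SMC and HMM machinery.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000                              # number of particles
particles = rng.normal(0.0, 1.0, N)   # prior over skill at t = 0

# (opponent_skill, won) observed over time
matches = [(0.2, 1), (0.5, 1), (1.0, 0), (0.1, 1)]

for opp, won in matches:
    particles += rng.normal(0.0, 0.1, N)          # skill drifts over time
    p_win = 1.0 / (1.0 + np.exp(-(particles - opp)))
    w = p_win if won else 1.0 - p_win             # observation likelihood
    w /= w.sum()
    idx = rng.choice(N, size=N, p=w)              # multinomial resampling
    particles = particles[idx]
    print(f"skill estimate: {particles.mean():+.2f}")
```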
arXiv Detail & Related papers (2023-08-04T16:03:50Z) - Ranking with Confidence for Large Scale Comparison Data [2.486161976966064]
In this work, we leverage a generative data model that accounts for comparison noise to develop a fast, precise, and informative ranking from pairwise comparisons.
On real data, PD-Rank requires less computational time than active learning methods to achieve the same Kendall distance.
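To make the evaluation metric concrete, the sketch below aggregates noisy pairwise outcomes into a ranking (with simple Copeland counting, standing in for PD-Rank) and scores it by Kendall tau distance; the toy `outcomes` dictionary is invented for illustration.

```python
from itertools import combinations

def copeland_rank(items, outcomes):
    # outcomes[(i, j)] = number of times i beat j
    wins = {i: 0 for i in items}
    for i, j in combinations(items, 2):
        if outcomes.get((i, j), 0) >= outcomes.get((j, i), 0):
            wins[i] += 1
        else:
            wins[j] += 1
    return sorted(items, key=lambda i: -wins[i])

def kendall_distance(rank_a, rank_b):
    # number of item pairs the two rankings order differently
    pos = {x: k for k, x in enumerate(rank_b)}
    return sum(1 for x, y in combinations(rank_a, 2) if pos[x] > pos[y])

outcomes = {("a", "b"): 3, ("b", "a"): 1, ("b", "c"): 2, ("c", "a"): 1}
ranking = copeland_rank(["a", "b", "c"], outcomes)
print(ranking, kendall_distance(ranking, ["a", "b", "c"]))
```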
arXiv Detail & Related papers (2022-02-03T16:36:37Z) - LDNet: Unified Listener Dependent Modeling in MOS Prediction for
Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
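The listener-dependent idea: condition a MOS predictor on a listener embedding and, at inference time, query a reserved "mean listener" for stable scores. The sketch below (PyTorch) assumes toy feature dimensions and layer sizes; LDNet itself is a deeper network over speech features.

```python
import torch
import torch.nn as nn

class ListenerDependentMOS(nn.Module):
    def __init__(self, n_listeners, feat_dim=40, emb_dim=16):
        super().__init__()
        # one extra embedding slot reserved for the "mean listener"
        self.listener_emb = nn.Embedding(n_listeners + 1, emb_dim)
        self.mean_listener = n_listeners
        self.head = nn.Sequential(
            nn.Linear(feat_dim + emb_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feats, listener_ids=None):
        if listener_ids is None:  # mean-listener inference for stable MOS
            listener_ids = torch.full((feats.size(0),), self.mean_listener)
        emb = self.listener_emb(listener_ids)
        return self.head(torch.cat([feats, emb], dim=-1)).squeeze(-1)

model = ListenerDependentMOS(n_listeners=8)
feats = torch.randn(4, 40)  # stand-in for extracted speech features
print(model(feats))         # predicted MOS for the mean listener
```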
arXiv Detail & Related papers (2021-10-18T08:52:31Z) - Adaptive Sampling for Heterogeneous Rank Aggregation from Noisy Pairwise
Comparisons [85.5955376526419]
In rank aggregation problems, users exhibit various accuracy levels when comparing pairs of items.
We propose an elimination-based active sampling strategy, which estimates the ranking of items via noisy pairwise comparisons.
We prove that our algorithm can return the true ranking of items with high probability.
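The elimination idea can be sketched as: keep sampling comparisons, and remove an item once a Hoeffding-style confidence interval cleanly separates it from the rest. The simulator, error rate, and bound constants below are assumptions, and heterogeneous user accuracy, central to the paper, is not modeled.

```python
import math
import random

def noisy_compare(i, j, quality, p_err=0.2):
    # simulated noisy vote: returns the preferred item
    better, worse = (i, j) if quality[i] >= quality[j] else (j, i)
    return better if random.random() > p_err else worse

def eliminate_rank(items, quality, batch=300, delta=0.1, max_epochs=60):
    active, eliminated = list(items), []
    wins = {i: 0 for i in items}
    plays = {i: 0 for i in items}
    for _ in range(max_epochs):
        if len(active) <= 1:
            break
        for _ in range(batch):
            i, j = random.sample(active, 2)
            wins[noisy_compare(i, j, quality)] += 1
            plays[i] += 1
            plays[j] += 1
        rate = {i: wins[i] / max(plays[i], 1) for i in active}
        rad = {i: math.sqrt(math.log(2 / delta) / (2 * max(plays[i], 1)))
               for i in active}
        worst = min(active, key=lambda i: rate[i])
        # eliminate once the worst item's upper bound is clearly separated
        if all(rate[worst] + rad[worst] < rate[i] - rad[i]
               for i in active if i != worst):
            active.remove(worst)
            eliminated.append(worst)
    # best-to-worst: survivors by win rate, then reverse elimination order
    return sorted(active, key=lambda i: -rate[i]) + eliminated[::-1]

quality = {"a": 0.9, "b": 0.6, "c": 0.3}
print(eliminate_rank(list(quality), quality))
```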
arXiv Detail & Related papers (2021-10-08T13:51:55Z) - Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking
Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn, from labeled data, a scoring function that ranks positive individuals higher than negative ones.
There are rising concerns about whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
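A crude illustration of model-agnostic post-processing: shift each protected group's scores toward a common mean before re-ranking, trading some utility for cross-group balance. This is a deliberately simple stand-in; the paper's framework optimizes the fairness-utility trade-off far more carefully.

```python
import numpy as np

def adjust_scores(scores, groups):
    """scores: (n,) array; groups: (n,) array of protected-group labels."""
    adjusted = scores.astype(float)
    overall = scores.mean()
    for g in np.unique(groups):
        mask = groups == g
        # center each group's scores on the global mean score
        adjusted[mask] += overall - scores[mask].mean()
    return adjusted

scores = np.array([0.9, 0.8, 0.4, 0.35, 0.3, 0.2])
groups = np.array(["A", "A", "B", "A", "B", "B"])
ranked = np.argsort(-adjust_scores(scores, groups))
print(ranked, groups[ranked])  # ranking after group-mean centering
```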
arXiv Detail & Related papers (2020-06-15T10:08:39Z) - Active Sampling for Pairwise Comparisons via Approximate Message Passing
and Information Gain Maximization [5.771869590520189]
We propose ASAP, an active sampling algorithm based on approximate message passing and expected information gain.
We show that ASAP offers the highest accuracy of inferred scores compared to the existing methods.
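Information-gain-driven pair selection can be approximated cheaply: under Gaussian beliefs over item scores, pick the pair whose predicted outcome is most uncertain. The Monte Carlo proxy below stands in for ASAP's approximate-message-passing machinery; the belief values `mu` and `sig` are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.0, 0.3, 0.1])   # posterior means of item scores (toy)
sig = np.array([1.0, 0.4, 0.8])  # posterior std devs (toy)

def outcome_entropy(i, j, n=4000):
    # P(i preferred over j), marginalized over score uncertainty
    si = rng.normal(mu[i], sig[i], n)
    sj = rng.normal(mu[j], sig[j], n)
    p = np.mean(1.0 / (1.0 + np.exp(-(si - sj))))
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

pairs = [(0, 1), (0, 2), (1, 2)]
# outcome entropy is a common proxy for expected information gain:
# the most uncertain comparisons are the most informative ones
best = max(pairs, key=lambda ij: outcome_entropy(*ij))
print("next comparison to collect:", best)
```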
arXiv Detail & Related papers (2020-04-12T20:48:10Z)