Related papers: Inclusion Arena: An Open Platform for Evaluating Large Foundation Models with Real-World Apps

URL: http://arxiv.org/abs/2508.11452v2
Date: Tue, 02 Sep 2025 08:20:59 GMT
Title: Inclusion Arena: An Open Platform for Evaluating Large Foundation Models with Real-World Apps
Authors: Kangyu Wang, Hongliang He, Lin Liu, Ruiqi Liang, Zhenzhong Lan, Jianguo Li
Abstract summary: Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have ushered in a new era of AI capabilities. We present Inclusion Arena, a live leaderboard that ranks models based on human feedback collected directly from applications.
Score: 33.86371712677534
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have ushered in a new era of AI capabilities, demonstrating near-human-level performance across diverse scenarios. While numerous benchmarks (e.g., MMLU) and leaderboards (e.g., Chatbot Arena) have been proposed to help evolve the development of LLMs and MLLMs, most rely on static datasets or crowdsourced general-domain prompts, often falling short of reflecting performance in real-world applications. To bridge this critical gap, we present Inclusion Arena, a live leaderboard that ranks models based on human feedback collected directly from AI-powered applications. Our platform integrates pairwise model comparisons into natural user interactions, ensuring evaluations reflect practical usage scenarios. For robust model ranking, we employ the Bradley-Terry model augmented with two key innovations: (1) Placement Matches, a cold-start mechanism to quickly estimate initial ratings for newly integrated models, and (2) Proximity Sampling, an intelligent comparison strategy that prioritizes battles between models of similar capabilities to maximize information gain and enhance rating stability. Extensive empirical analyses and simulations demonstrate that Inclusion Arena yields reliable and stable rankings, exhibits higher data transitivity compared to general crowdsourced datasets, and significantly mitigates the risk of malicious manipulation. By fostering an open alliance between foundation models and real-world applications, Inclusion Arena aims to accelerate the development of LLMs and MLLMs truly optimized for practical, user-centric deployments. The platform is publicly accessible at https://www.tbox.cn/about/model-ranking.
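
The abstract names the Bradley-Terry model and Proximity Sampling but gives no formulas, so the following is only a minimal Python sketch of the general ideas, not the platform's implementation. It fits Bradley-Terry strengths from a win-count matrix with the classic minorization-maximization (Zermelo) update, then samples the next pair to compare with a weight that decays as the gap in log-strength grows. The function names, the `temperature` parameter, and the exponential weighting are assumptions made for illustration; the paper's actual sampling rule and Placement Matches procedure may differ.

```python
import math
import random


def fit_bradley_terry(wins, n_models, iters=200, tol=1e-9):
    """Estimate Bradley-Terry strengths with the MM (Zermelo) update.

    wins[i][j] is the number of times model i beat model j.
    Returns strengths p that sum to 1, where
    P(i beats j) = p[i] / (p[i] + p[j]).
    """
    p = [1.0 / n_models] * n_models
    for _ in range(iters):
        new_p = []
        for i in range(n_models):
            total_wins = sum(wins[i])
            # Sum over opponents of (games played against j) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n_models)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        new_p = [x / scale for x in new_p]
        delta = max(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if delta < tol:
            break
    return p


def sample_pair_by_proximity(ratings, temperature=1.0, rng=random):
    """Pick a pair (i, j) with i < j, favoring models of similar strength.

    Hypothetical weighting: exp(-|log r_i - log r_j| / temperature).
    Ratings must be strictly positive.
    """
    pairs, weights = [], []
    for i in range(len(ratings)):
        for j in range(i + 1, len(ratings)):
            gap = abs(math.log(ratings[i]) - math.log(ratings[j]))
            pairs.append((i, j))
            weights.append(math.exp(-gap / temperature))
    return rng.choices(pairs, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Toy data: three models, ten battles per pair.
    wins = [[0, 8, 9], [2, 0, 7], [1, 3, 0]]
    strengths = fit_bradley_terry(wins, 3)
    print("strengths:", strengths)
    print("next battle:", sample_pair_by_proximity(strengths))
```

In this sketch a lower `temperature` concentrates battles on closely rated models, while a higher value approaches uniform pair sampling. A model with zero wins would be driven to strength 0, so real use would need smoothing or a prior before taking logarithms.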

Related papers

Reinforcement World Model Learning for LLM-based Agents [60.65003139516272]
Reinforcement World Model Learning (RWML) is a self-supervised method that learns action-conditioned world models for LLM-based agents. Our method aligns simulated next states produced by the model with realized next states observed from the environment. We evaluate our method on ALFWorld and $\tau^2$ Bench and observe significant gains over the base model, despite being entirely self-supervised.
arXiv Detail & Related papers (2026-02-05T16:30:08Z)
DAMBench: A Multi-Modal Benchmark for Deep Learning-based Atmospheric Data Assimilation [14.776071715723262]
We introduce DAMBench, the first large-scale multi-modal benchmark to evaluate data-driven DA models under realistic atmospheric conditions. DAMBench integrates high-quality background states from state-of-the-art forecasting systems and real-world multi-modal observations. We provide unified evaluation protocols and benchmark representative data assimilation approaches.
arXiv Detail & Related papers (2025-11-03T11:26:26Z)
SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model [88.04128601981145]
We introduce SimuRA, a goal-oriented architecture for generalized agentic reasoning. SimuRA overcomes the limitations of autoregressive reasoning by introducing a world model for planning via simulation. World-model-based planning, in particular, shows a consistent advantage of up to 124% over autoregressive planning.
arXiv Detail & Related papers (2025-07-31T17:57:20Z)
Revisiting Multi-Agent World Modeling from a Diffusion-Inspired Perspective [54.77404771454794]
We develop a flexible and robust world model for Multi-Agent Reinforcement Learning (MARL) using diffusion models. Our method, Diffusion-Inspired Multi-Agent world model (DIMA), achieves state-of-the-art performance across multiple multi-agent control benchmarks.
arXiv Detail & Related papers (2025-05-27T09:11:38Z)
Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models [66.51871176061195]
Decentralized Arena (dearena) is a fully automated framework leveraging collective intelligence from all large language models to evaluate each other. dearena attains up to 97% correlation with human judgements, while significantly reducing the cost.
arXiv Detail & Related papers (2025-05-19T07:34:25Z)
A Statistical Framework for Ranking LLM-Based Chatbots [57.59268154690763]
We propose a statistical framework that incorporates key advancements to address specific challenges in pairwise comparison analysis. First, we introduce a factored tie model that enhances the ability to handle groupings of human-judged comparisons. Second, we extend the framework to model covariance tiers between competitors, enabling deeper insights into performance relationships. Third, we resolve optimization challenges arising from parameter non-uniqueness by introducing novel constraints.
arXiv Detail & Related papers (2024-12-24T12:54:19Z)
Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development [67.55944651679864]
We present a new sandbox suite tailored for integrated data-model co-development. This sandbox provides a feedback-driven experimental platform, enabling cost-effective and guided refinement of both data and models.
arXiv Detail & Related papers (2024-07-16T14:40:07Z)
Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
Model Share AI: An Integrated Toolkit for Collaborative Machine Learning Model Development, Provenance Tracking, and Deployment in Python [0.0]
We introduce Model Share AI (AIMS), an easy-to-use MLOps platform designed to streamline collaborative model development, model provenance tracking, and model deployment. AIMS features collaborative project spaces and a standardized model evaluation process that ranks model submissions based on their performance on unseen evaluation data. AIMS allows users to deploy ML models built in Scikit-Learn, Keras, PyTorch, and ONNX into live REST APIs and automatically generated web apps.
arXiv Detail & Related papers (2023-09-27T15:24:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.