AIBench Scenario: Scenario-distilling AI Benchmarking
- URL: http://arxiv.org/abs/2005.03459v4
- Date: Sun, 5 Sep 2021 13:44:38 GMT
- Title: AIBench Scenario: Scenario-distilling AI Benchmarking
- Authors: Wanling Gao, Fei Tang, Jianfeng Zhan, Xu Wen, Lei Wang, Zheng Cao,
Chuanxin Lan, Chunjie Luo, Xiaoli Liu, Zihan Jiang
- Abstract summary: We formalize a real-world application scenario as a Directed Acyclic Graph-based model.
We propose rules to distill it into a permutation of essential AI and non-AI tasks, which we call a scenario benchmark.
We implement two Internet service AI scenario benchmarks based on the framework as proxies to two real-world application scenarios.
- Score: 8.909947747424672
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern real-world application scenarios like Internet services consist of a
diversity of AI and non-AI modules with huge code sizes and long, complicated
execution paths, which raises serious benchmarking and evaluation challenges.
Using AI component or micro benchmarks alone can lead to error-prone
conclusions. This paper presents a methodology to address this challenge. We
formalize a real-world application scenario as a Directed Acyclic Graph-based
model and propose rules to distill it into a permutation of essential AI and
non-AI tasks, which we call a scenario benchmark. Together with seventeen
industry partners, we extract nine typical scenario benchmarks. We design and
implement an extensible, configurable, and flexible benchmark framework. We
implement two Internet service AI scenario benchmarks based on the framework
as proxies to two real-world application scenarios. We consider scenario,
component, and micro benchmarks as three indispensable parts of evaluation.
Our evaluation shows the advantage of our methodology over using component or
micro AI benchmarks alone. The specifications, source code, testbed, and
results are publicly available at
https://www.benchcouncil.org/aibench/scenario/.
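To make the DAG-based formalization concrete, below is a minimal Python sketch. It is not the AIBench implementation; the task names and graph structure are hypothetical. It models a simplified Internet-service scenario as a DAG of AI and non-AI tasks and distills it into one valid permutation of tasks via topological ordering.

    # Illustrative sketch only (not AIBench code): a scenario modeled as a
    # DAG of AI and non-AI tasks, distilled into a permutation of tasks.
    from graphlib import TopologicalSorter  # Python 3.9+ standard library

    # Hypothetical dependency map for a simplified Internet service:
    # each task maps to the set of tasks it depends on.
    scenario_dag = {
        "query_parsing":        set(),                     # non-AI task
        "recommendation_model": {"query_parsing"},         # AI task
        "ranking_model":        {"recommendation_model"},  # AI task
        "result_rendering":     {"ranking_model"},         # non-AI task
    }

    # A topological order of the DAG is one valid permutation of the
    # essential tasks, i.e. a candidate scenario-benchmark execution path.
    permutation = list(TopologicalSorter(scenario_dag).static_order())
    print(permutation)
    # ['query_parsing', 'recommendation_model', 'ranking_model',
    #  'result_rendering']

Here a plain topological sort stands in for the distillation step; the paper's actual rules for reducing a scenario DAG to a scenario benchmark are specified in the publicly released framework.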
Related papers
- UniTTA: Unified Benchmark and Versatile Framework Towards Realistic Test-Time Adaptation [66.05528698010697]
Test-Time Adaptation aims to adapt pre-trained models to the target domain during testing.
Researchers have identified various challenging scenarios and developed diverse methods to address these challenges.
We propose a Unified Test-Time Adaptation benchmark, which is comprehensive and widely applicable.
arXiv Detail & Related papers (2024-07-29T15:04:53Z)
- Alice Benchmarks: Connecting Real World Re-Identification with the Synthetic [92.02220105679713]
We introduce the Alice benchmarks, large-scale datasets providing benchmarks and evaluation protocols to the research community.
Within the Alice benchmarks, two object re-ID tasks are offered: person and vehicle re-ID.
As an important feature of our real target domain, the clusterability of its training set is not manually guaranteed, which makes it closer to a real domain adaptation test scenario.
arXiv Detail & Related papers (2023-10-06T17:58:26Z)
- Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called MEX.
MEX integrates estimation and planning components while automatically balancing exploration and exploitation.
It outperforms baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
- UMSE: Unified Multi-scenario Summarization Evaluation [52.60867881867428]
Summarization quality evaluation is a non-trivial task in text summarization.
We propose the Unified Multi-scenario Summarization Evaluation Model (UMSE).
Our UMSE is the first unified summarization evaluation framework that can be applied across three evaluation scenarios.
arXiv Detail & Related papers (2023-05-26T12:54:44Z) - Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction [67.54420015049732]
Aspect Sentiment Triplet Extraction (ASTE) is a challenging task in sentiment analysis, aiming to provide fine-grained insights into human sentiments.
Existing benchmarks are limited to two domains and do not evaluate model performance on unseen domains.
We introduce a domain-expanded benchmark by annotating samples from diverse domains, enabling evaluation of models in both in-domain and out-of-domain settings.
arXiv Detail & Related papers (2023-05-23T18:01:49Z) - Mystique: Enabling Accurate and Scalable Generation of Production AI
Benchmarks [2.0315147707806283]
Mystique is an accurate and scalable framework for production AI benchmark generation.
Mystique is scalable thanks to its lightweight data collection, both in runtime overhead and in instrumentation effort.
We evaluate our methodology on several production AI models, and show that benchmarks generated with Mystique closely resemble original AI models.
arXiv Detail & Related papers (2022-12-16T18:46:37Z) - From Multi-agent to Multi-robot: A Scalable Training and Evaluation
Platform for Multi-robot Reinforcement Learning [12.74238738538799]
Multi-agent reinforcement learning (MARL) has been gaining extensive attention from academia and industries in the past few decades.
However, it remains unknown how these methods perform in real-world scenarios, especially in multi-robot systems.
This paper introduces a scalable emulation platform for multi-robot reinforcement learning (MRRL) called SMART to meet this need.
arXiv Detail & Related papers (2022-06-20T06:36:45Z) - Addressing the IEEE AV Test Challenge with Scenic and VerifAI [10.221093591444731]
This paper summarizes our formal approach to testing autonomous vehicles (AVs) in simulation for the IEEE AV Test Challenge.
We demonstrate a systematic testing framework leveraging our previous work on formally-driven simulation for intelligent cyber-physical systems.
arXiv Detail & Related papers (2021-08-20T04:51:27Z) - AIBench Training: Balanced Industry-Standard AI Training Benchmarking [26.820244556465333]
Earlier-stage evaluations of a new AI architecture/system need affordable benchmarks.
We use real-world benchmarks to cover the factor space that impacts learning dynamics.
We contribute by far the most comprehensive AI training benchmark suite.
arXiv Detail & Related papers (2020-04-30T11:08:49Z) - Learning Meta Face Recognition in Unseen Domains [74.69681594452125]
We propose a novel face recognition method via meta-learning, named Meta Face Recognition (MFR).
MFR synthesizes the source/target domain shift with a meta-optimization objective.
We propose two benchmarks for generalized face recognition evaluation.
arXiv Detail & Related papers (2020-03-17T14:10:30Z) - AIBench: An Agile Domain-specific Benchmarking Methodology and an AI
Benchmark Suite [26.820244556465333]
This paper proposes an agile domain-specific benchmarking methodology.
We identify ten important end-to-end application scenarios, among which sixteen representative AI tasks are distilled as the AI component benchmarks.
We present the first end-to-end Internet service AI benchmark.
This list is automatically generated from the titles and abstracts of the papers on this site.