Related papers: Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks

Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks

URL: http://arxiv.org/abs/2301.04122v3
Date: Tue, 11 Apr 2023 18:16:38 GMT
Title: Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks
Authors: Mingyu Liang, Wenyin Fu, Louis Feng, Zhongyi Lin, Pavani Panakanti, Shengbao Zheng, Srinivas Sridharan, Christina Delimitrou
Abstract summary: Mystique is an accurate and scalable framework for production AI benchmark generation. Mystique is scalable, due to its lightweight data collection, in terms of overhead runtime and instrumentation effort. We evaluate our methodology on several production AI models, and show that benchmarks generated with Mystique closely resemble original AI models.
Score: 2.0315147707806283
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Building large AI fleets to support the rapidly growing DL workloads is an active research topic for modern cloud providers. Generating accurate benchmarks plays an essential role in designing the fast-paced software and hardware solutions in this space. Two fundamental challenges to make this scalable are (i) workload representativeness and (ii) the ability to quickly incorporate changes to the fleet into the benchmarks. To overcome these issues, we propose Mystique, an accurate and scalable framework for production AI benchmark generation. It leverages the PyTorch execution trace (ET), a new feature that captures the runtime information of AI models at the granularity of operators, in a graph format, together with their metadata. By sourcing fleet ETs, we can build AI benchmarks that are portable and representative. Mystique is scalable, due to its lightweight data collection, in terms of runtime overhead and instrumentation effort. It is also adaptive because ET composability allows flexible control on benchmark creation. We evaluate our methodology on several production AI models, and show that benchmarks generated with Mystique closely resemble original AI models, both in execution time and system-level metrics. We also showcase the portability of the generated benchmarks across platforms, and demonstrate several use cases enabled by the fine-grained composability of the execution trace.

Related papers

TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents [17.296425855109426]
We introduce TimeSeriesGym, a scalable benchmarking framework for evaluating Artificial Intelligence (AI) agents.<n>TimeSeriesGym incorporates challenges from diverse sources spanning multiple domains and tasks.<n>We implement evaluation mechanisms for multiple research artifacts, including submission files, code, and models.
arXiv Detail & Related papers (2025-05-19T16:11:23Z)
OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents [0.0]
OSUniverse is a benchmark of complex, multimodal desktop-oriented tasks for advanced GUI-navigation AI agents.<n>We divide the tasks in increasing levels of complexity, from basic precision clicking to multistep, multiapplication tests requiring dexterity, precision, and clear thinking from the agent.<n>The benchmark can be scored manually, but we also introduce an automated validation mechanism that has an average error rate less than 2%.
arXiv Detail & Related papers (2025-05-06T14:29:47Z)
General Scales Unlock AI Evaluation with Explanatory and Predictive Power [57.7995945974989]
benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems. We introduce general scales for AI evaluation that can explain what common AI benchmarks really measure. Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate.
arXiv Detail & Related papers (2025-03-09T01:13:56Z)
Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis [55.390060529534644]
We propose OS-Genesis, a novel data synthesis pipeline for Graphical User Interface (GUI) agents. Instead of relying on pre-defined tasks, OS-Genesis enables agents first to perceive environments and perform step-wise interactions. We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks.
arXiv Detail & Related papers (2024-12-27T16:21:58Z)
BENCHAGENTS: Automated Benchmark Creation with Agent Interaction [16.4783894348333]
We introduce BENCHAGENTS, a framework that methodically leverages large language models (LLMs) to automate benchmark creation for complex capabilities. We use BENCHAGENTS to create benchmarks to evaluate capabilities related to planning and constraint satisfaction during text generation. We then use these benchmarks to study seven state-of-the-art models and extract new insights on common failure modes and model differences.
arXiv Detail & Related papers (2024-10-29T22:56:18Z)
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities. In-Context Learning (ICL) and. Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting. LLMs to downstream tasks. We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models [81.27391252152199]
Large language models (LLMs) have achieved impressive performance across various natural language benchmarks. We propose to automate dataset updating and provide systematic analysis regarding its effectiveness. There are two updating strategies: 1) mimicking strategy to generate similar samples based on original data, and 2) extending strategy that further expands existing samples.
arXiv Detail & Related papers (2024-02-19T07:15:59Z)
Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing. LLMs are extremely computationally expensive, even at inference time. We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
A Benchmark Generative Probabilistic Model for Weak Supervised Learning [2.0257616108612373]
Weak Supervised Learning approaches have been developed to alleviate the annotation burden. We show that latent variable models (PLVMs) achieve state-of-the-art performance across four datasets.
arXiv Detail & Related papers (2023-03-31T07:06:24Z)
MONAI Label: A framework for AI-assisted Interactive Labeling of 3D Medical Images [49.664220687980006]
The lack of annotated datasets is a major bottleneck for training new task-specific supervised machine learning models. We present MONAI Label, a free and open-source framework that facilitates the development of applications based on artificial intelligence (AI) models.
arXiv Detail & Related papers (2022-03-23T12:33:11Z)
SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines. This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
AIPerf: Automated machine learning as an AI-HPC benchmark [17.57686674304368]
We propose an end-to-end benchmark suite utilizing automated machine learning (AutoML) We implement the algorithms in a highly parallel and flexible way to ensure the efficiency and optimization potential on diverse systems. With flexible workload and single metric, our benchmark can scale and rank AI- HPC easily.
arXiv Detail & Related papers (2020-08-17T08:06:43Z)
AIBench Training: Balanced Industry-Standard AI Training Benchmarking [26.820244556465333]
Earlier-stage evaluations of a new AI architecture/system need affordable benchmarks. We use real-world benchmarks to cover the factors space that impacts the learning dynamics. We contribute by far the most comprehensive AI training benchmark suite.
arXiv Detail & Related papers (2020-04-30T11:08:49Z)
AIBench: An Agile Domain-specific Benchmarking Methodology and an AI Benchmark Suite [26.820244556465333]
This paper proposes an agile domain-specific benchmarking methodology. We identify ten important end-to-end application scenarios, among which sixteen representative AI tasks are distilled as the AI component benchmarks. We present the first end-to-end Internet service AI benchmark.
arXiv Detail & Related papers (2020-02-17T07:29:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.