Mystique: Enabling Accurate and Scalable Generation of Production AI
Benchmarks
- URL: http://arxiv.org/abs/2301.04122v3
- Date: Tue, 11 Apr 2023 18:16:38 GMT
- Title: Mystique: Enabling Accurate and Scalable Generation of Production AI
Benchmarks
- Authors: Mingyu Liang, Wenyin Fu, Louis Feng, Zhongyi Lin, Pavani Panakanti,
Shengbao Zheng, Srinivas Sridharan, Christina Delimitrou
- Abstract summary: Mystique is an accurate and scalable framework for production AI benchmark generation.
Mystique is scalable, owing to its lightweight data collection, in terms of runtime overhead and instrumentation effort.
We evaluate our methodology on several production AI models, and show that benchmarks generated with Mystique closely resemble original AI models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building large AI fleets to support the rapidly growing DL workloads is an
active research topic for modern cloud providers. Generating accurate
benchmarks plays an essential role in designing the fast-paced software and
hardware solutions in this space. Two fundamental challenges to make this
scalable are (i) workload representativeness and (ii) the ability to quickly
incorporate changes to the fleet into the benchmarks.
To overcome these issues, we propose Mystique, an accurate and scalable
framework for production AI benchmark generation. It leverages the PyTorch
execution trace (ET), a new feature that captures the runtime information of AI
models at the granularity of operators, in a graph format, together with their
metadata. By sourcing fleet ETs, we can build AI benchmarks that are portable
and representative. Mystique is scalable, due to its lightweight data
collection, in terms of runtime overhead and instrumentation effort. It is also
adaptive because ET composability allows flexible control on benchmark
creation.
We evaluate our methodology on several production AI models, and show that
benchmarks generated with Mystique closely resemble original AI models, both in
execution time and system-level metrics. We also showcase the portability of
the generated benchmarks across platforms, and demonstrate several use cases
enabled by the fine-grained composability of the execution trace.
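
As a concrete illustration of the collection step the abstract describes, the sketch below records an operator-level PyTorch execution trace with torch.profiler.ExecutionTraceObserver, which in recent PyTorch releases exposes the ET feature the paper builds on. The toy model, output path, and iteration count are placeholders, and the observer API and JSON schema can vary across PyTorch versions; this is a minimal sketch of ET collection under those assumptions, not Mystique's own benchmark-generation tooling.

```python
import json

import torch
import torch.nn as nn
from torch.profiler import ExecutionTraceObserver  # available in recent PyTorch releases

# A toy model stands in for a production workload.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
x = torch.randn(32, 128)

# Register an observer that dumps the operator-level execution trace (ET)
# to a JSON file, then capture a few forward + backward iterations.
et = ExecutionTraceObserver()
et.register_callback("pytorch_et.json")  # output path is illustrative
et.start()
for _ in range(3):
    model(x).sum().backward()
et.stop()
et.unregister_callback()

# The ET is a graph of operator nodes plus metadata (shapes, types, dependencies).
# The "nodes" key follows the ET JSON layout in current releases, but the exact
# schema may differ across PyTorch versions.
with open("pytorch_et.json") as f:
    trace = json.load(f)
print(f"captured {len(trace.get('nodes', []))} trace nodes")
```

Mystique's contribution is what happens after this step: replaying and composing such fleet-collected traces into portable benchmarks whose execution time and system-level metrics track the original models.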
Related papers
- BENCHAGENTS: Automated Benchmark Creation with Agent Interaction [16.4783894348333]
We introduce BENCHAGENTS, a framework that methodically leverages large language models (LLMs) to automate benchmark creation for complex capabilities.
We use BENCHAGENTS to create benchmarks to evaluate capabilities related to planning and constraint satisfaction during text generation.
We then use these benchmarks to study seven state-of-the-art models and extract new insights on common failure modes and model differences.
arXiv Detail & Related papers (2024-10-29T22:56:18Z)
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models [81.27391252152199]
Large language models (LLMs) have achieved impressive performance across various natural language benchmarks.
We propose to automate dataset updating and provide systematic analysis regarding its effectiveness.
There are two updating strategies: 1) mimicking strategy to generate similar samples based on original data, and 2) extending strategy that further expands existing samples.
arXiv Detail & Related papers (2024-02-19T07:15:59Z)
- Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
- A Benchmark Generative Probabilistic Model for Weak Supervised Learning [2.0257616108612373]
Weakly supervised learning approaches have been developed to alleviate the annotation burden.
We show that probabilistic latent variable models (PLVMs) achieve state-of-the-art performance across four datasets.
arXiv Detail & Related papers (2023-03-31T07:06:24Z)
- MONAI Label: A framework for AI-assisted Interactive Labeling of 3D Medical Images [49.664220687980006]
The lack of annotated datasets is a major bottleneck for training new task-specific supervised machine learning models.
We present MONAI Label, a free and open-source framework that facilitates the development of applications based on artificial intelligence (AI) models.
arXiv Detail & Related papers (2022-03-23T12:33:11Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Basic cross-platform tensor frameworks and script language engines alone, however, do not supply the procedures and pipelines needed for the actual deployment of machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using such basic cross-platform tensor frameworks and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Dynabench: Rethinking Benchmarking in NLP [82.26699038776812]
We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking.
Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation.
We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform.
arXiv Detail & Related papers (2021-04-07T17:49:17Z)
- AIPerf: Automated machine learning as an AI-HPC benchmark [17.57686674304368]
We propose an end-to-end benchmark suite utilizing automated machine learning (AutoML).
We implement the algorithms in a highly parallel and flexible way to ensure the efficiency and optimization potential on diverse systems.
With a flexible workload and a single metric, our benchmark can scale and rank AI-HPC systems easily.
arXiv Detail & Related papers (2020-08-17T08:06:43Z)
- AIBench Training: Balanced Industry-Standard AI Training Benchmarking [26.820244556465333]
Earlier-stage evaluations of a new AI architecture/system need affordable benchmarks.
We use real-world benchmarks to cover the factor space that impacts the learning dynamics.
We contribute by far the most comprehensive AI training benchmark suite.
arXiv Detail & Related papers (2020-04-30T11:08:49Z)
- AIBench: An Agile Domain-specific Benchmarking Methodology and an AI Benchmark Suite [26.820244556465333]
This paper proposes an agile domain-specific benchmarking methodology.
We identify ten important end-to-end application scenarios, among which sixteen representative AI tasks are distilled as the AI component benchmarks.
We present the first end-to-end Internet service AI benchmark.
arXiv Detail & Related papers (2020-02-17T07:29:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.