MLHarness: A Scalable Benchmarking System for MLCommons
- URL: http://arxiv.org/abs/2111.05231v1
- Date: Tue, 9 Nov 2021 16:11:49 GMT
- Title: MLHarness: A Scalable Benchmarking System for MLCommons
- Authors: Yen-Hsiang Chang, Jianhao Pu, Wen-mei Hwu, Jinjun Xiong
- Abstract summary: We propose a scalable benchmarking harness system for MLCommons Inference.
It codifies the standard benchmark process as defined by MLCommons Inference.
It provides an easy and declarative approach for model developers to contribute their models and datasets to MLCommons Inference.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With society's growing adoption of machine learning (ML) and deep
learning (DL) for various intelligent solutions, it becomes increasingly
imperative to standardize a common set of measures for ML/DL models with large
scale open datasets under common development practices and resources so that
people can benchmark and compare model quality and performance on a common
ground. MLCommons has emerged recently as a driving force from both industry
and academia to orchestrate such an effort. Despite its wide adoption as
standardized benchmarks, MLCommons Inference has only included a limited number
of ML/DL models (in fact seven models in total). This significantly limits the
generality of MLCommons Inference's benchmarking results because there are many
more novel ML/DL models from the research community, solving a wide range of
problems with different input and output modalities. To address such a
limitation, we propose MLHarness, a scalable benchmarking harness system for
MLCommons Inference with three distinctive features: (1) it codifies the
standard benchmark process as defined by MLCommons Inference including the
models, datasets, DL frameworks, and software and hardware systems; (2) it
provides an easy and declarative approach for model developers to contribute
their models and datasets to MLCommons Inference; and (3) it includes the
support of a wide range of models with varying input/output modalities so
that we can scalably benchmark these models across different datasets,
frameworks, and hardware systems. This harness system is developed on top of
the MLModelScope system, and will be open sourced to the community. Our
experimental results demonstrate the superior flexibility and scalability of
this harness system for MLCommons Inference benchmarking.
Related papers
- Performance Law of Large Language Models [58.32539851241063]
Performance law can be used to guide the choice of LLM architecture and the effective allocation of computational resources without extensive experiments.
arXiv Detail & Related papers (2024-08-19T11:09:12Z) - xGen-MM (BLIP-3): A Family of Open Large Multimodal Models [157.44696790158784]
This report introduces xGen-MM, a framework for developing Large Multimodal Models (LMMs).
The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs.
Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks.
arXiv Detail & Related papers (2024-08-16T17:57:01Z) - A Large-Scale Study of Model Integration in ML-Enabled Software Systems [4.776073133338119]
Machine learning (ML) and its embedding in systems has drastically changed the engineering of software-intensive systems.
Traditionally, software engineering focuses on manually created artifacts such as source code and the process of creating them.
We present the first large-scale study of real ML-enabled software systems, covering over 2,928 open source systems on GitHub.
arXiv Detail & Related papers (2024-08-12T15:28:40Z) - Evaluating the Generalization Ability of Quantized LLMs: Benchmark, Analysis, and Toolbox [46.39670209441478]
Large language models (LLMs) have exhibited exciting progress in multiple scenarios.
As an effective means to reduce memory footprint and inference cost, quantization also faces challenges in performance degradation at low bit-widths.
This work provides a comprehensive benchmark suite for this research topic, including an evaluation system, detailed analyses, and a general toolbox.
arXiv Detail & Related papers (2024-06-15T12:02:14Z) - Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z) - Model Composition for Multimodal Large Language Models [71.5729418523411]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z) - ML-On-Rails: Safeguarding Machine Learning Models in Software Systems A Case Study [4.087995998278127]
We introduce ML-On-Rails, a protocol designed to safeguard machine learning models.
ML-On-Rails establishes a well-defined endpoint interface for different ML tasks, and clear communication between ML providers and ML consumers.
We evaluate the protocol through a real-world case study of the MoveReminder application.
arXiv Detail & Related papers (2024-01-12T11:27:15Z) - ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models [49.48109472893714]
Multimodal Large Language Models (MLLMs) have shown impressive abilities in interacting with visual content with myriad potential downstream tasks.
We present the first Comprehensive Evaluation Framework (ChEF) that can holistically profile each MLLM and fairly compare different MLLMs.
We will publicly release all the detailed implementations for further analysis, as well as an easy-to-use modular toolkit for the integration of new recipes and models.
arXiv Detail & Related papers (2023-11-05T16:01:40Z) - Counterfactual Explanations for Machine Learning on Multivariate Time Series Data [0.9274371635733836]
This paper proposes a novel explainability technique for providing counterfactual explanations for supervised machine learning frameworks.
The proposed method outperforms state-of-the-art explainability methods on several different ML frameworks and data sets in metrics such as faithfulness and robustness.
arXiv Detail & Related papers (2020-08-25T02:04:59Z) - MLModelScope: A Distributed Platform for Model Evaluation and Benchmarking at Scale [32.62513495487506]
Machine Learning (ML) and Deep Learning (DL) innovations are being introduced at such a rapid pace that researchers are hard-pressed to analyze and study them.
The complicated procedures for evaluating innovations, along with the lack of standard and efficient ways of specifying and provisioning ML/DL evaluation, are a major "pain point" for the community.
This paper proposes MLModelScope, an open-source, framework- and hardware-agnostic, customizable design that enables repeatable, fair, and scalable model evaluation and benchmarking.
arXiv Detail & Related papers (2020-02-19T17:13:01Z)