MLHarness: A Scalable Benchmarking System for MLCommons
- URL: http://arxiv.org/abs/2111.05231v1
- Date: Tue, 9 Nov 2021 16:11:49 GMT
- Title: MLHarness: A Scalable Benchmarking System for MLCommons
- Authors: Yen-Hsiang Chang, Jianhao Pu, Wen-mei Hwu, Jinjun Xiong
- Abstract summary: We propose a scalable benchmarking harness system for MLCommons Inference.
It codifies the standard benchmark process as defined by MLCommons Inference.
It provides an easy and declarative approach for model developers to contribute their models and datasets to MLCommons Inference.
- Score: 16.490366217665205
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With society's growing adoption of machine learning (ML) and deep learning (DL) for various intelligent solutions, it becomes increasingly imperative to standardize a common set of measures for ML/DL models with large-scale open datasets under common development practices and resources, so that people can benchmark and compare model quality and performance on a common ground. MLCommons has emerged recently as a driving force from both industry and academia to orchestrate such an effort. Despite its wide adoption as a standardized benchmark, MLCommons Inference has included only a limited number of ML/DL models (in fact, seven models in total). This significantly limits the generality of MLCommons Inference's benchmarking results, because there are many more novel ML/DL models from the research community, solving a wide range of problems with different input and output modalities. To address this limitation, we propose MLHarness, a scalable benchmarking harness system for MLCommons Inference with three distinctive features: (1) it codifies the standard benchmark process as defined by MLCommons Inference, including the models, datasets, DL frameworks, and software and hardware systems; (2) it provides an easy and declarative approach for model developers to contribute their models and datasets to MLCommons Inference; and (3) it supports a wide range of models with varying input/output modalities, so that these models can be benchmarked scalably across different datasets, frameworks, and hardware systems. This harness system is developed on top of the MLModelScope system and will be open-sourced to the community. Our experimental results demonstrate the superior flexibility and scalability of this harness system for MLCommons Inference benchmarking.
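To make feature (2) more concrete, the sketch below illustrates what a declarative model contribution and a minimal benchmark driver could look like in Python. The abstract does not specify MLHarness's actual manifest schema or API, so every field name and the run_benchmark helper here are illustrative assumptions rather than the harness's real interface.

```python
# Hypothetical sketch of a declarative model manifest and benchmark driver.
# The field names and the run_benchmark helper are illustrative only; they
# are NOT the actual MLHarness API, which is not described in the abstract.

model_manifest = {
    "name": "resnet50_onnx",           # model identifier chosen by the contributor
    "framework": "onnxruntime",        # DL framework used to run the model
    "weights_url": "https://example.com/resnet50.onnx",  # placeholder location
    "dataset": "imagenet2012_val",     # dataset the model is benchmarked against
    "preprocess": "imagenet_resize_224_normalize",        # named preprocessing step
    "postprocess": "argmax_top1",      # named postprocessing step
    "modality": {"input": "image", "output": "classification"},
}


def run_benchmark(manifest: dict, scenario: str = "SingleStream") -> dict:
    """Illustrative driver: validate a manifest and return a placeholder result.

    A real harness would load the model in the declared framework, feed it the
    declared dataset under an MLCommons Inference scenario (e.g. SingleStream
    or Offline), and collect latency and accuracy metrics.
    """
    required = {"name", "framework", "weights_url", "dataset", "modality"}
    missing = required - manifest.keys()
    if missing:
        raise ValueError(f"manifest missing fields: {sorted(missing)}")
    # Placeholder result; real numbers would come from executing the model.
    return {"model": manifest["name"], "scenario": scenario, "status": "validated"}


if __name__ == "__main__":
    print(run_benchmark(model_manifest))
```

Since the abstract states that MLHarness is built on top of MLModelScope, a real contribution would plausibly declare similar metadata (framework, weights, dataset, pre/post-processing) and let the harness handle scenario execution and metric collection.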
Related papers
- Performance Law of Large Language Models [58.32539851241063]
Performance law can be used to guide the choice of LLM architecture and the effective allocation of computational resources without extensive experiments.
arXiv Detail & Related papers (2024-08-19T11:09:12Z) - xGen-MM (BLIP-3): A Family of Open Large Multimodal Models [157.44696790158784]
This report introduces xGen-MM, a framework for developing Large Multimodal Models (LMMs).
The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs.
Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks.
arXiv Detail & Related papers (2024-08-16T17:57:01Z) - A Large-Scale Study of Model Integration in ML-Enabled Software Systems [4.776073133338119]
Machine learning (ML) and its embedding in systems have drastically changed the engineering of software-intensive systems.
Traditionally, software engineering focuses on manually created artifacts such as source code and the process of creating them.
We present the first large-scale study of real ML-enabled software systems, covering over 2,928 open source systems on GitHub.
arXiv Detail & Related papers (2024-08-12T15:28:40Z) - Evaluating the Generalization Ability of Quantized LLMs: Benchmark, Analysis, and Toolbox [46.39670209441478]
Large language models (LLMs) have exhibited exciting progress in multiple scenarios.
While quantization is an effective means to reduce memory footprint and inference cost, it faces challenges of performance degradation at low bit-widths.
This work provides a comprehensive benchmark suite for this research topic, including an evaluation system, detailed analyses, and a general toolbox.
arXiv Detail & Related papers (2024-06-15T12:02:14Z) - MMRel: A Relation Understanding Dataset and Benchmark in the MLLM Era [72.95901753186227]
Multi-Modal Relation Understanding (MMRel) is a comprehensive dataset for studying inter-object relations with Multi-modal Large Language Models (MLLMs).
MMRel features three distinctive attributes: (i) it includes over 15K question-answer pairs sourced from three distinct domains, ensuring large scale and high diversity; (ii) it contains a subset of highly unusual relations on which MLLMs often fail due to hallucinations, making it very challenging; and (iii) it provides manually verified, high-quality labels for inter-object relations.
arXiv Detail & Related papers (2024-06-13T13:51:59Z) - Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z) - Model Composition for Multimodal Large Language Models [71.5729418523411]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z) - ML-On-Rails: Safeguarding Machine Learning Models in Software Systems - A Case Study [4.087995998278127]
We introduce ML-On-Rails, a protocol designed to safeguard machine learning models.
ML-On-Rails establishes a well-defined endpoint interface for different ML tasks, and clear communication between ML providers and ML consumers.
We evaluate the protocol through a real-world case study of the MoveReminder application.
arXiv Detail & Related papers (2024-01-12T11:27:15Z) - Counterfactual Explanations for Machine Learning on Multivariate Time
Series Data [0.9274371635733836]
This paper proposes a novel explainability technique for providing counterfactual explanations for supervised machine learning frameworks.
The proposed method outperforms state-of-the-art explainability methods on several different ML frameworks and data sets in metrics such as faithfulness and robustness.
arXiv Detail & Related papers (2020-08-25T02:04:59Z) - MLModelScope: A Distributed Platform for Model Evaluation and
Benchmarking at Scale [32.62513495487506]
Machine Learning (ML) and Deep Learning (DL) innovations are being introduced at such a rapid pace that researchers are hard-pressed to analyze and study them.
The complicated procedures for evaluating innovations, along with the lack of standard and efficient ways of specifying and provisioning ML/DL evaluation, is a major "pain point" for the community.
This paper proposes MLModelScope, an open-source, framework- and hardware-agnostic, and customizable design that enables repeatable, fair, and scalable model evaluation and benchmarking.
arXiv Detail & Related papers (2020-02-19T17:13:01Z)