MLModelScope: A Distributed Platform for Model Evaluation and
Benchmarking at Scale
- URL: http://arxiv.org/abs/2002.08295v1
- Date: Wed, 19 Feb 2020 17:13:01 GMT
- Title: MLModelScope: A Distributed Platform for Model Evaluation and
Benchmarking at Scale
- Authors: Abdul Dakkak, Cheng Li, Jinjun Xiong, Wen-mei Hwu
- Abstract summary: Machine Learning (ML) and Deep Learning (DL) innovations are being introduced at such a rapid pace that researchers are hard-pressed to analyze and study them.
The complicated procedures for evaluating innovations, along with the lack of standard and efficient ways of specifying and provisioning ML/DL evaluation, are a major "pain point" for the community.
This paper proposes MLModelScope, an open-source, framework/hardware agnostic, and customizable design that enables repeatable, fair, and scalable model evaluation and benchmarking.
- Score: 32.62513495487506
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine Learning (ML) and Deep Learning (DL) innovations are being introduced
at such a rapid pace that researchers are hard-pressed to analyze and study
them. The complicated procedures for evaluating innovations, along with the
lack of standard and efficient ways of specifying and provisioning ML/DL
evaluation, are a major "pain point" for the community. This paper proposes
MLModelScope, an open-source, framework/hardware agnostic, extensible and
customizable design that enables repeatable, fair, and scalable model
evaluation and benchmarking. We implement the distributed design with support
for all major frameworks and hardware, and equip it with web, command-line, and
library interfaces. To demonstrate MLModelScope's capabilities, we perform
parallel evaluation and show how subtle changes to the model evaluation
pipeline affect accuracy and how HW/SW stack choices affect performance.
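To make the abstract's mention of library/command-line interfaces and parallel evaluation more concrete, the following is a minimal sketch of how such an evaluation sweep could be driven from Python. Everything in it is hypothetical: the `Stack` description, the `evaluate()` stub, and the model/dataset names are illustrative placeholders, not the actual MLModelScope API.

```python
# Hypothetical sketch only: evaluate() is an illustrative stand-in,
# not the real MLModelScope client interface.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass(frozen=True)
class Stack:
    framework: str   # e.g. "TensorFlow" or "MXNet"
    hardware: str    # e.g. "V100" or "CPU"

def evaluate(model: str, dataset: str, stack: Stack) -> dict:
    # Stand-in for a request to a remote evaluation agent running on the
    # target HW/SW stack; a real client would return measured accuracy and
    # latency. Returning a fixed record keeps the sketch runnable end to end.
    return {"model": model, "dataset": dataset, "stack": stack,
            "top1": None, "latency_ms": None}

stacks = [Stack("TensorFlow", "V100"),
          Stack("MXNet", "V100"),
          Stack("TensorFlow", "CPU")]

# Fan the same (model, dataset) pair out over every stack in parallel, so that
# any difference in the reported metrics can be attributed to the stack itself
# rather than to the evaluation procedure.
with ThreadPoolExecutor(max_workers=len(stacks)) as pool:
    futures = {s: pool.submit(evaluate, "ResNet50_v1", "ImageNet-val", s)
               for s in stacks}
    results = {s: f.result() for s, f in futures.items()}

for stack, record in results.items():
    print(stack, record)
```

The point of the sketch is only the workflow shape: one model/dataset specification fanned out over explicit HW/SW stack descriptions, matching the paper's goal of attributing accuracy and performance differences to the stack rather than to the evaluation procedure.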
Related papers
- Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset.
We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6.
Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z)
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia.
In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models.
This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
- TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts [13.478250875892414]
Multimodal large language models (MLLMs) have received much attention for their impressive capabilities.
This paper analyzes how existing benchmarks can underestimate MLLMs' potential and introduces a new evaluation framework named TP-Eval.
TP-Eval rewrites the original prompts into customized prompts for different models.
arXiv Detail & Related papers (2024-10-23T17:54:43Z)
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs [74.1976921342982]
This paper introduces UltraEval, a user-friendly evaluation framework characterized by its lightweight nature, comprehensiveness, modularity, and efficiency.
The resulting composability allows for the free combination of different models, tasks, prompts, benchmarks, and metrics within a unified evaluation workflow.
arXiv Detail & Related papers (2024-04-11T09:17:12Z)
- SWITCH: An Exemplar for Evaluating Self-Adaptive ML-Enabled Systems [1.2277343096128712]
Addressing uncertainties in Machine Learning-Enabled Systems (MLS) is crucial for maintaining Quality of Service (QoS).
The Machine Learning Model Balancer is a concept that addresses these uncertainties by facilitating dynamic ML model switching.
This paper introduces SWITCH, an exemplar developed to enhance self-adaptive capabilities in such systems.
arXiv Detail & Related papers (2024-02-09T11:56:44Z)
- Model Share AI: An Integrated Toolkit for Collaborative Machine Learning Model Development, Provenance Tracking, and Deployment in Python [0.0]
We introduce Model Share AI (AIMS), an easy-to-use MLOps platform designed to streamline collaborative model development, model provenance tracking, and model deployment.
AIMS features collaborative project spaces and a standardized model evaluation process that ranks model submissions based on their performance on unseen evaluation data.
AIMS allows users to deploy ML models built in Scikit-Learn, Keras, PyTorch, and ONNX into live REST APIs and automatically generated web apps.
arXiv Detail & Related papers (2023-09-27T15:24:39Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, identifies discrepancies between a model's expected responses and its intrinsic generation capability.
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
- Evaluating Representations with Readout Model Switching [19.907607374144167]
In this paper, we propose to use the Minimum Description Length (MDL) principle to devise an evaluation metric.
We design a hybrid discrete and continuous-valued model space for the readout models and employ a switching strategy to combine their predictions.
The proposed metric can be efficiently computed with an online method and we present results for pre-trained vision encoders of various architectures.
arXiv Detail & Related papers (2023-02-19T14:08:01Z)
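To give a feel for the prequential, online flavor of this MDL-style evaluation, here is a heavily simplified sketch: the description length of a toy representation is accumulated example by example under a Bayesian mixture of two readout predictors. The toy features, the two particular readouts, and the plain Bayesian mixture (rather than the paper's switching strategy over a hybrid discrete/continuous readout space) are all simplifying assumptions made for illustration.

```python
# Simplified sketch of prequential (online) MDL evaluation of a frozen
# representation with a mixture of readout models. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

# Toy "representation": 2-D features for a 3-class problem.
n, d, k = 3000, 2, 3
y = rng.integers(0, k, size=n)
z = rng.normal(size=(n, d)) + 2.5 * np.eye(k)[y][:, :d]  # class-dependent shift

class MarginalReadout:
    """Predicts from running class frequencies only (ignores the features)."""
    def __init__(self, k):
        self.counts = np.ones(k)               # Laplace-smoothed counts
    def prob(self, z_t):
        return self.counts / self.counts.sum()
    def update(self, z_t, y_t):
        self.counts[y_t] += 1

class CentroidReadout:
    """Nearest-centroid readout with a softmax over negative distances."""
    def __init__(self, k, d):
        self.mu = np.zeros((k, d))
        self.n = np.ones(k)
    def prob(self, z_t):
        logits = -np.sum((self.mu - z_t) ** 2, axis=1)
        e = np.exp(logits - logits.max())
        return 0.99 * e / e.sum() + 0.01 / len(e)   # smooth to avoid log(0)
    def update(self, z_t, y_t):
        self.n[y_t] += 1
        self.mu[y_t] += (z_t - self.mu[y_t]) / self.n[y_t]

readouts = [MarginalReadout(k), CentroidReadout(k, d)]
log_w = np.zeros(len(readouts))                 # uniform prior over readouts
code_length_bits = 0.0

for t in range(n):
    probs = np.array([r.prob(z[t])[y[t]] for r in readouts])
    w = np.exp(log_w - log_w.max()); w /= w.sum()
    mix = float(w @ probs)                      # mixture probability of true label
    code_length_bits += -np.log2(mix)           # prequential code length
    log_w += np.log(probs)                      # Bayesian mixture weight update
    for r in readouts:
        r.update(z[t], y[t])                    # reveal the label, then update

print(f"description length: {code_length_bits:.1f} bits "
      f"({code_length_bits / n:.3f} bits/example)")
```

A lower accumulated description length indicates a representation from which the labels are easier to predict online; the cited paper applies the same idea to pre-trained vision encoders with a much richer readout family and an explicit switching strategy.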