Personalized Benchmarking with the Ludwig Benchmarking Toolkit
- URL: http://arxiv.org/abs/2111.04260v1
- Date: Mon, 8 Nov 2021 03:53:38 GMT
- Title: Personalized Benchmarking with the Ludwig Benchmarking Toolkit
- Authors: Avanika Narayan, Piero Molino, Karan Goel, Willie Neiswanger,
Christopher Ré (Department of Computer Science, Stanford University)
- Abstract summary: Ludwig Benchmarking Toolkit (LBT) is a personalized benchmarking toolkit for running end-to-end benchmark studies.
LBT provides an interface for controlling training and customizing evaluation, a standardized training framework for eliminating confounding variables, and support for multi-objective evaluation.
We show how LBT can be used to create personalized benchmark studies with a large-scale comparative analysis for text classification across 7 models and 9 datasets.
- Score: 12.347185532330919
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid proliferation of machine learning models across domains and
deployment settings has given rise to various communities (e.g. industry
practitioners) which seek to benchmark models across tasks and objectives of
personal value. Unfortunately, these users cannot use standard benchmark
results to perform such value-driven comparisons as traditional benchmarks
evaluate models on a single objective (e.g. average accuracy) and fail to
facilitate a standardized training framework that controls for confounding
variables (e.g. computational budget), making fair comparisons difficult. To
address these challenges, we introduce the open-source Ludwig Benchmarking
Toolkit (LBT), a personalized benchmarking toolkit for running end-to-end
benchmark studies (from hyperparameter optimization to evaluation) across an
easily extensible set of tasks, deep learning models, datasets and evaluation
metrics. LBT provides a configurable interface for controlling training and
customizing evaluation, a standardized training framework for eliminating
confounding variables, and support for multi-objective evaluation. We
demonstrate how LBT can be used to create personalized benchmark studies with a
large-scale comparative analysis for text classification across 7 models and 9
datasets. We explore the trade-offs between inference latency and performance,
relationships between dataset attributes and performance, and the effects of
pretraining on convergence and robustness, showing how LBT can be used to
satisfy various benchmarking objectives.
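To make the workflow concrete, below is a minimal sketch of the kind of study the abstract describes: every model is paired with every dataset under the same hyperparameter-optimization budget, and more than one evaluation objective is reported. The config keys and the run_study driver are illustrative assumptions, not LBT's actual interface.

```python
# Illustrative sketch only: hypothetical config and driver, not LBT's real API.
from itertools import product

STUDY_CONFIG = {
    "task": "text_classification",
    "models": ["bert", "distilbert", "rnn", "stacked_parallel_cnn"],
    "datasets": ["sst2", "agnews", "yelp_polarity"],
    # Fixed tuning budget so comparisons are not confounded by search effort.
    "hyperopt": {"search_algorithm": "random", "num_samples": 20,
                 "metric": "accuracy", "goal": "maximize"},
    # Multi-objective evaluation: report more than a single accuracy number.
    "eval_metrics": ["accuracy", "inference_latency_ms", "train_time_s"],
}

def run_study(config):
    """Hypothetical driver: every (model, dataset) pair gets the same
    hyperopt budget, then the best configuration is trained and evaluated."""
    results = []
    for model, dataset in product(config["models"], config["datasets"]):
        # Placeholder for: hyperopt -> train best config -> evaluate.
        results.append({"model": model, "dataset": dataset,
                        "metrics": {m: None for m in config["eval_metrics"]}})
    return results

if __name__ == "__main__":
    for row in run_study(STUDY_CONFIG):
        print(row["model"], row["dataset"], row["metrics"])
```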
Related papers
- A Comprehensive Benchmark of Machine and Deep Learning Across Diverse Tabular Datasets [0.6144680854063939]
We introduce a benchmark aimed at better characterizing types of datasets where Deep Learning models excel.
We evaluate 111 datasets with 20 different models, including both regression and classification tasks.
Building on the results of this benchmark, we train a model that predicts scenarios where DL models outperform alternative methods with 86.1% accuracy.
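As a rough illustration of the meta-model idea in this summary, the sketch below fits a classifier on per-dataset attributes to predict whether a deep model outperforms the alternatives. The meta-features and labels are synthetic placeholders; the paper's 86.1% figure comes from its own 111-dataset benchmark.

```python
# Minimal sketch with synthetic data; feature names are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# One row per benchmark dataset: e.g. [n_rows, n_features, frac_categorical].
meta_features = rng.random((40, 3))
dl_wins = (meta_features[:, 0] > 0.5).astype(int)  # toy label: 1 if DL won

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, meta_features, dl_wins, cv=5)
print("cross-validated accuracy:", scores.mean())
```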
arXiv Detail & Related papers (2024-08-27T06:58:52Z) - POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation [76.67608003501479]
We introduce and specify an evaluation protocol defining a range of domain-related metrics computed on the basis of the primary evaluation indicators.
The results of such a comparison, which involves a variety of state-of-the-art MARL, search-based, and hybrid methods, are presented.
arXiv Detail & Related papers (2024-07-20T16:37:21Z) - Quantifying Variance in Evaluation Benchmarks [34.12254884944099]
We measure variance in evaluation benchmarks, including seed variance across initialisations, and monotonicity during training.
We find that simple changes, such as framing choice tasks as completion tasks, can often reduce variance at smaller scales.
More involved methods inspired by the human-testing literature (such as item analysis and item response theory) struggle to meaningfully reduce variance.
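A minimal sketch of the seed-variance measurement described above, using synthetic benchmark scores rather than the paper's results:

```python
# Run the same benchmark under several training seeds and report per-task spread.
import numpy as np

rng = np.random.default_rng(0)
# rows = training seeds, columns = benchmark tasks (synthetic scores).
scores = 0.70 + 0.03 * rng.standard_normal((5, 4))

seed_mean = scores.mean(axis=0)
seed_std = scores.std(axis=0, ddof=1)  # seed variance per task
for task, (m, s) in enumerate(zip(seed_mean, seed_std)):
    print(f"task {task}: {m:.3f} +/- {s:.3f} across seeds")
```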
arXiv Detail & Related papers (2024-06-14T17:59:54Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that fine-grained evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z) - Variable Importance Matching for Causal Inference [73.25504313552516]
We describe a general framework called Model-to-Match that achieves these goals.
Model-to-Match uses variable importance measurements to construct a distance metric.
We operationalize the Model-to-Match framework with LASSO.
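A rough sketch of the weighting idea in this summary, assuming LASSO coefficient magnitudes serve as importance weights inside a matching distance; the data and nearest-neighbor matching rule are illustrative, not the paper's full algorithm:

```python
# Use LASSO coefficient magnitudes as variable-importance weights, then match
# each treated unit to its closest control under the weighted distance.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))        # covariates
treat = rng.integers(0, 2, size=200)     # treatment indicator
y = X[:, 0] + 0.5 * X[:, 1] + treat + 0.1 * rng.standard_normal(200)

# Variable importance from a sparse outcome model.
weights = np.abs(Lasso(alpha=0.05).fit(X, y).coef_)

def weighted_dist(a, b, w=weights):
    return np.sqrt(np.sum(w * (a - b) ** 2))

treated, controls = X[treat == 1], X[treat == 0]
matches = [int(np.argmin([weighted_dist(t, c) for c in controls])) for t in treated]
print("first five matched control indices:", matches[:5])
```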
arXiv Detail & Related papers (2023-02-23T00:43:03Z) - Dynaboard: An Evaluation-As-A-Service Platform for Holistic
Next-Generation Benchmarking [41.99715850562528]
We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison.
Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset.
arXiv Detail & Related papers (2021-05-21T01:17:52Z) - RADDLE: An Evaluation Benchmark and Analysis Platform for Robust
Task-oriented Dialog Systems [75.87418236410296]
We introduce the RADDLE benchmark, a collection of corpora and tools for evaluating the performance of models across a diverse set of domains.
RADDLE is designed to favor and encourage models with a strong generalization ability.
We evaluate recent state-of-the-art systems based on pre-training and fine-tuning, and find that grounded pre-training on heterogeneous dialog corpora performs better than training a separate model per domain.
arXiv Detail & Related papers (2020-12-29T08:58:49Z) - BREEDS: Benchmarks for Subpopulation Shift [98.90314444545204]
We develop a methodology for assessing the robustness of models to subpopulation shift.
We leverage the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions.
Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity.
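A toy illustration of the split described above, with a made-up class hierarchy standing in for the ImageNet hierarchy used by BREEDS: each superclass keeps the same label in both splits but is populated by disjoint subclasses.

```python
# Hypothetical two-superclass hierarchy; BREEDS uses ImageNet's class structure.
hierarchy = {
    "dog":  ["beagle", "husky", "poodle", "terrier"],
    "bird": ["sparrow", "eagle", "parrot", "owl"],
}

def subpopulation_split(hierarchy, train_frac=0.5):
    train_pops, test_pops = {}, {}
    for superclass, subclasses in hierarchy.items():
        cut = int(len(subclasses) * train_frac)
        train_pops[superclass] = subclasses[:cut]  # subpopulations seen in training
        test_pops[superclass] = subclasses[cut:]   # held-out subpopulations
    return train_pops, test_pops

train_pops, test_pops = subpopulation_split(hierarchy)
print("train:", train_pops)
print("test: ", test_pops)
```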
arXiv Detail & Related papers (2020-08-11T17:04:47Z) - Interpretable Meta-Measure for Model Performance [4.91155110560629]
We introduce a new meta-score assessment named Elo-based Predictive Power (EPP).
EPP is built on top of other performance measures and allows for interpretable comparisons of models.
We prove the mathematical properties of EPP and support them with empirical results of a large-scale benchmark on 30 classification datasets and a real-world benchmark for visual data.
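For intuition, the sketch below applies the standard Elo update to synthetic pairwise model comparisons; it is the generic Elo rating scheme, not the paper's exact EPP estimator.

```python
# Treat each head-to-head win of one model over another as a "match".
def expected(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, winner, loser, k=32):
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e_w)
    ratings[loser] -= k * (1 - e_w)

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
matches = [("model_a", "model_b"), ("model_a", "model_c"),
           ("model_b", "model_c"), ("model_a", "model_b")]  # synthetic outcomes
for winner, loser in matches:
    update(ratings, winner, loser)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```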
arXiv Detail & Related papers (2020-06-02T14:10:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.