Personalized Benchmarking with the Ludwig Benchmarking Toolkit
- URL: http://arxiv.org/abs/2111.04260v1
- Date: Mon, 8 Nov 2021 03:53:38 GMT
- Title: Personalized Benchmarking with the Ludwig Benchmarking Toolkit
- Authors: Avanika Narayan, Piero Molino, Karan Goel, Willie Neiswanger,
Christopher Ré (Department of Computer Science, Stanford University)
- Abstract summary: Ludwig Benchmarking Toolkit (LBT) is a personalized benchmarking toolkit for running end-to-end benchmark studies.
LBT provides an interface for controlling training and customizing evaluation, a standardized training framework for eliminating confounding variables, and support for multi-objective evaluation.
We show how LBT can be used to create personalized benchmark studies with a large-scale comparative analysis for text classification across 7 models and 9 datasets.
- Score: 12.347185532330919
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid proliferation of machine learning models across domains and
deployment settings has given rise to various communities (e.g. industry
practitioners) which seek to benchmark models across tasks and objectives of
personal value. Unfortunately, these users cannot use standard benchmark
results to perform such value-driven comparisons as traditional benchmarks
evaluate models on a single objective (e.g. average accuracy) and fail to
facilitate a standardized training framework that controls for confounding
variables (e.g. computational budget), making fair comparisons difficult. To
address these challenges, we introduce the open-source Ludwig Benchmarking
Toolkit (LBT), a personalized benchmarking toolkit for running end-to-end
benchmark studies (from hyperparameter optimization to evaluation) across an
easily extensible set of tasks, deep learning models, datasets and evaluation
metrics. LBT provides a configurable interface for controlling training and
customizing evaluation, a standardized training framework for eliminating
confounding variables, and support for multi-objective evaluation. We
demonstrate how LBT can be used to create personalized benchmark studies with a
large-scale comparative analysis for text classification across 7 models and 9
datasets. We explore the trade-offs between inference latency and performance,
relationships between dataset attributes and performance, and the effects of
pretraining on convergence and robustness, showing how LBT can be used to
satisfy various benchmarking objectives.
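To make the abstract's notion of a configurable, multi-objective benchmark study more concrete, the sketch below shows what a user-defined study and a simple Pareto comparison over accuracy and inference latency might look like. The configuration schema, model names, scores, and pareto_front helper are hypothetical illustrations, not LBT's actual API.

```python
# Illustrative sketch only: the configuration schema, model names, scores, and
# pareto_front helper below are hypothetical and are NOT LBT's actual API.

# A personalized benchmark spec: fix confounders (shared hyperopt budget) and
# declare the objectives the user actually cares about.
benchmark_config = {
    "task": "text_classification",
    "datasets": ["ag_news", "sst5"],
    "models": ["rnn", "distilbert", "electra"],
    "hyperopt": {"search_budget": 20, "optimize_for": "accuracy"},
    "objectives": ["accuracy", "inference_latency_ms"],  # multi-objective evaluation
}

def pareto_front(results):
    """Return models not dominated on (higher accuracy, lower latency)."""
    front = []
    for name, (acc, lat) in results.items():
        dominated = any(
            o_acc >= acc and o_lat <= lat and (o_acc > acc or o_lat < lat)
            for other, (o_acc, o_lat) in results.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical per-model results gathered after hyperparameter optimization.
results = {"rnn": (0.88, 3.1), "distilbert": (0.92, 11.4), "electra": (0.93, 14.0)}
print(pareto_front(results))  # all three survive: each trades accuracy for latency
```

The point of the multi-objective view is that a single leaderboard number becomes a trade-off comparison, which is what the latency-versus-performance analysis in the paper's case study exercises.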
Related papers
- Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks.
We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
arXiv Detail & Related papers (2025-02-17T18:04:39Z)
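As a hedged illustration of the meta-statistical idea summarized above (recasting a distribution-level inference task as supervised learning), the toy sketch below trains an off-the-shelf regressor to predict the standard deviation of the distribution a sample set was drawn from. The sample sizes and regressor choice are assumptions for illustration, not the paper's LLM-based setup.

```python
# Toy stand-in for the "statistical inference as supervised learning" idea:
# learn to predict a distribution-level quantity (the generating standard
# deviation) directly from raw samples. Sizes and regressor are assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_sets, set_size = 2000, 30

# Each training example is one sample set; its label is the true sigma.
sigmas = rng.uniform(0.5, 3.0, size=n_sets)
samples = rng.normal(0.0, sigmas[:, None], size=(n_sets, set_size))
X = np.sort(samples, axis=1)  # order statistics give a permutation-invariant encoding
y = sigmas

model = GradientBoostingRegressor().fit(X[:1500], y[:1500])
pred = model.predict(X[1500:])
print("mean absolute error vs. true sigma:", float(np.abs(pred - y[1500:]).mean()))
```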
- StaICC: Standardized Evaluation for Classification Task in In-context Learning [3.0531121420837226]
This paper proposes a standardized and easy-to-use evaluation toolkit (StaICC) for in-context classification.
For the standard classification task, we provide StaICC-Normal, which selects 10 widely used datasets and generates prompts with a fixed form.
We also provide a sub-benchmark, StaICC-Diag, for diagnosing ICL from several aspects, aiming for more robust inference.
arXiv Detail & Related papers (2025-01-27T00:05:12Z)
- Rethinking Relation Extraction: Beyond Shortcuts to Generalization with a Debiased Benchmark [53.876493664396506]
Benchmarks are crucial for evaluating machine learning algorithm performance, facilitating comparison and identifying superior solutions.
This paper addresses the issue of entity bias in relation extraction tasks, where models tend to rely on entity mentions rather than context.
We propose DREB, a debiased relation extraction benchmark that breaks the pseudo-correlation between entity mentions and relation types through entity replacement.
To establish a new baseline on DREB, we introduce MixDebias, a debiasing method combining data-level and model training-level techniques.
arXiv Detail & Related papers (2025-01-02T17:01:06Z)
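The entity-replacement construction described in the DREB entry above can be sketched as follows; the sentence, type inventory, replacement pool, and replace_entities helper are made up for illustration and are not DREB's actual pipeline.

```python
# Sketch of entity replacement for debiasing relation extraction: swap each
# entity mention for another entity of the same type so the relation label can
# no longer be guessed from the entity surface forms alone. All names made up.
import random

replacement_pool = {
    "PERSON": ["Alice Chen", "Ravi Patel"],
    "ORG": ["Acme Corp", "Globex"],
}

def replace_entities(tokens, entity_spans):
    """entity_spans: list of (start, end, type) over the token list."""
    out = list(tokens)
    # Process right-to-left so earlier span indices stay valid after resizing.
    for start, end, etype in sorted(entity_spans, reverse=True):
        out[start:end] = random.choice(replacement_pool[etype]).split()
    return out

sentence = "Steve Jobs founded Apple in 1976".split()
spans = [(0, 2, "PERSON"), (3, 4, "ORG")]
print(" ".join(replace_entities(sentence, spans)))
# e.g. "Ravi Patel founded Globex in 1976" -- the relation label is unchanged.
```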
- Quantifying Variance in Evaluation Benchmarks [34.12254884944099]
We measure variance in evaluation benchmarks, including seed variance across initialisations and monotonicity during training.
We find that simple changes, such as framing choice tasks as completion tasks, can often reduce variance at smaller scales.
More involved methods inspired by the human testing literature (such as item analysis and item response theory) struggle to meaningfully reduce variance.
arXiv Detail & Related papers (2024-06-14T17:59:54Z)
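A minimal sketch of the seed-variance measurement described in the entry above: run the same evaluation under several initialization seeds and report the spread. The train_and_eval stub and its numbers are placeholders, not the paper's experimental setup.

```python
# Minimal sketch of measuring seed variance: run the same evaluation under
# several initialization seeds and report the spread. train_and_eval is a
# placeholder for a real benchmark run; its numbers are purely illustrative.
import random
import statistics

def train_and_eval(seed):
    random.seed(seed)                      # stand-in for "train with this init seed"
    return 0.80 + random.gauss(0, 0.01)    # stand-in for the resulting benchmark score

scores = [train_and_eval(seed) for seed in range(10)]
print(f"mean score:   {statistics.mean(scores):.3f}")
print(f"seed std dev: {statistics.stdev(scores):.3f}")  # the spread a single run hides
```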
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess model performance, a typical approach is to construct evaluation benchmarks that measure the ability levels of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that fine-grained evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking [41.99715850562528]
We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison.
Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset.
arXiv Detail & Related papers (2021-05-21T01:17:52Z)
- RADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems [75.87418236410296]
We introduce the RADDLE benchmark, a collection of corpora and tools for evaluating the performance of models across a diverse set of domains.
RADDLE is designed to favor and encourage models with a strong generalization ability.
We evaluate recent state-of-the-art systems based on pre-training and fine-tuning, and find that grounded pre-training on heterogeneous dialog corpora performs better than training a separate model per domain.
arXiv Detail & Related papers (2020-12-29T08:58:49Z)
- BREEDS: Benchmarks for Subpopulation Shift [98.90314444545204]
We develop a methodology for assessing the robustness of models to subpopulation shift.
We leverage the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions.
Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity.
arXiv Detail & Related papers (2020-08-11T17:04:47Z)
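The BREEDS entry above leverages a class hierarchy to construct subpopulation shift; the sketch below shows the core idea with a made-up hierarchy, where each superclass keeps its label but its training and test subclasses are disjoint. The hierarchy and split rule are illustrative assumptions, not the ImageNet-based construction used in the paper.

```python
# Sketch of a subpopulation-shift split: each superclass keeps its label, but
# the subclasses used for training and testing are disjoint. The hierarchy and
# split rule are made up; BREEDS derives them from ImageNet's class structure.
hierarchy = {
    "dog":  ["terrier", "husky", "beagle", "poodle"],
    "bird": ["sparrow", "owl", "penguin", "hawk"],
}

def make_split(hierarchy, held_out_per_class=2):
    train, test = {}, {}
    for superclass, subclasses in hierarchy.items():
        train[superclass] = subclasses[:-held_out_per_class]
        test[superclass] = subclasses[-held_out_per_class:]
    return train, test

train_subpops, test_subpops = make_split(hierarchy)
print(train_subpops)  # {'dog': ['terrier', 'husky'], 'bird': ['sparrow', 'owl']}
print(test_subpops)   # {'dog': ['beagle', 'poodle'], 'bird': ['penguin', 'hawk']}
```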
- Interpretable Meta-Measure for Model Performance [4.91155110560629]
We introduce a new meta-score assessment named Elo-based Predictive Power (EPP).
EPP is built on top of other performance measures and allows for interpretable comparisons of models.
We prove the mathematical properties of EPP and support them with empirical results from a large-scale benchmark on 30 classification datasets and a real-world benchmark for visual data.
arXiv Detail & Related papers (2020-06-02T14:10:13Z)
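As a hedged illustration of an Elo-style meta-score like the EPP described above, the sketch below updates per-model ratings from pairwise win/loss outcomes. The update rule, K-factor, and match results are generic Elo conventions chosen for illustration, not the paper's EPP estimator.

```python
# Generic Elo-style rating over pairwise model comparisons, as an illustration
# of an Elo-based meta-score. The K-factor, update rule, and match outcomes are
# standard Elo conventions chosen for the sketch, not the paper's EPP estimator.
def expected_score(r_a, r_b):
    """Probability that the first model beats the second under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, winner, loser, k=32):
    exp_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - exp_win)
    ratings[loser] -= k * (1 - exp_win)

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
# Each pair records which model scored higher on one dataset / comparison.
matches = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
for winner, loser in matches:
    update(ratings, winner, loser)
print(ratings)  # higher rating => stronger predictive power across comparisons
```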