Personalized Benchmarking with the Ludwig Benchmarking Toolkit
- URL: http://arxiv.org/abs/2111.04260v1
- Date: Mon, 8 Nov 2021 03:53:38 GMT
- Title: Personalized Benchmarking with the Ludwig Benchmarking Toolkit
- Authors: Avanika Narayan, Piero Molino, Karan Goel, Willie Neiswanger,
Christopher Ré (Department of Computer Science, Stanford University)
- Abstract summary: Ludwig Benchmarking Toolkit (LBT) is a personalized benchmarking toolkit for running end-to-end benchmark studies.
LBT provides an interface for controlling training and customizing evaluation, a standardized training framework for eliminating confounding variables, and support for multi-objective evaluation.
We show how LBT can be used to create personalized benchmark studies with a large-scale comparative analysis for text classification across 7 models and 9 datasets.
- Score: 12.347185532330919
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid proliferation of machine learning models across domains and
deployment settings has given rise to various communities (e.g. industry
practitioners) which seek to benchmark models across tasks and objectives of
personal value. Unfortunately, these users cannot use standard benchmark
results to perform such value-driven comparisons as traditional benchmarks
evaluate models on a single objective (e.g. average accuracy) and fail to
facilitate a standardized training framework that controls for confounding
variables (e.g. computational budget), making fair comparisons difficult. To
address these challenges, we introduce the open-source Ludwig Benchmarking
Toolkit (LBT), a personalized benchmarking toolkit for running end-to-end
benchmark studies (from hyperparameter optimization to evaluation) across an
easily extensible set of tasks, deep learning models, datasets and evaluation
metrics. LBT provides a configurable interface for controlling training and
customizing evaluation, a standardized training framework for eliminating
confounding variables, and support for multi-objective evaluation. We
demonstrate how LBT can be used to create personalized benchmark studies with a
large-scale comparative analysis for text classification across 7 models and 9
datasets. We explore the trade-offs between inference latency and performance,
relationships between dataset attributes and performance, and the effects of
pretraining on convergence and robustness, showing how LBT can be used to
satisfy various benchmarking objectives.
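To make the abstract's notion of a configurable, multi-objective benchmark study more concrete, the sketch below shows what a user-defined study and a simple Pareto comparison over accuracy and inference latency might look like. The configuration schema, model names, scores, and pareto_front helper are hypothetical illustrations, not LBT's actual API.

```python
# Illustrative sketch only: the configuration schema, model names, scores, and
# pareto_front helper below are hypothetical and are NOT LBT's actual API.

# A personalized benchmark spec: fix confounders (shared hyperopt budget) and
# declare the objectives the user actually cares about.
benchmark_config = {
    "task": "text_classification",
    "datasets": ["ag_news", "sst5"],
    "models": ["rnn", "distilbert", "electra"],
    "hyperopt": {"search_budget": 20, "optimize_for": "accuracy"},
    "objectives": ["accuracy", "inference_latency_ms"],  # multi-objective evaluation
}

def pareto_front(results):
    """Return models not dominated on (higher accuracy, lower latency)."""
    front = []
    for name, (acc, lat) in results.items():
        dominated = any(
            o_acc >= acc and o_lat <= lat and (o_acc > acc or o_lat < lat)
            for other, (o_acc, o_lat) in results.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical per-model results gathered after hyperparameter optimization.
results = {"rnn": (0.88, 3.1), "distilbert": (0.92, 11.4), "electra": (0.93, 14.0)}
print(pareto_front(results))  # all three survive: each trades accuracy for latency
```

The point of the multi-objective view is that a single leaderboard number becomes a trade-off comparison, which is what the latency-versus-performance analysis in the paper's case study exercises.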
Related papers
- Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks.
We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
arXiv Detail & Related papers (2025-02-17T18:04:39Z)
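As a hedged illustration of the meta-statistical idea summarized above (recasting a distribution-level inference task as supervised learning), the toy sketch below trains an off-the-shelf regressor to predict the standard deviation of the distribution a sample set was drawn from. The sample sizes and regressor choice are assumptions for illustration, not the paper's LLM-based setup.

```python
# Toy stand-in for the "statistical inference as supervised learning" idea:
# learn to predict a distribution-level quantity (the generating standard
# deviation) directly from raw samples. Sizes and regressor are assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_sets, set_size = 2000, 30

# Each training example is one sample set; its label is the true sigma.
sigmas = rng.uniform(0.5, 3.0, size=n_sets)
samples = rng.normal(0.0, sigmas[:, None], size=(n_sets, set_size))
X = np.sort(samples, axis=1)  # order statistics give a permutation-invariant encoding
y = sigmas

model = GradientBoostingRegressor().fit(X[:1500], y[:1500])
pred = model.predict(X[1500:])
print("mean absolute error vs. true sigma:", float(np.abs(pred - y[1500:]).mean()))
```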
- StaICC: Standardized Evaluation for Classification Task in In-context Learning [3.0531121420837226]
This paper proposes a standardized and easy-to-use evaluation toolkit (StaICC) for in-context classification.
For the standard classification task, we provide StaICC-Normal, which selects 10 widely used datasets and generates prompts with a fixed form.
We also provide a sub-benchmark, StaICC-Diag, for diagnosing ICL from several aspects, aiming for more robust inference.
arXiv Detail & Related papers (2025-01-27T00:05:12Z)
- Rethinking Relation Extraction: Beyond Shortcuts to Generalization with a Debiased Benchmark [53.876493664396506]
Benchmarks are crucial for evaluating machine learning algorithm performance, facilitating comparison and identifying superior solutions.
This paper addresses the issue of entity bias in relation extraction tasks, where models tend to rely on entity mentions rather than context.
We propose DREB, a debiased relation extraction benchmark that breaks the pseudo-correlation between entity mentions and relation types through entity replacement.
To establish a new baseline on DREB, we introduce MixDebias, a debiasing method combining data-level and model training-level techniques.
arXiv Detail & Related papers (2025-01-02T17:01:06Z)
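The entity-replacement construction described in the DREB entry above can be sketched as follows; the sentence, type inventory, replacement pool, and replace_entities helper are made up for illustration and are not DREB's actual pipeline.

```python
# Sketch of entity replacement for debiasing relation extraction: swap each
# entity mention for another entity of the same type so the relation label can
# no longer be guessed from the entity surface forms alone. All names made up.
import random

replacement_pool = {
    "PERSON": ["Alice Chen", "Ravi Patel"],
    "ORG": ["Acme Corp", "Globex"],
}

def replace_entities(tokens, entity_spans):
    """entity_spans: list of (start, end, type) over the token list."""
    out = list(tokens)
    # Process right-to-left so earlier span indices stay valid after resizing.
    for start, end, etype in sorted(entity_spans, reverse=True):
        out[start:end] = random.choice(replacement_pool[etype]).split()
    return out

sentence = "Steve Jobs founded Apple in 1976".split()
spans = [(0, 2, "PERSON"), (3, 4, "ORG")]
print(" ".join(replace_entities(sentence, spans)))
# e.g. "Ravi Patel founded Globex in 1976" -- the relation label is unchanged.
```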
- Quantifying Variance in Evaluation Benchmarks [34.12254884944099]
We measure variance in evaluation benchmarks, including seed variance across initialisations and monotonicity during training.
We find that simple changes, such as framing choice tasks as completion tasks, can often reduce variance at smaller scales.
More involved methods inspired by the human testing literature (such as item analysis and item response theory) struggle to meaningfully reduce variance.
arXiv Detail & Related papers (2024-06-14T17:59:54Z)
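A minimal sketch of the seed-variance measurement described in the entry above: run the same evaluation under several initialization seeds and report the spread. The train_and_eval stub and its numbers are placeholders, not the paper's experimental setup.

```python
# Minimal sketch of measuring seed variance: run the same evaluation under
# several initialization seeds and report the spread. train_and_eval is a
# placeholder for a real benchmark run; its numbers are purely illustrative.
import random
import statistics

def train_and_eval(seed):
    random.seed(seed)                      # stand-in for "train with this init seed"
    return 0.80 + random.gauss(0, 0.01)    # stand-in for the resulting benchmark score

scores = [train_and_eval(seed) for seed in range(10)]
print(f"mean score:   {statistics.mean(scores):.3f}")
print(f"seed std dev: {statistics.stdev(scores):.3f}")  # the spread a single run hides
```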
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess model performance, a typical approach is to construct evaluation benchmarks that measure the ability levels of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that fine-grained evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking [41.99715850562528]
We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison.
Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset.
arXiv Detail & Related papers (2021-05-21T01:17:52Z)
- RADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems [75.87418236410296]
We introduce the RADDLE benchmark, a collection of corpora and tools for evaluating the performance of models across a diverse set of domains.
RADDLE is designed to favor and encourage models with a strong generalization ability.
We evaluate recent state-of-the-art systems based on pre-training and fine-tuning, and find that grounded pre-training on heterogeneous dialog corpora performs better than training a separate model per domain.
arXiv Detail & Related papers (2020-12-29T08:58:49Z)
- BREEDS: Benchmarks for Subpopulation Shift [98.90314444545204]
We develop a methodology for assessing the robustness of models to subpopulation shift.
We leverage the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions.
Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity.
arXiv Detail & Related papers (2020-08-11T17:04:47Z)
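The BREEDS entry above leverages a class hierarchy to construct subpopulation shift; the sketch below shows the core idea with a made-up hierarchy, where each superclass keeps its label but its training and test subclasses are disjoint. The hierarchy and split rule are illustrative assumptions, not the ImageNet-based construction used in the paper.

```python
# Sketch of a subpopulation-shift split: each superclass keeps its label, but
# the subclasses used for training and testing are disjoint. The hierarchy and
# split rule are made up; BREEDS derives them from ImageNet's class structure.
hierarchy = {
    "dog":  ["terrier", "husky", "beagle", "poodle"],
    "bird": ["sparrow", "owl", "penguin", "hawk"],
}

def make_split(hierarchy, held_out_per_class=2):
    train, test = {}, {}
    for superclass, subclasses in hierarchy.items():
        train[superclass] = subclasses[:-held_out_per_class]
        test[superclass] = subclasses[-held_out_per_class:]
    return train, test

train_subpops, test_subpops = make_split(hierarchy)
print(train_subpops)  # {'dog': ['terrier', 'husky'], 'bird': ['sparrow', 'owl']}
print(test_subpops)   # {'dog': ['beagle', 'poodle'], 'bird': ['penguin', 'hawk']}
```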
- Interpretable Meta-Measure for Model Performance [4.91155110560629]
We introduce a new meta-score assessment named Elo-based Predictive Power (EPP).
EPP is built on top of other performance measures and allows for interpretable comparisons of models.
We prove the mathematical properties of EPP and support them with empirical results from a large-scale benchmark on 30 classification datasets and a real-world benchmark for visual data.
arXiv Detail & Related papers (2020-06-02T14:10:13Z)
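As a hedged illustration of an Elo-style meta-score like the EPP described above, the sketch below updates per-model ratings from pairwise win/loss outcomes. The update rule, K-factor, and match results are generic Elo conventions chosen for illustration, not the paper's EPP estimator.

```python
# Generic Elo-style rating over pairwise model comparisons, as an illustration
# of an Elo-based meta-score. The K-factor, update rule, and match outcomes are
# standard Elo conventions chosen for the sketch, not the paper's EPP estimator.
def expected_score(r_a, r_b):
    """Probability that the first model beats the second under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, winner, loser, k=32):
    exp_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - exp_win)
    ratings[loser] -= k * (1 - exp_win)

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
# Each pair records which model scored higher on one dataset / comparison.
matches = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
for winner, loser in matches:
    update(ratings, winner, loser)
print(ratings)  # higher rating => stronger predictive power across comparisons
```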