Evaluate & Evaluation on the Hub: Better Best Practices for Data and
Model Measurements
- URL: http://arxiv.org/abs/2210.01970v2
- Date: Thu, 6 Oct 2022 16:12:17 GMT
- Title: Evaluate & Evaluation on the Hub: Better Best Practices for Data and
Model Measurements
- Authors: Leandro von Werra, Lewis Tunstall, Abhishek Thakur, Alexandra Sasha
Luccioni, Tristan Thrush, Aleksandra Piktus, Felix Marty, Nazneen Rajani,
Victor Mustar, Helen Ngo, Omar Sanseviero, Mario Šaško, Albert
Villanova, Quentin Lhoest, Julien Chaumond, Margaret Mitchell, Alexander M.
Rush, Thomas Wolf, Douwe Kiela
- Abstract summary: evaluate is a library to support best practices for measurements, metrics, and comparisons of data and models.
Evaluation on the Hub is a platform that enables the large-scale evaluation of over 75,000 models and 11,000 datasets.
- Score: 167.73134600289603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluation is a key part of machine learning (ML), yet there is a lack of
support and tooling to enable its informed and systematic practice. We
introduce Evaluate and Evaluation on the Hub, a set of tools to facilitate the
evaluation of models and datasets in ML. Evaluate is a library to support best
practices for measurements, metrics, and comparisons of data and models. Its
goal is to support reproducibility of evaluation, centralize and document the
evaluation process, and broaden evaluation to cover more facets of model
performance. It includes over 50 efficient canonical implementations for a
variety of domains and scenarios, interactive documentation, and the ability to
easily share implementations and outcomes. The library is available at
https://github.com/huggingface/evaluate. In addition, we introduce Evaluation
on the Hub, a platform that enables the large-scale evaluation of over 75,000
models and 11,000 datasets on the Hugging Face Hub, for free, at the click of a
button. Evaluation on the Hub is available at
https://huggingface.co/autoevaluate.
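The abstract distinguishes three kinds of evaluation: metrics, comparisons, and measurements. The sketch below illustrates that distinction using only the Python standard library. It is not the Evaluate library's API: the function names `accuracy`, `agreement`, and `label_distribution`, and the dictionary return shape, are hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def _check_lengths(a: Sequence, b: Sequence) -> None:
    """Shared validation: both inputs must be non-empty and equally long."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if len(a) == 0:
        raise ValueError("at least one example is required")


def accuracy(predictions: Sequence[Hashable], references: Sequence[Hashable]) -> dict:
    """Metric (hypothetical helper): score a model's predictions against references."""
    _check_lengths(predictions, references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


def agreement(predictions1: Sequence[Hashable], predictions2: Sequence[Hashable]) -> dict:
    """Comparison (hypothetical helper): how often two models give the same output."""
    _check_lengths(predictions1, predictions2)
    same = sum(a == b for a, b in zip(predictions1, predictions2))
    return {"agreement": same / len(predictions1)}


def label_distribution(labels: Sequence[Hashable]) -> dict:
    """Measurement (hypothetical helper): describe a dataset by relative label frequency."""
    if len(labels) == 0:
        raise ValueError("at least one label is required")
    counts = Counter(labels)
    return {"label_distribution": {label: n / len(labels) for label, n in counts.items()}}


# Toy data, invented for this example only.
references = [0, 1, 1, 0]
model_a = [0, 1, 0, 0]
model_b = [0, 1, 1, 1]

print(accuracy(model_a, references))      # metric: model vs. ground truth
print(agreement(model_a, model_b))        # comparison: model vs. model
print(label_distribution(references))     # measurement: property of the data
```

Returning a dictionary from every function keeps the three kinds of result uniform, so they can be logged or shared in the same way; that is a design assumption of this sketch rather than a claim about the library.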
Related papers
- CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
CompassJudger-1 is the first open-source all-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
JudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z)
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- Towards Personalized Evaluation of Large Language Models with An Anonymous Crowd-Sourcing Platform [64.76104135495576]
We propose a novel anonymous crowd-sourcing evaluation platform, BingJian, for large language models.
Through this platform, users have the opportunity to submit their questions, testing the models on a personalized and potentially broader range of capabilities.
arXiv Detail & Related papers (2024-03-13T07:31:20Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- Evaluating Representations with Readout Model Switching [19.907607374144167]
In this paper, we propose to use the Minimum Description Length (MDL) principle to devise an evaluation metric.
We design a hybrid discrete and continuous-valued model space for the readout models and employ a switching strategy to combine their predictions.
The proposed metric can be efficiently computed with an online method and we present results for pre-trained vision encoders of various architectures.
arXiv Detail & Related papers (2023-02-19T14:08:01Z)
- Summary Workbench: Unifying Application and Evaluation of Text Summarization Models [24.40171915438056]
New models and evaluation measures can be easily integrated as Docker-based plugins.
Visual analyses combining multiple measures provide insights into the models' strengths and weaknesses.
arXiv Detail & Related papers (2022-10-18T04:47:25Z)
- On the Evaluation of RGB-D-based Categorical Pose and Shape Estimation [5.71097144710995]
In this work we take a critical look at this predominant evaluation protocol including metrics and datasets.
We propose a new set of metrics, contribute new annotations for the Redwood dataset and evaluate state-of-the-art methods in a fair comparison.
arXiv Detail & Related papers (2022-02-21T16:31:18Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
- MLModelScope: A Distributed Platform for Model Evaluation and Benchmarking at Scale [32.62513495487506]
Machine Learning (ML) and Deep Learning (DL) innovations are being introduced at such a rapid pace that researchers are hard-pressed to analyze and study them.
The complicated procedures for evaluating innovations, along with the lack of standard and efficient ways of specifying and provisioning ML/DL evaluation, are a major "pain point" for the community.
This paper proposes MLModelScope, an open-source, framework/hardware-agnostic, and customizable design that enables repeatable, fair, and scalable model evaluation and benchmarking.
arXiv Detail & Related papers (2020-02-19T17:13:01Z)