Evaluate & Evaluation on the Hub: Better Best Practices for Data and
Model Measurements
- URL: http://arxiv.org/abs/2210.01970v2
- Date: Thu, 6 Oct 2022 16:12:17 GMT
- Title: Evaluate & Evaluation on the Hub: Better Best Practices for Data and
Model Measurements
- Authors: Leandro von Werra, Lewis Tunstall, Abhishek Thakur, Alexandra Sasha
Luccioni, Tristan Thrush, Aleksandra Piktus, Felix Marty, Nazneen Rajani,
Victor Mustar, Helen Ngo, Omar Sanseviero, Mario Šaško, Albert
Villanova, Quentin Lhoest, Julien Chaumond, Margaret Mitchell, Alexander M.
Rush, Thomas Wolf, Douwe Kiela
- Abstract summary: Evaluate is a library to support best practices for measurements, metrics, and comparisons of data and models.
Evaluation on the Hub is a platform that enables the large-scale evaluation of over 75,000 models and 11,000 datasets.
- Score: 167.73134600289603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluation is a key part of machine learning (ML), yet there is a lack of
support and tooling to enable its informed and systematic practice. We
introduce Evaluate and Evaluation on the Hub, a set of tools to facilitate the
evaluation of models and datasets in ML. Evaluate is a library to support best
practices for measurements, metrics, and comparisons of data and models. Its
goal is to support reproducibility of evaluation, centralize and document the
evaluation process, and broaden evaluation to cover more facets of model
performance. It includes over 50 efficient canonical implementations for a
variety of domains and scenarios, interactive documentation, and the ability to
easily share implementations and outcomes. The library is available at
https://github.com/huggingface/evaluate. In addition, we introduce Evaluation
on the Hub, a platform that enables the large-scale evaluation of over 75,000
models and 11,000 datasets on the Hugging Face Hub, for free, at the click of a
button. Evaluation on the Hub is available at
https://huggingface.co/autoevaluate.
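As a rough illustration of the workflow the abstract describes, the sketch below loads one of the library's canonical metric implementations and computes it on a toy batch. The metric names and the example predictions/references are illustrative assumptions, not examples taken from the paper.

```python
# Minimal usage sketch for the evaluate library
# (https://github.com/huggingface/evaluate).
# The metric names and the toy predictions/references below are
# illustrative assumptions, not examples from the paper.
import evaluate

# Load a canonical metric implementation.
accuracy = evaluate.load("accuracy")

# Compute the metric on a small toy batch.
results = accuracy.compute(
    predictions=[0, 1, 1, 0],
    references=[0, 1, 0, 0],
)
print(results)  # {'accuracy': 0.75}

# Several metrics can be grouped and computed in one call.
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
print(clf_metrics.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0]))
```

Evaluation on the Hub exposes comparable large-scale evaluations through the web interface, with no code required.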
Related papers
- CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
CompassJudger-1 is the first open-source all-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
JudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z)
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- Towards Personalized Evaluation of Large Language Models with An Anonymous Crowd-Sourcing Platform [64.76104135495576]
We propose a novel anonymous crowd-sourcing evaluation platform, BingJian, for large language models.
Through this platform, users have the opportunity to submit their questions, testing the models on a personalized and potentially broader range of capabilities.
arXiv Detail & Related papers (2024-03-13T07:31:20Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- Evaluating Representations with Readout Model Switching [18.475866691786695]
In this paper, we propose to use the Minimum Description Length (MDL) principle to devise an evaluation metric.
We design a hybrid discrete and continuous-valued model space for the readout models and employ a switching strategy to combine their predictions.
The proposed metric can be efficiently computed with an online method and we present results for pre-trained vision encoders of various architectures.
arXiv Detail & Related papers (2023-02-19T14:08:01Z)
- Summary Workbench: Unifying Application and Evaluation of Text Summarization Models [24.40171915438056]
New models and evaluation measures can be easily integrated as Docker-based plugins.
Visual analyses combining multiple measures provide insights into the models' strengths and weaknesses.
arXiv Detail & Related papers (2022-10-18T04:47:25Z)
- On the Evaluation of RGB-D-based Categorical Pose and Shape Estimation [5.71097144710995]
In this work we take a critical look at this predominant evaluation protocol including metrics and datasets.
We propose a new set of metrics, contribute new annotations for the Redwood dataset and evaluate state-of-the-art methods in a fair comparison.
arXiv Detail & Related papers (2022-02-21T16:31:18Z)
- SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z)
- MLModelScope: A Distributed Platform for Model Evaluation and Benchmarking at Scale [32.62513495487506]
Machine Learning (ML) and Deep Learning (DL) innovations are being introduced at such a rapid pace that researchers are hard-pressed to analyze and study them.
The complicated procedures for evaluating innovations, along with the lack of standard and efficient ways of specifying and provisioning ML/DL evaluation, are a major "pain point" for the community.
This paper proposes MLModelScope, an open-source, framework/hardware-agnostic, and customizable design that enables repeatable, fair, and scalable model evaluation and benchmarking.
arXiv Detail & Related papers (2020-02-19T17:13:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.