MetaLead: A Comprehensive Human-Curated Leaderboard Dataset for Transparent Reporting of Machine Learning Experiments
- URL: http://arxiv.org/abs/2601.22420v1
- Date: Fri, 30 Jan 2026 00:16:35 GMT
- Title: MetaLead: A Comprehensive Human-Curated Leaderboard Dataset for Transparent Reporting of Machine Learning Experiments
- Authors: Roelien C. Timmer, Necva Bölücü, Stephen Wan
- Abstract summary: Leaderboards are crucial in the machine learning (ML) domain for benchmarking and tracking progress. We present MetaLead, a fully human-annotated dataset that captures all experimental results for result transparency.
- Score: 2.8973763292318075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Leaderboards are crucial in the machine learning (ML) domain for benchmarking and tracking progress. However, creating leaderboards traditionally demands significant manual effort. In recent years, efforts have been made to automate leaderboard generation, but existing datasets for this purpose are limited: they capture only the best results from each paper and include little metadata. We present MetaLead, a fully human-annotated ML leaderboard dataset that captures all experimental results for result transparency and contains extra metadata, such as the experiment type of each result (baseline, proposed method, or variation of the proposed method) for experiment-type-guided comparisons, and explicitly separates training and test datasets for cross-domain assessment. This enriched structure makes MetaLead a powerful resource for more transparent and nuanced evaluations across ML research.
Related papers
- MacrOData: New Benchmarks of Thousands of Datasets for Tabular Outlier Detection [25.690005491942884]
Outlier detection (OD) on tabular data underpins numerous real-world applications. The prominent OD benchmark AdBench is the de facto standard in the literature, yet comprises only 57 datasets. We introduce MacrOData, a large-scale benchmark suite for tabular OD comprising three carefully curated components. Owing to its scale and diversity, MacrOData enables comprehensive and statistically robust evaluation of OD methods.
arXiv Detail & Related papers (2026-02-10T01:51:41Z)
- MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs [48.73595915402094]
MOLE is a framework that automatically extracts metadata attributes from scientific papers covering datasets in languages other than Arabic. Our methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output.
arXiv Detail & Related papers (2025-05-26T10:31:26Z)
- A Position Paper on the Automatic Generation of Machine Learning Leaderboards [9.766725069582836]
An important task in machine learning (ML) research is comparing prior work, which is often performed via ML leaderboards. To ease this burden, researchers have developed methods to extract leaderboard entries from research papers. Yet, prior work varies in problem framing, complicating comparisons and limiting real-world applicability. We propose a unified conceptual framework to standardise how the automatic leaderboard generation (ALG) task is defined.
arXiv Detail & Related papers (2025-05-23T04:46:10Z)
- League: Leaderboard Generation on Demand [67.69633959139523]
Faced with a large number of AI papers updated daily, it becomes difficult for researchers to track every paper's proposed methods, experimental results, and settings. Leaderboard Auto Generation (LAG) is a framework for the automatic generation of leaderboards on a given research topic. Our contributions include a comprehensive solution to the leaderboard construction problem, a reliable evaluation method, and experimental results showing the high quality of the generated leaderboards.
arXiv Detail & Related papers (2025-02-25T13:54:03Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, negatively impacting training efficiency and model performance. Data selection has shown promise in identifying the most representative samples from the entire dataset. We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards [67.65408769829524]
Scientific leaderboards are standardized ranking systems that facilitate evaluating and comparing competitive methods. The exponential increase in publications has made it infeasible to construct and maintain these leaderboards manually. Automatic leaderboard construction has emerged as a solution to reduce this manual labor.
arXiv Detail & Related papers (2024-09-19T11:12:27Z)
- The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts [0.0]
This paper introduces the MERIT dataset, a fully labeled dataset within the context of school reports. By its nature, the MERIT dataset can potentially include biases in a controlled way, making it a valuable tool for benchmarking biases induced in Large Language Models (LLMs). To demonstrate the dataset's utility, we present a benchmark with token classification models, showing that the dataset poses a significant challenge even for SOTA models.
arXiv Detail & Related papers (2024-08-31T12:56:38Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation. Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results. For targeted generation, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data. For quality, we resort to GPT-4 to generate high-quality data of each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization that captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.