A Position Paper on the Automatic Generation of Machine Learning Leaderboards
- URL: http://arxiv.org/abs/2505.17465v1
- Date: Fri, 23 May 2025 04:46:10 GMT
- Title: A Position Paper on the Automatic Generation of Machine Learning Leaderboards
- Authors: Roelien C Timmer, Yufang Hou, Stephen Wan
- Abstract summary: An important task in machine learning (ML) research is comparing prior work, which is often performed via ML leaderboards. However, creating and maintaining these leaderboards becomes burdensome as the literature grows. To ease this burden, researchers have developed methods to extract leaderboard entries from research papers. Yet, prior work varies in problem framing, complicating comparisons and limiting real-world applicability. We propose a unified conceptual framework to standardise how the Automatic Leaderboard Generation (ALG) task is defined.
- Score: 12.736094044510224
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An important task in machine learning (ML) research is comparing prior work, which is often performed via ML leaderboards: a tabular overview of experiments with comparable conditions (e.g., same task, dataset, and metric). However, the growing volume of literature creates challenges in creating and maintaining these leaderboards. To ease this burden, researchers have developed methods to extract leaderboard entries from research papers for automated leaderboard curation. Yet, prior work varies in problem framing, complicating comparisons and limiting real-world applicability. In this position paper, we present the first overview of Automatic Leaderboard Generation (ALG) research, identifying fundamental differences in assumptions, scope, and output formats. We propose a unified conceptual framework that standardises how the ALG task is defined. We offer ALG benchmarking guidelines, including recommendations for datasets and metrics that promote fair, reproducible evaluation. Lastly, we outline challenges and new directions for ALG, such as advocating for broader coverage by including all reported results and richer metadata.
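To make the target output of ALG concrete, the minimal sketch below shows one possible way to represent extracted leaderboard entries as (task, dataset, metric, score, paper) records and group them into leaderboard tables. This is an illustrative assumption about the output format, not the schema prescribed by the position paper; all class, field, and function names (e.g. `LeaderboardEntry`, `build_leaderboards`) are hypothetical.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class LeaderboardEntry:
    """One extracted result: which paper reported which score, under what conditions.
    Field names are illustrative assumptions, not the paper's prescribed schema."""
    task: str       # e.g. "question answering"
    dataset: str    # e.g. "SQuAD 1.1"
    metric: str     # e.g. "F1"
    score: float    # the reported value
    paper: str      # identifier of the source paper

def build_leaderboards(entries):
    """Group entries by (task, dataset, metric) so each group forms one leaderboard,
    sorted with the best-scoring entry first (assumes higher is better)."""
    boards = defaultdict(list)
    for e in entries:
        boards[(e.task, e.dataset, e.metric)].append(e)
    return {
        key: sorted(group, key=lambda e: e.score, reverse=True)
        for key, group in boards.items()
    }

# Example with placeholder paper identifiers
entries = [
    LeaderboardEntry("QA", "SQuAD 1.1", "F1", 91.2, "paper-A"),
    LeaderboardEntry("QA", "SQuAD 1.1", "F1", 93.0, "paper-B"),
]
for key, board in build_leaderboards(entries).items():
    print(key, [(e.paper, e.score) for e in board])
```

A real ALG system would likely attach richer metadata to each entry (model name, evaluation settings, provenance within the paper), in line with the broader coverage the position paper advocates.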
Related papers
- Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol [83.90769864167301]
Literature review tables are essential for summarizing and comparing collections of scientific papers. We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers. Our contributions focus on three key challenges encountered in real-world use: (i) user prompts are often under-specified; (ii) retrieved candidate papers frequently contain irrelevant content; and (iii) task evaluation should move beyond shallow text similarity techniques.
arXiv Detail & Related papers (2025-04-14T14:52:28Z) - LAG: LLM agents for Leaderboard Auto Generation on Demanding [38.53050861010012]
Leaderboard Auto Generation (LAG) is a framework for automatic generation of leaderboards on a given research topic. Faced with a large number of AI papers updated daily, researchers find it difficult to track every paper's proposed methods, experimental results, and settings. Our contributions include a comprehensive solution to the leaderboard construction problem, a reliable evaluation method, and experimental results showing the high quality of leaderboards.
arXiv Detail & Related papers (2025-02-25T13:54:03Z) - Transformers Utilization in Chart Understanding: A Review of Recent Advances & Future Trends [1.124958340749622]
This paper reviews prominent research in Chart Understanding (CU).
It focuses on State-of-The-Art (SoTA) frameworks that employ transformers within End-to-End (E2E) solutions.
This article identifies key challenges and outlines promising future directions for advancing CU solutions.
arXiv Detail & Related papers (2024-10-05T16:26:44Z) - Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z) - Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards [67.65408769829524]
Scientific leaderboards are standardized ranking systems that facilitate evaluating and comparing competitive methods.
The exponential increase in publications has made it infeasible to construct and maintain these leaderboards manually.
Automatic leaderboard construction has emerged as a solution to reduce manual labor.
arXiv Detail & Related papers (2024-09-19T11:12:27Z) - On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval [59.25292920967197]
Few-shot visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z) - Striking Gold in Advertising: Standardization and Exploration of Ad Text Generation [5.3558730908641525]
We propose a first benchmark dataset, CAMERA, to standardize the task of ad text generation (ATG).
Our experiments show the current state and the remaining challenges.
We also explore how existing metrics in ATG and an LLM-based evaluator align with human evaluations.
arXiv Detail & Related papers (2023-09-21T12:51:24Z) - Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard [47.73060223236792]
BEIR is a benchmark dataset for evaluation of information retrieval models across 18 different domain/task combinations.
Our work addresses two shortcomings that prevent the benchmark from achieving its full potential.
arXiv Detail & Related papers (2023-06-13T00:26:18Z) - GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation [83.10599735938618]
Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository.
This work introduces GENIE, a human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks.
arXiv Detail & Related papers (2021-01-17T00:40:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.