Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards
- URL: http://arxiv.org/abs/2409.12656v1
- Date: Thu, 19 Sep 2024 11:12:27 GMT
- Title: Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards
- Authors: Furkan Şahinuç, Thy Thy Tran, Yulia Grishina, Yufang Hou, Bei Chen, Iryna Gurevych
- Abstract summary: Scientific leaderboards are standardized ranking systems that facilitate evaluating and comparing competitive methods.
The exponential increase in publications has made it infeasible to construct and maintain these leaderboards manually.
Automatic leaderboard construction has emerged as a solution to reduce manual labor.
- Score: 67.65408769829524
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Scientific leaderboards are standardized ranking systems that facilitate evaluating and comparing competitive methods. Typically, a leaderboard is defined by a task, dataset, and evaluation metric (TDM) triple, allowing objective performance assessment and fostering innovation through benchmarking. However, the exponential increase in publications has made it infeasible to construct and maintain these leaderboards manually. Automatic leaderboard construction has emerged as a solution to reduce manual labor. Existing datasets for this task are based on the community-contributed leaderboards without additional curation. Our analysis shows that a large portion of these leaderboards are incomplete, and some of them contain incorrect information. In this work, we present SciLead, a manually-curated Scientific Leaderboard dataset that overcomes the aforementioned problems. Building on this dataset, we propose three experimental settings that simulate real-world scenarios where TDM triples are fully defined, partially defined, or undefined during leaderboard construction. While previous research has only explored the first setting, the latter two are more representative of real-world applications. To address these diverse settings, we develop a comprehensive LLM-based framework for constructing leaderboards. Our experiments and analysis reveal that various LLMs often correctly identify TDM triples while struggling to extract result values from publications. We make our code and data publicly available.
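The TDM framing above maps naturally onto a small data model. The following is a minimal, hypothetical Python sketch of leaderboard construction from per-paper extractions keyed by task-dataset-metric triples; the class and function names (TDM, Result, build_leaderboards) and the example values are invented for illustration and are not the authors' SciLead code.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class TDM:
    """A leaderboard key: task, dataset, and evaluation metric."""
    task: str
    dataset: str
    metric: str

@dataclass
class Result:
    """One extracted result: which paper/method reported which value."""
    paper_id: str
    method: str
    value: float

def build_leaderboards(extractions: list[tuple[TDM, Result]],
                       higher_is_better: bool = True) -> dict[TDM, list[Result]]:
    """Group extracted results by their TDM triple and rank each group."""
    boards: dict[TDM, list[Result]] = defaultdict(list)
    for tdm, result in extractions:
        boards[tdm].append(result)
    for results in boards.values():
        results.sort(key=lambda r: r.value, reverse=higher_is_better)
    return dict(boards)

# Usage: two hypothetical papers reporting F1 on the same benchmark.
key = TDM(task="question answering", dataset="SQuAD", metric="F1")
boards = build_leaderboards([
    (key, Result("2401.00001", "Model A", 91.2)),
    (key, Result("2402.00002", "Model B", 93.5)),
])
print([r.method for r in boards[key]])  # ['Model B', 'Model A']
```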
Related papers
- Optimizing Instruction Synthesis: Effective Exploration of Evolutionary Space with Tree Search [25.108044778194536]
We introduce IDEA-MCTS (Instruction Data Enhancement using Monte Carlo Tree Search), a scalable framework for efficiently synthesizing instructions.
With tree search and evaluation models, it can efficiently guide each instruction to evolve into a high-quality form, aiding in instruction fine-tuning.
Experimental results show that IDEA-MCTS significantly enhances the seed instruction data, raising the average evaluation scores of quality, diversity, and complexity from 2.19 to 3.81.
arXiv Detail & Related papers (2024-10-14T11:28:30Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present the Language-Assisted Multi-Modal (LAMM) instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark covering a wide range of 2D and 3D vision tasks.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
- Zero-Shot Listwise Document Reranking with a Large Language Model [58.64141622176841]
We propose Listwise Reranker with a Large Language Model (LRL), which achieves strong reranking effectiveness without using any task-specific training data.
Experiments on three TREC web search datasets demonstrate that LRL not only outperforms zero-shot pointwise methods when reranking first-stage retrieval results, but can also act as a final-stage reranker.
arXiv Detail & Related papers (2023-05-03T14:45:34Z)
- Self-Improving-Leaderboard (SIL): A Call for Real-World Centric Natural Language Processing Leaderboards [5.919860270977038]
We argue that evaluation on a given test dataset is just one of many performance indicators of a model.
We propose a new paradigm of leaderboard systems that addresses these issues.
arXiv Detail & Related papers (2023-03-20T06:13:03Z)
- CINS: Comprehensive Instruction for Few-shot Learning in Task-oriented Dialog Systems [56.302581679816775]
This paper proposes Comprehensive Instruction (CINS), which exploits PLMs with task-specific instructions.
We design a schema (definition, constraint, prompt) of instructions and their customized realizations for three important downstream tasks in ToD.
Experiments are conducted on these ToD tasks in realistic few-shot learning scenarios with small validation data.
arXiv Detail & Related papers (2021-09-10T03:23:06Z)
- Unreasonable Effectiveness of Rule-Based Heuristics in Solving Russian SuperGLUE Tasks [2.6189995284654737]
Leader-boards like SuperGLUE are seen as important incentives for active development of NLP.
We show that its test datasets are vulnerable to shallow heuristics.
It is likely (as the simplest explanation) that a significant part of the SOTA models' performance on the RSG leader-board is due to exploiting these shallow heuristics.
arXiv Detail & Related papers (2021-05-03T22:19:22Z)
- GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation [83.10599735938618]
Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository.
This work introduces GENIE, a human evaluation leaderboard that brings the ease of leaderboards to text generation tasks.
arXiv Detail & Related papers (2021-01-17T00:40:47Z)
- Supervision Levels Scale (SLS) [37.944946917484444]
We capture three aspects of supervision that are known to give methods an advantage while requiring additional cost: pre-training, training labels, and training data.
The proposed three-dimensional scale can be included in result tables or leaderboards to handily compare methods not only by their performance, but also by the level of data supervision utilised by each method.
arXiv Detail & Related papers (2020-08-22T18:03:20Z)
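As a rough illustration of the SLS idea of annotating leaderboard entries with their supervision levels, the sketch below attaches the three axes (pre-training, training labels, training data) to each entry and uses them as a tie-breaker; the integer scales and all names here are invented for the example and are not the paper's actual scale definitions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SupervisionLevels:
    """Illustrative three-axis supervision annotation, after the SLS idea.
    The integer scales are invented for this example, not the paper's."""
    pretraining: int      # e.g. 0 = none, 1 = self-supervised, 2 = supervised
    training_labels: int  # e.g. 0 = none, 1 = weak, 2 = full
    training_data: int    # e.g. 0 = none, 1 = target-domain, 2 = external

@dataclass
class Entry:
    method: str
    score: float
    supervision: SupervisionLevels

# Two methods with equal scores become distinguishable by supervision cost.
a = Entry("Method A", 88.0, SupervisionLevels(2, 2, 2))
b = Entry("Method B", 88.0, SupervisionLevels(1, 1, 1))

# Rank by score, breaking ties in favour of less supervision.
ranked = sorted(
    [a, b],
    key=lambda e: (
        -e.score,
        e.supervision.pretraining
        + e.supervision.training_labels
        + e.supervision.training_data,
    ),
)
print([e.method for e in ranked])  # ['Method B', 'Method A']
```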
This list is automatically generated from the titles and abstracts of the papers on this site.