LAG: LLM agents for Leaderboard Auto Generation on Demanding
- URL: http://arxiv.org/abs/2502.18209v1
- Date: Tue, 25 Feb 2025 13:54:03 GMT
- Title: LAG: LLM agents for Leaderboard Auto Generation on Demanding
- Authors: Jian Wu, Jiayu Zhang, Dongyuan Li, Linyi Yang, Aoxiao Zhong, Renhe Jiang, Qingsong Wen, Yue Zhang
- Abstract summary: Leaderboard Auto Generation (LAG) is a framework for the automatic generation of leaderboards on a given research topic. Faced with the large number of AI papers updated daily, researchers find it difficult to track every paper's proposed methods, experimental results, and settings. Our contributions include a comprehensive solution to the leaderboard construction problem, a reliable evaluation method, and experimental results showing the high quality of the generated leaderboards.
- Score: 38.53050861010012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces Leaderboard Auto Generation (LAG), a novel and well-organized framework for the automatic generation of leaderboards on a given research topic in rapidly evolving fields like Artificial Intelligence (AI). With a large number of AI papers appearing daily, it becomes difficult for researchers to track every paper's proposed methods, experimental results, and settings, prompting the need for efficient automatic leaderboard construction. While large language models (LLMs) offer promise in automating this process, challenges such as multi-document summarization, leaderboard generation, and fair experimental comparison remain underexplored. LAG addresses these challenges through a systematic approach involving paper collection, extraction and integration of experimental results, leaderboard generation, and quality evaluation. Our contributions include a comprehensive solution to the leaderboard construction problem, a reliable evaluation method, and experimental results showing the high quality of the generated leaderboards.
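The abstract outlines a four-stage pipeline: paper collection, extraction and integration of experimental results, leaderboard generation, and quality evaluation. As a rough illustration of how such a pipeline could be wired together, here is a minimal sketch; the stage functions, the `ResultRecord` schema, and the dummy data are assumptions made for this sketch, not the paper's actual implementation.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class ResultRecord:
    """One experimental result extracted from a paper (hypothetical schema)."""
    method: str
    dataset: str
    metric: str
    score: float
    source: str  # e.g. an arXiv identifier

def collect_papers(topic: str) -> list[str]:
    """Stage 1: gather candidate paper IDs for the topic (stubbed with dummy IDs)."""
    return ["2502.00001", "2502.00002"]

def extract_results(paper_id: str) -> list[ResultRecord]:
    """Stage 2: extract results from one paper (stubbed; a real system would
    prompt an LLM over the paper's full text)."""
    dummy = {
        "2502.00001": [ResultRecord("MethodA", "SQuAD", "F1", 91.2, "2502.00001")],
        "2502.00002": [ResultRecord("MethodB", "SQuAD", "F1", 93.0, "2502.00002")],
    }
    return dummy.get(paper_id, [])

def build_leaderboard(records: list[ResultRecord]) -> dict:
    """Stage 3: group comparable results by (dataset, metric) and rank by score."""
    boards: dict = defaultdict(list)
    for r in records:
        boards[(r.dataset, r.metric)].append(r)
    return {k: sorted(v, key=lambda r: r.score, reverse=True) for k, v in boards.items()}

def evaluate_quality(boards: dict) -> bool:
    """Stage 4: sanity-check the output (here, just that no board is empty)."""
    return all(boards.values())

if __name__ == "__main__":
    records = [r for pid in collect_papers("question answering") for r in extract_results(pid)]
    boards = build_leaderboard(records)
    assert evaluate_quality(boards)
    for (dataset, metric), ranked in boards.items():
        print(f"Leaderboard for {dataset} ({metric}):")
        for rank, r in enumerate(ranked, 1):
            print(f"  {rank}. {r.method}: {r.score} (from arXiv:{r.source})")
```

In the real system each stage is LLM-driven; this skeleton only shows the data flow between the stages.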
Related papers
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
- Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards [67.65408769829524]
Scientific leaderboards are standardized ranking systems that facilitate evaluating and comparing competitive methods.
The exponential increase in publications has made it infeasible to construct and maintain these leaderboards manually.
Automatic leaderboard construction has emerged as a solution to reduce this manual labor.
arXiv Detail & Related papers (2024-09-19T11:12:27Z)
- AutoSurvey: Large Language Models Can Automatically Write Surveys [77.0458309675818]
This paper introduces AutoSurvey, a speedy and well-organized methodology for automating the creation of comprehensive literature surveys.
Traditional survey paper creation faces challenges due to the vast volume and complexity of information.
Our contributions include a comprehensive solution to the survey problem, a reliable evaluation method, and experimental validation demonstrating AutoSurvey's effectiveness.
arXiv Detail & Related papers (2024-06-10T12:56:06Z)
- Exploring the Latest LLMs for Leaderboard Extraction [0.3072340427031969]
This paper investigates the efficacy of different LLMs (Mistral 7B, Llama, GPT-4-Turbo, and GPT-4.o) in extracting leaderboard information from empirical AI research articles.
Our study evaluates the performance of these models in generating (Task, Dataset, Metric, Score) quadruples from research papers; a toy extraction sketch follows this entry.
arXiv Detail & Related papers (2024-06-06T05:54:45Z)
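As a toy illustration of the quadruple extraction described in the entry above, the sketch below prompts for and parses (Task, Dataset, Metric, Score) tuples from a paper's text. The prompt wording and the canned model response are invented for this example; the papers' actual prompts and parsers will differ.

```python
import json

# A prompt template of the kind such extraction studies use (wording invented here).
PROMPT = (
    "From the paper text below, list every reported result as a JSON array of "
    'objects with keys "task", "dataset", "metric", "score".\n\nPaper text:\n{text}'
)

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; returns a canned response for the demo."""
    return '[{"task": "QA", "dataset": "SQuAD", "metric": "F1", "score": 91.2}]'

def extract_quadruples(paper_text: str) -> list[tuple]:
    """Ask the model for structured results and parse them into quadruples."""
    raw = call_llm(PROMPT.format(text=paper_text))
    return [(o["task"], o["dataset"], o["metric"], float(o["score"]))
            for o in json.loads(raw)]

print(extract_quadruples("...full paper text..."))
# [('QA', 'SQuAD', 'F1', 91.2)]
```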
- MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation [96.71370747681078]
We introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM.
For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs.
We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate.
arXiv Detail & Related papers (2023-10-05T04:06:12Z)
- Self-Improving-Leaderboard (SIL): A Call for Real-World Centric Natural Language Processing Leaderboards [5.919860270977038]
We argue that evaluation on a given test dataset is just one of many indicators of a model's performance.
We propose a new paradigm of leaderboard systems that addresses these issues with current leaderboard systems.
arXiv Detail & Related papers (2023-03-20T06:13:03Z)
- Towards Green Automated Machine Learning: Status Quo and Future Directions [71.86820260846369]
AutoML is being criticised for its high resource consumption.
This paper proposes Green AutoML, a paradigm to make the whole AutoML process more environmentally friendly.
arXiv Detail & Related papers (2021-11-10T18:57:27Z)
- Automated Mining of Leaderboards for Empirical AI Research [0.0]
This study presents a comprehensive approach for generating Leaderboards for knowledge-graph-based scholarly information organization.
Specifically, we investigate the problem of automated Leaderboard construction using state-of-the-art transformer models, viz. BERT, SciBERT, and XLNet.
As a result, a vast share of empirical AI research can be organized in the next-generation digital libraries as knowledge graphs.
arXiv Detail & Related papers (2021-08-31T10:00:52Z)
- GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation [83.10599735938618]
Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository.
This work introduces GENIE, a human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks.
arXiv Detail & Related papers (2021-01-17T00:40:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.