SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories
- URL: http://arxiv.org/abs/2409.07440v1
- Date: Wed, 11 Sep 2024 17:37:48 GMT
- Title: SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories
- Authors: Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, Peter Clark, Ashish Sabharwal, Tushar Khot
- Abstract summary: SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios.
- Score: 55.161075901665946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given that Large Language Models (LLMs) have made significant progress in writing code, can they now be used to autonomously reproduce results from research repositories? Such a capability would be a boon to the research community, helping researchers validate, understand, and extend prior work. To advance towards this goal, we introduce SUPER, the first benchmark designed to evaluate the capability of LLMs in setting up and executing tasks from research repositories. SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories. Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges (e.g., configuring a trainer), and 602 automatically generated problems for larger-scale development. We introduce various evaluation measures to assess both task success and progress, utilizing gold solutions when available or approximations otherwise. We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios. This illustrates the challenge of this task, and suggests that SUPER can serve as a valuable resource for the community to make and measure progress.
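The abstract notes that SUPER scores agents on both full task success and partial progress, using gold solutions when available and approximations otherwise. As a rough illustration of what such measures could look like, here is a minimal sketch assuming a hypothetical problem record with gold metric values and "landmark" sub-goal strings; the field names, tolerance, and log-matching heuristic are illustrative assumptions, not the benchmark's actual scoring code.

```python
"""Minimal sketch of SUPER-style evaluation measures (illustrative assumptions only)."""
from dataclasses import dataclass, field


@dataclass
class Problem:
    # Hypothetical fields: gold metric values from an expert solution and
    # "landmark" sub-goals expected to appear in a successful run.
    gold_metrics: dict[str, float]
    landmarks: list[str] = field(default_factory=list)


def task_success(problem: Problem, submitted: dict[str, float], rel_tol: float = 0.02) -> bool:
    """Count the task as solved if every gold metric is reproduced within a relative tolerance."""
    return all(
        key in submitted and abs(submitted[key] - gold) <= rel_tol * max(abs(gold), 1e-8)
        for key, gold in problem.gold_metrics.items()
    )


def progress_score(problem: Problem, agent_log: str) -> float:
    """Partial-credit proxy: fraction of landmark strings observed in the agent's execution log."""
    if not problem.landmarks:
        return 0.0
    return sum(mark in agent_log for mark in problem.landmarks) / len(problem.landmarks)


if __name__ == "__main__":
    prob = Problem(gold_metrics={"dev_accuracy": 0.843},
                   landmarks=["Installed dependencies", "Training completed"])
    print(task_success(prob, {"dev_accuracy": 0.840}))          # True: within 2% of the gold value
    print(progress_score(prob, "Installed dependencies\n..."))  # 0.5: one of two landmarks reached
```

The key point mirrored here is the split the paper describes: a strict success criterion checked against gold solutions, alongside a softer progress signal for partially completed runs.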
Related papers
- How to Get Your LLM to Generate Challenging Problems for Evaluation [33.625052642068624]
CHASE is a unified framework to synthetically generate challenging problems using Large Language Models.
We implement CHASE to create evaluation benchmarks across three diverse domains.
The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy.
arXiv Detail & Related papers (2025-02-20T16:09:55Z)
- An Empirical Exploration of ChatGPT's Ability to Support Problem Formulation Tasks for Mission Engineering and a Documentation of its Performance Variability [0.0]
This paper explores the quality and consistency of large language models (LLMs) in supporting mission engineering problem formulation tasks.
We identify a relevant reference problem, a NASA space mission design challenge, and document ChatGPT-3.5's ability to perform stakeholder identification tasks.
We find that the LLM performs well in identifying human-focused stakeholders but poorly in recognizing external systems and environmental factors.
arXiv Detail & Related papers (2025-02-05T17:58:23Z)
- MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
The Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z)
- Learning Task Decomposition to Assist Humans in Competitive Programming [90.4846613669734]
We introduce a novel objective for learning task decomposition, termed assistive value (AssistV).
We collect a dataset of human repair experiences on different decomposed solutions.
Under 177 hours of human study, our method enables non-experts to solve 33.3% more problems, speeds them up by 3.3x, and empowers them to match unassisted experts.
arXiv Detail & Related papers (2024-06-07T03:27:51Z)
- PECC: Problem Extraction and Coding Challenges [3.287942619833188]
We introduce PECC, a novel benchmark derived from Advent Of Code (AoC) challenges and Project Euler.
Unlike conventional benchmarks, PECC requires LLMs to interpret narrative-embedded problems, extract requirements, and generate code.
Results show varying model performance between narrative and neutral problems, with specific challenges in the Euler math-based subset.
arXiv Detail & Related papers (2024-04-29T15:02:14Z)
- Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision [98.97575836717931]
Current AI alignment methodologies rely on human-provided demonstrations or judgments.
This raises a challenging research question: How can we keep improving the systems when their capabilities have surpassed the levels of humans?
arXiv Detail & Related papers (2024-03-14T15:12:38Z)
- Effectiveness Assessment of Recent Large Vision-Language Models [78.69439393646554]
This paper endeavors to evaluate the competency of popular large vision-language models (LVLMs) in specialized and general tasks.
We employ six challenging tasks in three different application scenarios: natural, healthcare, and industrial.
We examine the performance of three recent open-source LVLMs, including MiniGPT-v2, LLaVA-1.5, and Shikra, on both visual recognition and localization in these tasks.
arXiv Detail & Related papers (2024-03-07T08:25:27Z)
- Competition-Level Problems are Effective LLM Evaluators [121.15880285283116]
This paper aims to evaluate the reasoning capacities of large language models (LLMs) in solving recent programming problems in Codeforces.
We first provide a comprehensive evaluation of GPT-4's perceived zero-shot performance on this task, considering various aspects such as problems' release time, difficulties, and types of errors encountered.
Surprisingly, the perceived performance of GPT-4 has experienced a cliff-like decline on problems released after September 2021, consistently across all difficulties and types of problems.
arXiv Detail & Related papers (2023-12-04T18:58:57Z)
- Learning to Solve Complex Tasks by Talking to Agents [39.08818632689814]
Humans often solve complex problems by interacting with existing agents, such as AI assistants, that can solve simpler sub-tasks.
Common NLP benchmarks aim for the development of self-sufficient models for every task.
We propose a new benchmark called CommaQA that contains three kinds of complex reasoning tasks designed to be solved by "talking" to four agents with different capabilities.
arXiv Detail & Related papers (2021-10-16T10:37:34Z)