Related papers: ML-Dev-Bench: Comparative Analysis of AI Agents on ML development workflows

URL: http://arxiv.org/abs/2502.00964v3
Date: Wed, 19 Feb 2025 05:09:01 GMT
Title: ML-Dev-Bench: Comparative Analysis of AI Agents on ML development workflows
Authors: Harshith Padigela, Chintan Shah, Dinkar Juyal
Abstract summary: We present ML-Dev-Bench, a benchmark aimed at testing agentic capabilities on applied Machine Learning development tasks. We evaluate three agents - ReAct, Openhands, and AIDE - on a diverse set of 30 tasks. We open source the benchmark for the benefit of the community.
Score: 1.3654846342364308
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this report, we present ML-Dev-Bench, a benchmark aimed at testing agentic capabilities on applied Machine Learning development tasks. While existing benchmarks focus on isolated coding tasks or Kaggle-style competitions, ML-Dev-Bench tests agents' ability to handle the full complexity of ML development workflows. The benchmark assesses performance across critical aspects including dataset handling, model training, improving existing models, debugging, and API integration with popular ML tools. We evaluate three agents - ReAct, Openhands, and AIDE - on a diverse set of 30 tasks, providing insights into their strengths and limitations in handling practical ML development challenges. We open source the benchmark for the benefit of the community at https://github.com/ml-dev-bench/ml-dev-bench.

Related papers

A Framework for Testing and Adapting REST APIs as LLM Tools [5.758488787763118]
We present a novel testing framework aimed at evaluating and enhancing the readiness of REST APIs to function as tools for agents. Our framework transforms APIs into tools, generates comprehensive test cases for the APIs, translates the test cases into natural language instructions, and evaluates the agent's ability to correctly invoke the API and process its inputs and responses.
arXiv Detail & Related papers (2025-04-22T02:52:08Z)
EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents [57.4686961979566]
EmbodiedEval is a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks. It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity. We evaluated the state-of-the-art MLLMs on EmbodiedEval and found that they have a significant shortfall compared to human level on embodied tasks.
arXiv Detail & Related papers (2025-01-21T03:22:10Z)
Large Language Models for Constructing and Optimizing Machine Learning Workflows: A Survey [4.917456871628609]
Building effective machine learning (ML) workflows to address complex tasks is a primary focus of the Automatic ML (AutoML) community. Recently, the integration of Large Language Models (LLMs) into ML has shown great potential for automating and enhancing various stages of the ML pipeline.
arXiv Detail & Related papers (2024-11-11T21:54:26Z)
Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study [72.24266814625685]
We explore the performance of large language models (LLMs) across the entire software development lifecycle with DevEval. DevEval features four programming languages, multiple domains, high-quality data collection, and carefully designed and verified metrics for each task. Empirical studies show that current LLMs, including GPT-4, fail to solve the challenges presented within DevEval.
arXiv Detail & Related papers (2024-03-13T15:13:44Z)
TaskBench: Benchmarking Large Language Models for Task Automation [82.2932794189585]
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation. Specifically, task decomposition, tool selection, and parameter prediction are assessed. Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
arXiv Detail & Related papers (2023-11-30T18:02:44Z)
SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs. SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions. We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation [96.71370747681078]
We introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM. For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs. We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate.
arXiv Detail & Related papers (2023-10-05T04:06:12Z)
AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)
Reasonable Scale Machine Learning with Open-Source Metaflow [2.637746074346334]
We argue that re-purposing existing tools won't solve the current productivity issues. We introduce Metaflow, an open-source framework for ML projects explicitly designed to boost the productivity of data practitioners.
arXiv Detail & Related papers (2023-03-21T11:28:09Z)
Operationalizing Machine Learning: An Interview Study [13.300075655862573]
We conduct semi-structured interviews with 18 machine learning engineers (MLEs) working across many applications. Our interviews expose three variables that govern success for a production ML deployment: Velocity, Validation, and Versioning. We summarize common practices for successful ML experimentation, deployment, and sustaining production performance.
arXiv Detail & Related papers (2022-09-16T16:59:36Z)
MLModelScope: A Distributed Platform for Model Evaluation and Benchmarking at Scale [32.62513495487506]
Machine Learning (ML) and Deep Learning (DL) innovations are being introduced at such a rapid pace that researchers are hard-pressed to analyze and study them. The complicated procedures for evaluating innovations, along with the lack of standard and efficient ways of specifying and provisioning ML/DL evaluation, are a major "pain point" for the community. This paper proposes MLModelScope, an open-source, framework/hardware agnostic, and customizable design that enables repeatable, fair, and scalable model evaluation and benchmarking.
arXiv Detail & Related papers (2020-02-19T17:13:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.