Towards Adaptive ML Benchmarks: Web-Agent-Driven Construction, Domain Expansion, and Metric Optimization
- URL: http://arxiv.org/abs/2509.09321v1
- Date: Thu, 11 Sep 2025 10:10:48 GMT
- Title: Towards Adaptive ML Benchmarks: Web-Agent-Driven Construction, Domain Expansion, and Metric Optimization
- Authors: Hangyi Jia, Yuxi Qian, Hanwen Tong, Xinhui Wu, Lin Chen, Feng Wei,
- Abstract summary: TAM Bench is a benchmark for evaluating large language models (LLMs) on end-to-end machine learning tasks. Its three key innovations are a browser automation and LLM-based task acquisition system, leaderboard-driven difficulty modeling, and a multi-dimensional evaluation framework. Based on 150 curated AutoML tasks, we construct three benchmark subsets of different sizes.
- Score: 8.356074728041202
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large language models (LLMs) have enabled the emergence of general-purpose agents for automating end-to-end machine learning (ML) workflows, including data analysis, feature engineering, model training, and competition solving. However, existing benchmarks remain limited in task coverage, domain diversity, difficulty modeling, and evaluation rigor, failing to capture the full capabilities of such agents in realistic settings. We present TAM Bench, a diverse, realistic, and structured benchmark for evaluating LLM-based agents on end-to-end ML tasks. TAM Bench features three key innovations: (1) A browser automation and LLM-based task acquisition system that automatically collects and structures ML challenges from platforms such as Kaggle, AIcrowd, and Biendata, spanning multiple task types and data modalities (e.g., tabular, text, image, graph, audio); (2) A leaderboard-driven difficulty modeling mechanism that estimates task complexity using participant counts and score dispersion, enabling scalable and objective task calibration; (3) A multi-dimensional evaluation framework incorporating performance, format compliance, constraint adherence, and task generalization. Based on 150 curated AutoML tasks, we construct three benchmark subsets of different sizes -- Lite, Medium, and Full -- designed for varying evaluation scenarios. The Lite version, with 18 tasks and balanced coverage across modalities and difficulty levels, serves as a practical testbed for daily benchmarking and comparative studies.
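The abstract describes the leaderboard-driven difficulty modeling only at a high level (participant counts plus score dispersion). The sketch below is a minimal illustration of how such a signal could be computed from a public leaderboard; the 50/50 weighting, the normalization, and the function name `estimate_difficulty` are assumptions made for illustration, not the formula used by TAM Bench.

```python
import statistics

def estimate_difficulty(participant_count: int,
                        leaderboard_scores: list[float],
                        max_participants: int = 10_000) -> float:
    """Illustrative difficulty estimate from public leaderboard statistics.

    Combines (a) how many teams attempted the task and (b) how tightly
    the submitted scores cluster. The equal weighting below is a
    placeholder assumption, not TAM Bench's actual formula.
    """
    # Popularity term: more participants -> more competitive task.
    popularity = min(participant_count / max_participants, 1.0)

    # Dispersion term: tightly clustered scores suggest little headroom,
    # i.e. it is hard to stand out on this task.
    spread = statistics.pstdev(leaderboard_scores)
    score_range = (max(leaderboard_scores) - min(leaderboard_scores)) or 1.0
    dispersion = 1.0 - min(spread / score_range, 1.0)

    # Blend the two signals into a single [0, 1] difficulty score.
    return 0.5 * popularity + 0.5 * dispersion

if __name__ == "__main__":
    # Hypothetical Kaggle-style leaderboard snapshot.
    scores = [0.912, 0.910, 0.909, 0.905, 0.871, 0.850]
    print(f"difficulty ~= {estimate_difficulty(3200, scores):.2f}")
```

The intuition in this sketch is that a crowded leaderboard whose top scores are tightly packed leaves little headroom; the paper itself may combine the two signals differently.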
Related papers
- ML-Tool-Bench: Tool-Augmented Planning for ML Tasks [23.54937738755734]
We introduce a benchmark for evaluating tool-augmented machine learning agents. Our benchmark goes beyond traditional tool-use evaluation by incorporating in-memory management of named objects. Our approach improves over ReAct by 16.52 percentile positions, taking the median across all Kaggle challenges.
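The "percentile positions" figure above is reported without a formula; the minimal sketch below shows one plausible reading, in which each agent's leaderboard rank is converted to a percentile position and the per-challenge improvement over ReAct is aggregated with a median. All ranks and leaderboard sizes here are hypothetical, and `percentile_position` is an assumed helper, not an ML-Tool-Bench API.

```python
from statistics import median

def percentile_position(rank: int, num_entries: int) -> float:
    """Convert a leaderboard rank (1 = best) into a percentile position."""
    return 100.0 * (num_entries - rank) / num_entries

# Hypothetical (rank, leaderboard size) pairs for each Kaggle challenge.
ours  = [(120, 2000), (35, 900), (410, 5000)]
react = [(600, 2000), (210, 900), (1300, 5000)]

# Per-challenge gain in percentile positions, then the median across challenges.
gains = [percentile_position(r_ours, n) - percentile_position(r_react, n)
         for (r_ours, n), (r_react, _) in zip(ours, react)]
print(f"median improvement: {median(gains):.2f} percentile positions")
```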
arXiv Detail & Related papers (2025-11-29T23:59:40Z)
- MSCoRe: A Benchmark for Multi-Stage Collaborative Reasoning in LLM Agents [7.339769470891067]
MSCoRe is a novel benchmark comprising 126,696 domain-specific QA instances spanning scenarios in the automotive, pharmaceutical, electronics, and energy sectors. Commercial models performed best across all tasks and scenarios, but a notable gap in ROUGE scores remains between simple and complex tasks. MSCoRe provides a valuable new resource for the community to evaluate and improve multi-stage reasoning in LLM agents.
arXiv Detail & Related papers (2025-09-22T11:36:16Z)
- Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling [83.78874399606379]
We propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling. It comprises four distinct small-scale agents with clearly defined roles and effective collaboration. It shows superior performance at a smaller parameter scale without sacrificing ability on general and mathematical tasks.
arXiv Detail & Related papers (2025-08-05T12:52:09Z)
- EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models [64.70546873396624]
We present the Extremely Complex Instruction Following Benchmark (EIFBENCH) for evaluating large language models (LLMs). EIFBENCH includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently. We also propose the Segment Policy Optimization (SegPO) algorithm to enhance the LLM's ability to accurately fulfill multi-task workflows.
arXiv Detail & Related papers (2025-06-10T02:39:55Z)
- EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments [11.97783742296183]
Embodied Mobile Manipulation in Open Environments (EMMOE) is a benchmark that requires agents to interpret user instructions and execute long-horizon everyday tasks in continuous space. It seamlessly integrates high-level and low-level embodied tasks into a unified framework, along with three new metrics for more diverse assessment. We also design an agent system consisting of an LLM with Direct Preference Optimization (DPO), lightweight navigation and manipulation models, and multiple error detection mechanisms.
arXiv Detail & Related papers (2025-03-11T16:42:36Z)
- Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks. However, they still struggle with problems requiring multi-step decision-making and environmental feedback. We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
- ML-Dev-Bench: Comparative Analysis of AI Agents on ML development workflows [1.3654846342364308]
We present ML-Dev-Bench, a benchmark aimed at testing agentic capabilities on applied Machine Learning development tasks. We evaluate three agents - ReAct, Openhands, and AIDE - on a diverse set of 30 tasks. We open source the benchmark for the benefit of the community.
arXiv Detail & Related papers (2025-02-03T00:04:49Z)
- EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents [57.4686961979566]
EmbodiedEval is a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks. It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity. We evaluated state-of-the-art MLLMs on EmbodiedEval and found that they fall significantly short of human-level performance on embodied tasks.
arXiv Detail & Related papers (2025-01-21T03:22:10Z)
- MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
The Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z)
- MAML-en-LLM: Model Agnostic Meta-Training of LLMs for Improved In-Context Learning [43.512739869120125]
We propose MAML-en-LLM, a novel method for meta-training large language models (LLMs).
MAML-en-LLM can learn truly generalizable parameters that not only perform well on disjointed tasks but also adapt to unseen tasks.
We demonstrate that MAML-en-LLM outperforms baselines in settings with limited amounts of training data on both seen and unseen domains.
arXiv Detail & Related papers (2024-05-19T04:49:42Z)
- TaskBench: Benchmarking Large Language Models for Task Automation [82.2932794189585]
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation.
Specifically, task decomposition, tool selection, and parameter prediction are assessed.
Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
arXiv Detail & Related papers (2023-11-30T18:02:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences arising from its use.