Fugu-MT Paper Translation (Abstract): Towards Adaptive ML Benchmarks: Web-Agent-Driven Construction, Domain Expansion, and Metric Optimization

Paper overview: Towards Adaptive ML Benchmarks: Web-Agent-Driven Construction, Domain Expansion, and Metric Optimization

arxiv url: http://arxiv.org/abs/2509.09321v1
Date: Thu, 11 Sep 2025 10:10:48 GMT
Status: translation complete
Last updated in system: 2025-09-12 16:52:24.333138
Title: Towards Adaptive ML Benchmarks: Web-Agent-Driven Construction, Domain Expansion, and Metric Optimization
Title (reference translation): Towards Adaptive ML Benchmarks: Web-Agent-Driven Construction, Domain Expansion, and Metric Optimization
Authors: Hangyi Jia, Yuxi Qian, Hanwen Tong, Xinhui Wu, Lin Chen, Feng Wei,
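Abstract summary: TAM Bench is a benchmark for evaluating large language model (LLM)-based agents on end-to-end machine learning tasks. Its three key innovations are a browser automation and LLM-based task acquisition system, a leaderboard-driven difficulty modeling mechanism, and a multi-dimensional evaluation framework. Based on 150 curated AutoML tasks, three benchmark subsets of different sizes are constructed.
Reference score (independently computed attention score): 8.356074728041202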
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in large language models (LLMs) have enabled the emergence of general-purpose agents for automating end-to-end machine learning (ML) workflows, including data analysis, feature engineering, model training, and competition solving. However, existing benchmarks remain limited in task coverage, domain diversity, difficulty modeling, and evaluation rigor, failing to capture the full capabilities of such agents in realistic settings. We present TAM Bench, a diverse, realistic, and structured benchmark for evaluating LLM-based agents on end-to-end ML tasks. TAM Bench features three key innovations: (1) A browser automation and LLM-based task acquisition system that automatically collects and structures ML challenges from platforms such as Kaggle, AIcrowd, and Biendata, spanning multiple task types and data modalities (e.g., tabular, text, image, graph, audio); (2) A leaderboard-driven difficulty modeling mechanism that estimates task complexity using participant counts and score dispersion, enabling scalable and objective task calibration; (3) A multi-dimensional evaluation framework incorporating performance, format compliance, constraint adherence, and task generalization. Based on 150 curated AutoML tasks, we construct three benchmark subsets of different sizes -- Lite, Medium, and Full -- designed for varying evaluation scenarios. The Lite version, with 18 tasks and balanced coverage across modalities and difficulty levels, serves as a practical testbed for daily benchmarking and comparative studies.
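Abstract (reference translation): Recent advances in large language models (LLMs) have enabled the emergence of general-purpose agents that automate end-to-end machine learning (ML) workflows, including data analysis, feature engineering, model training, and competition solving. However, existing benchmarks remain limited in task coverage, domain diversity, difficulty modeling, and evaluation rigor, and so fail to capture the full capabilities of such agents in realistic settings. We present TAM Bench, a diverse, realistic, and structured benchmark for evaluating LLM-based agents on end-to-end ML tasks. TAM Bench features three key innovations: (1) a browser automation and LLM-based task acquisition system that automatically collects and structures ML challenges from platforms such as Kaggle, AIcrowd, and Biendata, spanning multiple task types and data modalities (e.g., tabular, text, image, graph, audio); (2) a leaderboard-driven difficulty modeling mechanism that estimates task complexity from participant counts and score dispersion; (3) a multi-dimensional evaluation framework covering performance, format compliance, constraint adherence, and task generalization. Based on 150 curated AutoML tasks, we construct three benchmark subsets of different sizes (Lite, Medium, and Full) designed for varying evaluation scenarios. The Lite version, with 18 tasks and balanced coverage across modalities and difficulty levels, serves as a practical testbed for daily benchmarking and comparative studies.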
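
The abstract says only that task difficulty is estimated from participant counts and score dispersion; it does not give a formula. The following is a minimal illustrative sketch under stated assumptions, not the paper's method: the function name `difficulty_score`, the use of the coefficient of variation as the dispersion measure, and the log-scaled participation weight are all hypothetical choices made for this example.

```python
import math
import statistics
from typing import Sequence


def difficulty_score(participants: int, scores: Sequence[float]) -> float:
    """Hypothetical difficulty heuristic (an assumption, not TAM Bench's formula).

    Combines two leaderboard signals named in the abstract:
    - score dispersion, measured here as the coefficient of variation
      (population standard deviation divided by the absolute mean);
    - participant count, log-scaled so very large competitions do not dominate.
    """
    if participants < 0:
        raise ValueError("participants must be non-negative")
    if not scores:
        raise ValueError("scores must not be empty")

    mean = statistics.fmean(scores)
    if mean == 0:
        return 0.0  # dispersion relative to a zero mean is undefined; treat as 0

    dispersion = statistics.pstdev(scores) / abs(mean)
    # Assumption: widely spread scores on a well-attended leaderboard
    # indicate a task that is hard to solve consistently.
    return dispersion * math.log1p(participants)


if __name__ == "__main__":
    tight = difficulty_score(100, [0.90, 0.91, 0.92])
    wide = difficulty_score(100, [0.50, 0.70, 0.90])
    print(f"tight leaderboard: {tight:.4f}")
    print(f"wide leaderboard:  {wide:.4f}")
```

Under these assumptions, a leaderboard with widely spread scores ranks as harder than one with tightly clustered scores at the same participant count. Any real calibration would need the definitions given in the paper itself.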


Related papers