Fugu-MT Paper Translation (Abstract): ML-Tool-Bench: Tool-Augmented Planning for ML Tasks

Paper summary: ML-Tool-Bench: Tool-Augmented Planning for ML Tasks

arxiv url: http://arxiv.org/abs/2512.00672v1
Date: Sat, 29 Nov 2025 23:59:40 GMT
Status: Translation complete
System update date: 2025-12-02 19:46:34.351951
Title: ML-Tool-Bench: Tool-Augmented Planning for ML Tasks
Title (reference translation): ML-Tool-Bench: Tool-Augmented Planning for ML Tasks
Authors: Yaswanth Chittepu, Raghavendra Addanki, Tung Mai, Anup Rao, Branislav Kveton
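Abstract summary: We introduce a benchmark for evaluating tool-augmented machine learning agents. Our benchmark goes beyond traditional tool-use evaluation by incorporating in-memory named object management. Our approach improves over ReAct by 16.52 percentile positions, taking the median across all Kaggle challenges.
Reference score (attention score computed independently by this site): 23.54937738755734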
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The development of autonomous machine learning (ML) agents capable of end-to-end data science workflows represents a significant frontier in artificial intelligence. These agents must orchestrate complex sequences of data analysis, feature engineering, model selection, and hyperparameter optimization, tasks that require sophisticated planning and iteration. While recent work on building ML agents has explored using large language models (LLMs) for direct code generation, tool-augmented approaches offer greater modularity and reliability. However, existing tool-use benchmarks focus primarily on task-specific tool selection or argument extraction for tool invocation, failing to evaluate the sophisticated planning capabilities required for ML agents. In this work, we introduce a comprehensive benchmark for evaluating tool-augmented ML agents using a curated set of 61 specialized tools and 15 tabular ML challenges from Kaggle. Our benchmark goes beyond traditional tool-use evaluation by incorporating in-memory named object management, allowing agents to flexibly name, save, and retrieve intermediate results throughout the workflows. We demonstrate that standard ReAct-style approaches struggle to generate valid tool sequences for complex ML pipelines, and that tree search methods with LLM-based evaluation underperform due to inconsistent state scoring. To address these limitations, we propose two simple approaches: 1) using shaped deterministic rewards with structured textual feedback, and 2) decomposing the original problem into a sequence of sub-tasks, which significantly improves trajectory validity and task performance. Using GPT-4o, our approach improves over ReAct by 16.52 percentile positions, taking the median across all Kaggle challenges. We believe our work provides a foundation for developing more capable tool-augmented planning ML agents.
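Abstract (reference translation): The development of autonomous machine learning (ML) agents capable of end-to-end data science workflows is an important frontier in artificial intelligence. These agents must orchestrate complex sequences of data analysis, feature engineering, model selection, and hyperparameter optimization, tasks that require sophisticated planning and iteration. Recent work on building ML agents has explored using large language models (LLMs) for direct code generation, but tool-augmented approaches offer greater modularity and reliability. However, existing tool-use benchmarks focus mainly on task-specific tool selection or argument extraction for tool invocation, and do not evaluate the sophisticated planning capabilities that ML agents require. In this work, we introduce a comprehensive benchmark for evaluating tool-augmented ML agents, using a curated set of 61 specialized tools and 15 tabular ML challenges from Kaggle. Our benchmark goes beyond traditional tool-use evaluation by incorporating in-memory named object management, which lets agents flexibly name, save, and retrieve intermediate results throughout the workflow. We show that standard ReAct-style approaches struggle to generate valid tool sequences for complex ML pipelines, and that tree search methods with LLM-based evaluation underperform because of inconsistent state scoring. To address these limitations, we propose two simple approaches: 1) using shaped deterministic rewards with structured textual feedback, and 2) decomposing the original problem into a sequence of sub-tasks, which significantly improves trajectory validity and task performance. Using GPT-4o, our approach improves over ReAct by 16.52 percentile positions, taking the median across all Kaggle challenges. We believe our work provides a foundation for developing more capable tool-augmented planning ML agents.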
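
Illustrative sketch (not from the paper): the abstract describes in-memory named object management, where an agent names, saves, and retrieves intermediate results across tool calls. The Python below is one minimal way such a scratchpad could look. Every name in it (`NamedObjectStore`, `call_tool`, `drop_missing`, `column_mean`) and the toy data are hypothetical and are not taken from the benchmark's implementation.

```python
from typing import Any, Callable, Dict, List


class NamedObjectStore:
    """Hypothetical scratchpad mapping agent-chosen names to Python objects."""

    def __init__(self) -> None:
        self._objects: Dict[str, Any] = {}

    def save(self, name: str, value: Any) -> None:
        # Saving under an existing name overwrites the previous object.
        self._objects[name] = value

    def get(self, name: str) -> Any:
        if name not in self._objects:
            raise KeyError(f"no object named {name!r}; available: {sorted(self._objects)}")
        return self._objects[name]


def call_tool(
    store: NamedObjectStore,
    tool: Callable[..., Any],
    input_names: List[str],
    output_name: str,
    **literal_args: Any,
) -> None:
    """Resolve named inputs, run the tool, and save the result under a new name."""
    inputs = [store.get(name) for name in input_names]
    store.save(output_name, tool(*inputs, **literal_args))


# Two toy "tools" (assumed for illustration only).
def drop_missing(rows: List[dict]) -> List[dict]:
    return [row for row in rows if None not in row.values()]


def column_mean(rows: List[dict], column: str) -> float:
    return sum(row[column] for row in rows) / len(rows)


store = NamedObjectStore()
store.save("raw", [{"age": 30}, {"age": None}, {"age": 50}])
call_tool(store, drop_missing, ["raw"], "clean")
call_tool(store, column_mean, ["clean"], "mean_age", column="age")
print(store.get("mean_age"))  # 40.0
```

In this sketch a tool call refers to earlier results by name only, so a planner can chain steps without passing the underlying objects through its own context.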

Related papers