Fugu-MT Paper Translation (Abstract): FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

Paper abstract: FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

arxiv url: http://arxiv.org/abs/2605.17373v1
Date: Sun, 17 May 2026 10:30:38 GMT
Status: Translation complete
System update date: 2026-05-19 23:51:08.365852
Title: FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics
Title (reference translation): FML-bench: A controlled study of AI research agent strategies from the perspective of search dynamics
Authors: Qiran Zou, Hou Hei Lam, Wenhao Zhao, Tingting Chen, Yiming Tang, Samson Yu, Yingtao Zhu, Srinivas Anumasa, Zufeng Zhang, Tianyi Zhang, Chang Liu, Zhengyao Jiang, Anirudh Goyal, Dianbo Liu
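Abstract summary: FML-Bench is a benchmark of 18 fundamental ML research tasks across 10 domains. It separates agent strategy from execution infrastructure and defines 12 process-level behavioral metrics. Greedy search tends to be more effective when improvement opportunities are dense, while tree search and evolutionary strategies tend to be more effective when they are sparse.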
Reference score (independently computed attention score): 24.125726163497742
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: AI research agents accelerate ML research by automating hypothesis generation, experimentation, and empirical refinement. Existing agent strategies range from greedy hill-climbing to tree search and evolutionary optimization, yet which strategy choices drive performance remains unclear. Answering this question requires a benchmark that separates agent strategy (e.g., search topology) from execution infrastructure (e.g., code editor), so that performance differences are attributable to strategy rather than infrastructure, and that provides process-level metrics beyond final scores to analyze exploration behaviors. Existing benchmarks offer limited support. We propose FML-Bench, a benchmark of 18 fundamental ML research tasks across 10 domains that separates agent strategy from execution infrastructure and defines 12 process-level behavioral metrics. Evaluating six representative agents, we find that: (1) strategy complexity alone does not guarantee strong performance: a simple greedy hill-climber nearly matches the best-performing tree-search agent, both well above the remaining agents; (2) our analysis suggests this pattern relates to improvement opportunity structure: greedy search tends to be more effective when opportunities are dense, while tree-search and evolutionary strategies tend to be more effective when opportunities are sparse; an adaptive agent built on this insight switches to broader exploration upon detecting improvement stagnation and outperforms the other six agents, lending initial support to this observation; and (3) process-level analysis reveals that early convergence and directionally focused exploration are significantly associated with final performance, while solution diversity and compute cost are not. Our benchmark is available at: https://github.com/qrzou/FML-bench.
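Abstract (reference translation): AI research agents accelerate ML research by automating hypothesis generation, experimentation, and empirical refinement. Existing agent strategies range from greedy hill-climbing to tree search and evolutionary optimization, but it remains unclear which strategy choices drive performance. Answering this question requires a benchmark that separates agent strategy (e.g., search topology) from execution infrastructure (e.g., code editor), so that performance differences can be attributed to strategy rather than infrastructure, and that provides process-level metrics beyond final scores for analyzing exploration behavior. Existing benchmarks offer only limited support. FML-Bench is a benchmark of 18 fundamental ML research tasks across 10 domains that separates agent strategy from execution infrastructure and defines 12 process-level behavioral metrics. Evaluating six representative agents shows that: (1) strategy complexity alone does not guarantee strong performance: a simple greedy hill-climber nearly matches the best-performing tree-search agent, and both are well above the remaining agents; (2) the analysis suggests this pattern relates to the structure of improvement opportunities: greedy search tends to be more effective when opportunities are dense, while tree-search and evolutionary strategies tend to be more effective when opportunities are sparse; an adaptive agent built on this insight switches to broader exploration upon detecting improvement stagnation and outperforms the other six agents, lending initial support to this observation; and (3) process-level analysis reveals that early convergence and directionally focused exploration are significantly associated with final performance, while solution diversity and compute cost are not. The benchmark is available at https://github.com/qrzou/FML-bench.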
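
Illustrative sketch (not from the paper): the abstract describes an adaptive agent that switches to broader exploration upon detecting improvement stagnation. The toy Python below shows one way such a switch could look on a one-dimensional objective. Everything in it is an assumption made for illustration: the function name `adaptive_search`, the `patience` threshold, the `narrow_step` and `wide_step` sizes, and the random-perturbation proposal are all hypothetical and are not taken from FML-Bench or its agents.

```python
import random
from typing import Callable, List, Tuple


def adaptive_search(
    score: Callable[[float], float],
    start: float,
    steps: int = 200,
    patience: int = 5,        # hypothetical stagnation threshold
    narrow_step: float = 0.1,  # hypothetical "greedy" proposal radius
    wide_step: float = 2.0,    # hypothetical "broad exploration" radius
    seed: int = 0,
) -> Tuple[float, float, List[str]]:
    """Toy hill-climber that widens its proposals after repeated failures.

    Returns the best point, its score, and the mode used at each step.
    """
    rng = random.Random(seed)
    best_x, best_score = start, score(start)
    stagnant = 0  # consecutive proposals that did not improve the best score
    modes: List[str] = []

    for _ in range(steps):
        # Switch to broad exploration once improvement has stalled.
        exploring = stagnant >= patience
        step = wide_step if exploring else narrow_step
        modes.append("explore" if exploring else "greedy")

        candidate = best_x + rng.uniform(-step, step)
        candidate_score = score(candidate)

        if candidate_score > best_score:
            # Accept only strict improvements, then return to greedy mode.
            best_x, best_score = candidate, candidate_score
            stagnant = 0
        else:
            stagnant += 1

    return best_x, best_score, modes


if __name__ == "__main__":
    # Single-peak toy objective with its maximum at x = 3.
    best_x, best_score, modes = adaptive_search(lambda x: -abs(x - 3.0), start=0.0)
    print(best_x, best_score, modes.count("explore"))
```

Because only strict improvements are accepted, the returned score is never worse than the starting score. The `modes` list records when the search left greedy mode, which is a crude stand-in for the kind of process-level behavior the benchmark measures.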

Related papers