Fugu-MT Paper Translation (Abstract): FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth

Paper summary: FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth

arxiv url: http://arxiv.org/abs/2510.10472v1
Date: Sun, 12 Oct 2025 06:41:05 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:29.960693
Title: FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth
Title (reference translation): FML-bench: A Benchmark for Automatic ML Research Agents That Emphasizes the Importance of Exploration Breadth
Authors: Qiran Zou, Hou Hei Lam, Wenhao Zhao, Yiming Tang, Tingting Chen, Samson Yu, Tianyi Zhang, Chang Liu, Xiangyang Ji, Dianbo Liu
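Abstract summary: Large language models (LLMs) have sparked growing interest in automatic machine learning research agents. Existing benchmarks tend to overemphasize engineering aspects while neglecting academic rigor. FML-bench is a benchmark designed to evaluate automatic machine learning research agents on 8 diverse and fundamental machine learning research problems.
Reference score (independently computed attention score): 43.606494515048524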
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have sparked growing interest in automatic machine learning research agents. Among them, agents capable of autonomously proposing ideas and conducting machine learning experiments are particularly promising, as they maximize research automation and accelerate scientific progress by iteratively refining ideas based on experimental results. However, comprehensively evaluating such agents remains challenging. Existing benchmarks tend to overemphasize engineering aspects while neglecting academic rigor, creating barriers that obscure a clear assessment of an agent's scientific capabilities in machine learning research. They also suffer from limited task diversity, an overemphasis on application-oriented tasks over fundamental research problems, and limited scalability to realistic research settings. To address these limitations, we introduce FML-bench, a benchmark designed to evaluate automatic machine learning research agents on 8 diverse and fundamental machine learning research problems. It reduces coding burden, emphasizes fundamental problems rather than specific use cases, offers high task diversity, and is extensible to real-world machine learning GitHub repositories. Furthermore, we present a unified evaluation framework with five complementary metrics, designed to comprehensively assess agent performance on our benchmark. We evaluate state-of-the-art automatic research agents on FML-bench, and find that agents employing broad research exploration strategies outperform those focusing on narrow but deep exploration. These findings suggest that emphasizing the breadth of exploration may lead to more effective research outcomes than focusing solely on incremental refinement. Our benchmark is available at https://github.com/qrzou/FML-bench.
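Abstract (reference translation): Large language models (LLMs) have driven growing interest in automatic machine learning research agents. Agents that autonomously propose ideas and run machine learning experiments are especially promising, because they maximize research automation and accelerate scientific progress by iteratively refining ideas based on experimental results. Evaluating such agents comprehensively, however, remains difficult. Existing benchmarks tend to overemphasize engineering aspects while neglecting academic rigor, which creates barriers to a clear assessment of an agent's scientific capabilities in machine learning research. They also suffer from limited task diversity, an overemphasis on application-oriented tasks over fundamental research problems, and limited scalability to realistic research settings. To address these limitations, the authors introduce FML-bench, a benchmark designed to evaluate automatic machine learning research agents on 8 diverse and fundamental machine learning research problems. It reduces the coding burden, emphasizes fundamental problems over specific use cases, offers high task diversity, and can be extended to real-world machine learning GitHub repositories. The authors further present a unified evaluation framework with five complementary metrics for comprehensively assessing agent performance on the benchmark. They evaluate state-of-the-art automatic research agents on FML-bench and find that agents employing broad exploration strategies outperform those focused on narrow but deep exploration. These results suggest that emphasizing the breadth of exploration may lead to more effective research outcomes than focusing solely on incremental refinement. The benchmark is available at https://github.com/qrzou/FML-bench.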


Related papers