Fugu-MT Paper Translation (Abstract): Reinforcement Learning for Machine Learning Engineering Agents

Paper overview: Reinforcement Learning for Machine Learning Engineering Agents

arxiv url: http://arxiv.org/abs/2509.01684v1
Date: Mon, 01 Sep 2025 18:04:10 GMT
Status: Translation complete
Last updated in system: 2025-09-04 15:17:03.813834
Title: Reinforcement Learning for Machine Learning Engineering Agents
Title (reference translation): Reinforcement Learning for Machine Learning Engineering Agents
Authors: Sherry Yang, Joy He-Yueya, Percy Liang
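Abstract summary: We show that agents backed by weaker models that improve via reinforcement learning can outperform agents backed by much larger but static models. We propose duration-aware gradient updates in a distributed asynchronous RL framework to amplify high-cost but high-reward actions. We also propose environment instrumentation that offers partial credit by distinguishing programs that fail early from those that are almost correct.
Reference score (independently calculated attention metric): 52.03168614623642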
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing agents for solving tasks such as ML engineering rely on prompting powerful language models. As a result, these agents do not improve with more experience. In this paper, we show that agents backed by weaker models that improve via reinforcement learning (RL) can outperform agents backed by much larger, but static models. We identify two major challenges with RL in this setting. First, actions can take a variable amount of time (e.g., executing code for different solutions), which leads to asynchronous policy gradient updates that favor faster but suboptimal solutions. To tackle variable-duration actions, we propose duration-aware gradient updates in a distributed asynchronous RL framework to amplify high-cost but high-reward actions. Second, using only test split performance as a reward provides limited feedback. A program that is nearly correct is treated the same as one that fails entirely. To address this, we propose environment instrumentation to offer partial credit, distinguishing almost-correct programs from those that fail early (e.g., during data loading). Environment instrumentation uses a separate static language model to insert print statements into an existing program to log the agent's experimental progress, from which partial credit can be extracted as reward signals for learning. Our experimental results on MLEBench suggest that performing gradient updates on a much smaller model (Qwen2.5-3B) trained with RL outperforms prompting a much larger model (Claude-3.5-Sonnet) with agent scaffolds, by an average of 22% across 12 Kaggle tasks.
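The duration-aware update mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula, so the weighting below (each sample's duration divided by the batch mean duration) is an assumption chosen for illustration, not the authors' method. The `Sample` record, its field names, and the scalar stand-in for the log-probability gradient are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One completed action (hypothetical record layout)."""
    log_prob_grad: float  # scalar stand-in for the gradient of log pi(a|s)
    advantage: float      # advantage estimate for the action
    duration_s: float     # wall-clock execution time of the action, in seconds


def duration_weights(samples: List[Sample]) -> List[float]:
    """Assumed weighting: duration relative to the batch mean duration."""
    if not samples:
        return []
    mean_duration = sum(s.duration_s for s in samples) / len(samples)
    if mean_duration <= 0:
        raise ValueError("durations must be positive")
    return [s.duration_s / mean_duration for s in samples]


def duration_aware_gradient(samples: List[Sample]) -> float:
    """Average policy-gradient term with each sample scaled by its weight."""
    if not samples:
        return 0.0
    weights = duration_weights(samples)
    total = sum(w * s.advantage * s.log_prob_grad
                for w, s in zip(weights, samples))
    return total / len(samples)


# Usage: a 1-second action and a 3-second action with equal advantage.
batch = [Sample(1.0, 1.0, 1.0), Sample(1.0, 1.0, 3.0)]
weights = duration_weights(batch)
gradient = duration_aware_gradient(batch)
```

In an asynchronous setup, fast actions complete more often and so appear more often among the collected samples. Scaling each contribution by its duration is one way to keep slow, high-reward actions from being underrepresented in the update.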
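Abstract (reference translation): Existing agents for solving tasks such as ML engineering rely on prompting powerful language models. As a result, these agents do not improve as they gain experience. In this paper, we show that agents backed by weaker models that improve via reinforcement learning (RL) can outperform agents backed by much larger but static models. We identify two major challenges for RL in this setting. First, actions take a variable amount of time (e.g., executing code for different solutions), which leads to asynchronous policy gradient updates that favor faster but suboptimal solutions. To handle variable-duration actions, we propose duration-aware gradient updates in a distributed asynchronous RL framework to amplify high-cost but high-reward actions. Second, using only test split performance as the reward provides limited feedback. A nearly correct program is treated the same as one that fails entirely. To address this, we propose environment instrumentation that offers partial credit by distinguishing programs that fail early (e.g., during data loading) from those that are almost correct. Environment instrumentation uses a separate static language model to insert print statements into an existing program to record the agent's experimental progress. On MLEBench, a much smaller model (Qwen2.5-3B) trained with RL outperforms prompting a much larger model (Claude-3.5-Sonnet) with agent scaffolds by an average of 22% across 12 Kaggle tasks.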
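The partial-credit idea can be sketched in the same spirit. The abstract states only that print statements are inserted to log progress and that partial credit is extracted from them. The marker format, the stage names, and the scoring rule (fraction of stages reached in order) below are hypothetical choices for illustration.

```python
# Hypothetical stage markers, in the order a solution is expected to reach them.
STAGES = ("data_loaded", "model_built", "training_done", "submission_written")

# Assumed prefix of the print statements inserted by the instrumentation model.
MARKER_PREFIX = "[progress] "


def stages_reached(stdout: str) -> int:
    """Count how many stages were logged, stopping at the first missing one."""
    seen = {
        line[len(MARKER_PREFIX):].strip()
        for line in stdout.splitlines()
        if line.startswith(MARKER_PREFIX)
    }
    count = 0
    for stage in STAGES:
        if stage not in seen:
            break
        count += 1
    return count


def partial_credit(stdout: str) -> float:
    """Fraction of stages completed, in [0, 1]."""
    return stages_reached(stdout) / len(STAGES)


# Usage: a program that loads data and then crashes earns some credit,
# while a program that fails before logging anything earns none.
crashed_early = ""
crashed_after_load = "[progress] data_loaded\nTraceback (most recent call last):\n"
finished = "\n".join(MARKER_PREFIX + stage for stage in STAGES)
```

With this rule, a program that crashes after data loading is no longer scored the same as one that fails immediately, which is the distinction the abstract describes.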

Related papers