Fugu-MT Paper Translation (Abstract): Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

Paper abstract: Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

arxiv url: http://arxiv.org/abs/2604.24198v1
Date: Mon, 27 Apr 2026 09:00:30 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:07.869005
Title: Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
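Title (reference translation): Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
Authors: Zhisong Qiu, Shuofei Qiao, Kewei Xu, Yuqi Zhu, Lun Du, Ningyu Zhang, Huajun Chen
Abstract summary: Process Reward Models (PRMs) have successfully augmented the reasoning capabilities of Large Language Models (LLMs). This paper shows that general-domain PRMs struggle to supervise data analysis agents. It introduces DataPRM, a novel environment-aware generative process reward model.
Reference score (independently computed attention score): 68.28714988482703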
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present an empirical study revealing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors, logical flaws that yield incorrect results without triggering interpreter exceptions, and erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, we introduce DataPRM, a novel environment-aware generative process reward model that (1) can serve as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2) employs a reflection-aware ternary reward strategy that distinguishes between correctable grounding errors and irrecoverable mistakes. We design a scalable pipeline to construct over 8K high-quality training instances for DataPRM via diversity-driven trajectory generation and knowledge-augmented step-level annotation. Experimental results demonstrate that DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep using Best-of-N inference. Notably, with only 4B parameters, DataPRM outperforms strong baselines and exhibits robust generalizability across diverse Test-Time Scaling strategies. Furthermore, integrating DataPRM into Reinforcement Learning yields substantial gains over outcome-reward baselines, achieving 78.73% on DABench and 64.84% on TableBench, validating the effectiveness of process reward supervision. Code is available at https://github.com/zjunlp/DataMind.
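
To make the notion of a silent error in the abstract concrete, here is a minimal sketch; it is not taken from the paper. The data, the `-1` sentinel, and the function names `naive_mean` and `probed_mean` are all hypothetical. The first function runs without raising anything yet returns a wrong value, while the second inspects the intermediate state before aggregating, which is roughly the kind of check an active verifier might perform.

```python
from statistics import mean

# Hypothetical readings; -1 is assumed to be a sentinel for "missing".
readings = [12.0, 15.0, -1, 14.0, -1, 13.0]


def naive_mean(values):
    """Runs without any exception, but silently averages in the sentinels."""
    return mean(values)


def probed_mean(values, sentinel=-1):
    """Inspect the intermediate state first, then average only valid values."""
    valid = [v for v in values if v != sentinel]
    if not valid:
        raise ValueError("no valid readings")
    return mean(valid)


silent_result = naive_mean(readings)    # wrong, yet no interpreter error
checked_result = probed_mean(readings)  # sentinel values excluded
```

The naive result falls below every genuine reading, a mismatch that only shows up when the intermediate values are examined rather than the exit status.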
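Abstract (reference translation): Process Reward Models (PRMs) have been highly successful at augmenting the reasoning capabilities of Large Language Models (LLMs) in static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. This work first presents an empirical study showing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors (logical flaws that yield incorrect results without triggering interpreter exceptions), and they erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, the authors introduce DataPRM, a novel environment-aware generative process reward model that (1) serves as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2) employs a reflection-aware ternary reward strategy that distinguishes correctable grounding errors from irrecoverable mistakes. They design a scalable pipeline that constructs over 8K high-quality training instances for DataPRM through diversity-driven trajectory generation and knowledge-augmented step-level annotation. Experiments show that, with Best-of-N inference, DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep. Notably, with only 4B parameters, DataPRM outperforms strong baselines and generalizes robustly across diverse Test-Time Scaling strategies. Furthermore, integrating DataPRM into Reinforcement Learning yields substantial gains over outcome-reward baselines, reaching 78.73% on DABench and 64.84% on TableBench, which validates the effectiveness of process reward supervision. Code is available at https://github.com/zjunlp/DataMind.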
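
As a rough illustration of how ternary step-level rewards could drive Best-of-N selection, consider the sketch below. It is not DataPRM's implementation: the label names, the numeric values in `REWARD`, and the mean aggregation in `trajectory_score` are all assumptions, since the abstract specifies none of them.

```python
from enum import Enum


class StepLabel(Enum):
    CORRECT = "correct"
    CORRECTABLE = "correctable"      # e.g. a grounding error the agent can still fix
    IRRECOVERABLE = "irrecoverable"  # a mistake that invalidates the trajectory


# Assumed numeric rewards; the abstract does not state the actual values.
REWARD = {
    StepLabel.CORRECT: 1.0,
    StepLabel.CORRECTABLE: 0.0,
    StepLabel.IRRECOVERABLE: -1.0,
}


def trajectory_score(labels):
    """Average the step rewards (an assumed aggregation rule)."""
    if not labels:
        return float("-inf")
    return sum(REWARD[label] for label in labels) / len(labels)


def best_of_n(candidates):
    """Return the answer whose step labels earn the highest score.

    candidates: list of (answer, step_labels) pairs.
    """
    return max(candidates, key=lambda c: trajectory_score(c[1]))[0]


C, R, X = StepLabel.CORRECT, StepLabel.CORRECTABLE, StepLabel.IRRECOVERABLE
candidates = [
    ("A", [C, R, C]),        # one exploratory misstep, later corrected
    ("B", [C, X, C]),        # contains an irrecoverable mistake
    ("C", [C, C, C, R, C]),  # mostly correct steps
]
selected = best_of_n(candidates)
```

Under these assumed values, a trajectory with a correctable misstep still outranks one with an irrecoverable mistake, so exploration is not punished as harshly as a fatal error.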

Related papers