Fugu-MT Paper Translation (Abstract): DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents

Paper summary: DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents

arxiv url: http://arxiv.org/abs/2606.17029v1
Date: Mon, 15 Jun 2026 17:52:27 GMT
Status: Translation complete
Last updated in system: 2026-06-16 18:36:05.132202
Title: DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents
Authors: Minghang Zhu, Chuyang Wei, Junhao Xu, Yilin Cheng, Zhumin Chen, Jiyan He,
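Abstract summary: Reinforcement learning with rubric-based rewards improves deep research agents by optimizing them against checkable criteria that translate report quality into reward signals. Most existing studies ask an LLM to generate rubrics for a given query, but when the model fails to infer the underlying information needs, the generated rubrics may be incomplete and reduce RL efficiency. To obtain more reliable query-rubric supervision, the authors introduce DeepRubric, a data construction framework that reverses this process.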
Reference score (attention score computed independently by this site): 17.420077157633852
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep research agents synthesize long-form reports by searching and reasoning over retrieved evidence. Reinforcement learning with rubric-based rewards improves these agents by optimizing them against checkable criteria that translate report quality into reward signals, but its efficiency depends on whether those criteria reliably capture the task scope and evidence needs. Most existing studies ask an LLM to generate rubrics for a given query, but when the model fails to infer the underlying information needs, the generated rubrics may be incomplete and reduce RL efficiency. To obtain more reliable query-rubric supervision, we introduce DeepRubric, a data construction framework that reverses this process: instead of inferring evaluation criteria for a given query, it first determines what an evidence-backed report should be evaluated on and then synthesizes aligned query-rubric pairs from those evaluation targets. Starting from a sampled seed topic, DeepRubric builds an evidence tree by recursively expanding evidence-backed sub-questions, whose leaves serve as atomic and verifiable evaluation targets. It then uses the evidence tree to synthesize the training query and rubrics, ensuring that the reward evaluates exactly the information requested by the query. Using DeepRubric, we construct 9K query-rubric supervision examples and train DeepRubric-8B with rubric-based GRPO, achieving comparable performance to prior open state-of-the-art deep research models across three benchmarks with roughly 13x fewer RL GPU-hours.
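The abstract describes an evidence tree whose leaves become rubric items, and a rubric-based reward used with GRPO. The following is a minimal Python sketch of that idea, not the authors' implementation. Everything named here is an assumption made for illustration: the `EvidenceNode` structure, the aggregation of the reward as the fraction of satisfied rubric items, the toy `keyword_judge` that stands in for an LLM judge, and the example topic. The only part taken from general knowledge of GRPO is that rewards are standardized within each sampled group to obtain advantages.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class EvidenceNode:
    """One evidence-backed sub-question in the tree (hypothetical structure)."""
    question: str
    evidence: str = ""
    children: List["EvidenceNode"] = field(default_factory=list)


def leaves(node: EvidenceNode) -> List[EvidenceNode]:
    """Collect leaf nodes depth-first; each leaf is one atomic evaluation target."""
    if not node.children:
        return [node]
    found: List[EvidenceNode] = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def rubric_reward(report: str, rubrics: List[str],
                  judge: Callable[[str, str], bool]) -> float:
    """Assumed aggregation: fraction of rubric items the judge marks satisfied."""
    if not rubrics:
        return 0.0
    return sum(judge(report, r) for r in rubrics) / len(rubrics)


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keyword_judge(report: str, rubric: str) -> bool:
    """Toy stand-in for an LLM judge: the rubric text must appear in the report."""
    return rubric.lower() in report.lower()


# Illustrative seed topic and sub-questions (invented for this sketch).
tree = EvidenceNode(
    "solid-state batteries",
    children=[
        EvidenceNode("electrolyte materials", children=[
            EvidenceNode("sulfide electrolytes", evidence="doc-1"),
            EvidenceNode("oxide electrolytes", evidence="doc-2"),
        ]),
        EvidenceNode("dendrite formation", evidence="doc-3"),
    ],
)

# Leaves of the tree become the rubric items for the synthesized query.
rubrics = [leaf.question for leaf in leaves(tree)]

# Two sampled reports for the same query form one group.
reports = [
    "Covers sulfide electrolytes, oxide electrolytes and dendrite formation.",
    "Covers dendrite formation only.",
]
rewards = [rubric_reward(rep, rubrics, keyword_judge) for rep in reports]
advantages = group_advantages(rewards)
```

In this sketch the report that covers every leaf receives the higher reward and therefore a positive advantage within its group, while the partial report receives a negative one. Because the rubric items come from the same tree that defines the query, the reward checks only information the query asks for, which is the alignment property the abstract emphasizes.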

