Fugu-MT Paper Translation (Abstract): MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning

Paper summary: MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning

arxiv url: http://arxiv.org/abs/2604.17282v1
Date: Sun, 19 Apr 2026 06:44:07 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.434071
Title: MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning
Title (reference translation): MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning
Authors: Lingyan Wu, Xiang Zheng, Weiqi Zhai, Wei Wang, Xuan Ren, Zifan Zhang, Hu Wei, Bing Zhao
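Abstract summary: We propose MedPRMBench, the first process-level reward model benchmark for the medical domain. It is built through a three-phase pipeline based on Clinical Reasoning Blueprints (CRBs). The benchmark comprises 6,500 questions with 13,000 reasoning chains and 113,910 step-level labels, plus 6,879 questions for training.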
Reference score (independently computed attention score): 7.000170880015254
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Process-Level Reward Models (PRMs) are essential for guiding complex reasoning in large language models, yet existing PRM benchmarks cover only general domains such as mathematics, failing to address medical reasoning, which is uniquely characterized by safety criticality, knowledge intensity, and diverse error patterns. Without a reliable medical PRM evaluation framework, we cannot quantify models' error detection capabilities in clinical reasoning, leaving their safety in real-world healthcare applications unverified. We propose MedPRMBench, the first process-level reward model benchmark for the medical domain. Built through a three-phase pipeline based on Clinical Reasoning Blueprints (CRBs), MedPRMBench systematically generates high-quality evaluation data from seven medical QA sources, covering 14 fine-grained error types across three categories (Simplicity, Soundness, and Sensitivity) with the first 4-level severity grading system to quantify clinical impact. The benchmark comprises 6,500 questions with 13,000 reasoning chains and 113,910 step-level labels, plus 6,879 questions for training. Our medical PRM baseline achieves an 87.1% overall PRMScore, substantially surpassing all baselines, and serves as a plug-and-play verifier that improves downstream medical QA accuracy by 3.2 to 6.7 percentage points. Systematic evaluation spanning proprietary frontier models, open-source reasoning models, and medical-specialized models reveals critical weaknesses in current models' medical reasoning error detection capabilities, providing clear directions for future PRM improvement.
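Abstract (reference translation): Process-Level Reward Models (PRMs) are essential for guiding complex reasoning in large language models, but existing PRM benchmarks cover only general domains such as mathematics and fail to address medical reasoning. Without a reliable medical PRM evaluation framework, models' error detection capabilities in clinical reasoning cannot be quantified. We propose MedPRMBench, the first process-level reward model benchmark for the medical domain. Built through a three-phase pipeline based on Clinical Reasoning Blueprints (CRBs), MedPRMBench systematically generates high-quality evaluation data from seven medical QA sources, covering 14 fine-grained error types across three categories (Simplicity, Soundness, and Sensitivity) with the first 4-level severity grading system to quantify clinical impact. The benchmark comprises 6,500 questions with 13,000 reasoning chains and 113,910 step-level labels, plus 6,879 questions for training. Our medical PRM baseline achieves an 87.1% PRMScore, substantially surpassing all baselines, and serves as a plug-and-play verifier that improves downstream medical QA accuracy by 3.2 to 6.7 percentage points. Systematic evaluation spanning proprietary frontier models, open-source reasoning models, and medical-specialized models reveals critical weaknesses in current models' medical reasoning error detection capabilities, providing clear directions for future PRM improvement.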

Related papers