Fugu-MT Paper Translation (Summary): GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models

Paper summary: GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models

arxiv url: http://arxiv.org/abs/2605.01203v2
Date: Thu, 07 May 2026 08:32:54 GMT
Status: Translation complete
System update date: 2026-05-08 17:36:05.952755
Title: GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models
Title (reference translation): GR-Ben: A general reasoning benchmark for evaluating process reward models
Authors: Zhouhao Sun, Xuan Zhang, Xiao Ding, Bibo Cai, Li Du, Kai Xiong, Xinran Dai, Fei Zhang, Weidi Tang, Zhiyuan Kan, Yang Zhao, Bing Qin, Ting Liu
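Abstract summary: GR-Ben is a process-level benchmark that evaluates PRM performance across two primary reasoning domains (science and logic) and nine subdomains. In domains beyond mathematical reasoning, the error-detection ability of existing PRMs and LLMs is markedly weaker by comparison. In general, PRMs are less adept at identifying knowledge-based errors, whereas LLMs perform worse at detecting computational errors.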
Reference score (independently calculated attention score): 55.56903314809719
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Currently, process reward models (PRMs) have exhibited remarkable potential for test-time scaling. Since large language models (LLMs) regularly generate flawed intermediate reasoning steps when tackling a broad spectrum of reasoning and decision-making tasks, PRMs are required to possess capabilities for detecting process-level errors in real-world scenarios. However, existing benchmarks primarily focus on mathematical reasoning, thereby failing to comprehensively evaluate the error detection ability of PRMs across diverse reasoning scenarios. To mitigate this gap, we introduce GR-Ben, a process-level benchmark specifically designed for assessing PRM performance across two primary reasoning domains (science and logic) and nine subdomains. We conduct extensive experiments on a diverse set of 22 models, encompassing both PRMs and LLMs, and derive two key findings: (1) In domains beyond mathematical reasoning, the error-detection ability of existing PRMs and LLMs is found to be markedly weaker by comparison. (2) In general, PRMs are less adept at identifying knowledge-based errors, whereas LLMs exhibit poorer performance in detecting computational errors. We hope GR-Ben can foster future research on PRMs for general domains, thereby enhancing the reasoning capabilities of LLMs.
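Abstract (reference translation): Process reward models (PRMs) currently show remarkable potential for test-time scaling. Because large language models (LLMs) regularly generate flawed intermediate reasoning steps when handling a wide range of reasoning and decision-making tasks, PRMs need to be able to detect process-level errors in real-world scenarios. However, existing benchmarks focus mainly on mathematical reasoning and cannot comprehensively evaluate the error-detection ability of PRMs across diverse reasoning scenarios. To close this gap, we introduce GR-Ben, a process-level benchmark for evaluating PRM performance across two primary reasoning domains (science and logic) and nine subdomains. Extensive experiments on a diverse set of 22 models, covering both PRMs and LLMs, yield two findings: (1) in domains beyond mathematical reasoning, the error-detection ability of existing PRMs and LLMs is markedly weaker by comparison; (2) in general, PRMs are less adept at identifying knowledge-based errors, whereas LLMs perform worse at detecting computational errors. We hope GR-Ben will encourage research on PRMs for general domains and thereby enhance the reasoning capabilities of LLMs.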

Related papers