Fugu-MT paper translation (abstract): Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

Paper overview: Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

arxiv url: http://arxiv.org/abs/2605.29648v1
Date: Thu, 28 May 2026 09:14:37 GMT
Status: translation complete
Last updated in system: 2026-05-30 02:45:56.100737
Title: Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering
Authors: Shicheng Fan, Haochang Hao, Dehai Min, Weihao Liu, Philip S. Yu, Lu Cheng
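Abstract summary: This paper proposes CorVer, a lightweight, plug-in-ready process reward that replaces neural verifiers with a corpus-grounded signal derived from Wikipedia. Across 30 (model, benchmark) cells spanning six instruction-tuned models, CorVer improves over the raw baseline in every cell. It also outperforms four neural-verifier baselines in 18 of 20 cells under their feasible configurations, while training 4.8 to 8.4x faster.
Reference score (independently computed attention score): 44.82662196757139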
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statements within a reasoning trace. Sentence-level alternatives offer finer-grained feedback, but typically rely on NLI verifiers, LLM judges, or knowledge-verification pipelines that are expensive to deploy at RL scale and often unreliable for rare-entity facts, where accurate reward signals are especially important. We propose CorVer (Corpus Verify), a lightweight, plug-in-ready process reward that replaces neural verifiers with a corpus-grounded signal derived from Wikipedia co-occurrence statistics. CorVer assigns sentence-level credit and maps it to token-level advantages via a simple alignment, requiring only a 0.5B extractor and a single corpus lookup per sentence. Across 30 (model, benchmark) cells spanning six instruction-tuned models (3B to 14B) and five QA benchmarks, CorVer improves over the raw baseline for every cell, with an average TriviaQA gain of +4.1 pp. It also outperforms four neural-verifier baselines in 18 of 20 cells under their feasible configurations, while training 4.8 to 8.4x faster.
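The abstract says that CorVer scores each sentence against corpus co-occurrence statistics and then maps that sentence-level credit to token-level advantages. The following Python sketch illustrates that general idea only and is not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the pair-coverage scoring rule, the mean-centering step, whitespace tokenization, and the toy corpus. The entity lists are written by hand and stand in for the output of the 0.5B extractor.

```python
from itertools import combinations


def cooccurrence_score(entities, corpus_docs):
    """Hypothetical score: fraction of entity pairs found together in some document."""
    pairs = list(combinations(sorted(set(entities)), 2))
    if not pairs:
        # Assumption: a sentence with fewer than two entities gets a score of 0.0.
        return 0.0
    hits = sum(
        any(a in doc and b in doc for doc in corpus_docs)
        for a, b in pairs
    )
    return hits / len(pairs)


def token_advantages(sentences, sentence_scores, tokenize=str.split):
    """Give every token the mean-centered score of the sentence it belongs to."""
    if not sentence_scores:
        return []
    mean = sum(sentence_scores) / len(sentence_scores)
    advantages = []
    for sentence, score in zip(sentences, sentence_scores):
        advantages.extend([score - mean] * len(tokenize(sentence)))
    return advantages


# Toy corpus: each document is reduced to the set of entities it mentions.
corpus_docs = [{"Paris", "France", "Seine"}, {"Berlin", "Germany"}]

sentences = ["Paris is in France", "Paris is in Germany"]
# Hand-written entity lists standing in for an extractor's output.
entities_per_sentence = [["Paris", "France"], ["Paris", "Germany"]]

scores = [cooccurrence_score(e, corpus_docs) for e in entities_per_sentence]
advantages = token_advantages(sentences, scores)
```

In this toy run the first sentence is supported by the corpus and the second is not, so tokens of the first sentence receive a positive advantage and tokens of the second a negative one. A real system would align scores to model tokenizer offsets, not whitespace tokens.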


Related papers