Fugu-MT Paper Translation (Summary): Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric

Paper summary: Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric

arxiv url: http://arxiv.org/abs/2602.14069v1
Date: Sun, 15 Feb 2026 09:39:39 GMT
Status: Translation complete
Last updated in system: 2026-02-17 14:17:28.6371
Title: Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric
Title (reference translation): Open Rubric System: Scaling Up Reinforcement Learning with Pairwise Adaptive Rubric
Authors: Ruipeng Jia, Yunyi Yang, Yuxin Wu, Yongbo Gai, Siyuan Tao, Mengyu Zhou, Jianhe Lin, Xiaoxi Jiang, Guanjun Jiang,
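Abstract summary: Scalar reward models compress multi-dimensional human preferences into a single opaque score. The paper presents the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework. OpenRS uses an explicit meta-rubric, a constitution-like specification that governs how rubrics are instantiated, weighted, and enforced.
Reference score (independently calculated attention score): 10.220923271217632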
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scalar reward models compress multi-dimensional human preferences into a single opaque score, creating an information bottleneck that often leads to brittleness and reward hacking in open-ended alignment. We argue that robust alignment for non-verifiable tasks is fundamentally a principle generalization problem: reward should not be a learned function internalized into a judge, but an explicit reasoning process executed under inspectable principles. To operationalize this view, we present the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs), which provide both hard-constraint guardrails and verifiable reward components when ground-truth or programmatic checks are available. OpenRS uses an explicit meta-rubric -- a constitution-like specification that governs how rubrics are instantiated, weighted, and enforced -- and instantiates adaptive rubrics on the fly by conditioning on the semantic differences between two candidate responses. It then performs criterion-wise pairwise comparisons and aggregates criterion-level preferences externally, avoiding pointwise weighted scalarization while improving discriminability in open-ended settings. To keep principles consistent yet editable across various domains, we introduce a two-level meta-rubric refinement pipeline (automated evolutionary refinement for general principles and a reproducible human-in-the-loop procedure for domain principles), complemented with pointwise verifiable rubrics that act as both guardrails against degenerate behaviors and a source of verifiable reward for objective sub-tasks. Finally, we instantiate OpenRS as reward supervision in pairwise RL training.
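Abstract (reference translation): Scalar reward models compress multi-dimensional human preferences into a single opaque score, creating an information bottleneck that often leads to brittleness and reward hacking in open-ended alignment. Robust alignment for non-verifiable tasks is fundamentally a principle generalization problem: reward should not be a learned function internalized into a judge, but an explicit reasoning process executed under inspectable principles. The proposed Open Rubric System (OpenRS) is a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs). OpenRS uses an explicit meta-rubric, a constitution-like specification that governs how rubrics are instantiated, weighted, and enforced, and instantiates adaptive rubrics on the fly by conditioning on the semantic differences between two candidate responses. It then performs criterion-wise pairwise comparisons and aggregates criterion-level preferences externally, avoiding pointwise weighted scalarization and improving discriminability in open-ended settings. To keep principles consistent yet editable across domains, a two-level meta-rubric refinement pipeline is introduced (automated evolutionary refinement for general principles and a reproducible human-in-the-loop procedure for domain principles), complemented by pointwise verifiable rubrics that act both as guardrails against degenerate behaviors and as a source of verifiable reward for objective sub-tasks. Finally, OpenRS is instantiated as reward supervision in pairwise RL training.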
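
Illustrative sketch (not from the paper): the abstract describes criterion-wise pairwise comparison, external aggregation of criterion-level preferences, and pointwise verifiable rubrics used as guardrails. The Python below is one possible reading of that flow. Every name in it (`Criterion`, `pairwise_preference`, `toy_judge`, `non_empty`) is hypothetical, the weighted vote is an assumed aggregation rule that the abstract does not specify, and the judge is a keyword-based stand-in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float  # assumed importance; the paper's weighting scheme is not given here


# A judge returns +1 if response A is preferred on a criterion, -1 if B, 0 for a tie.
Judge = Callable[[Criterion, str, str], int]
# A verifiable check returns True when a response passes a programmatic constraint.
Check = Callable[[str], bool]


def pairwise_preference(
    a: str,
    b: str,
    criteria: Sequence[Criterion],
    judge: Judge,
    checks: Sequence[Check] = (),
) -> float:
    """Return a value in [-1, 1]; positive means A is preferred overall."""
    # Guardrail step: if exactly one response fails a hard check, the other wins.
    a_ok = all(check(a) for check in checks)
    b_ok = all(check(b) for check in checks)
    if a_ok != b_ok:
        return 1.0 if a_ok else -1.0
    # Aggregation step: combine per-criterion preferences outside the judge.
    total = sum(c.weight for c in criteria)
    if total == 0:
        return 0.0
    return sum(c.weight * judge(c, a, b) for c in criteria) / total


def toy_judge(criterion: Criterion, a: str, b: str) -> int:
    # Stand-in for an LLM judge: compares a crude per-criterion signal.
    if criterion.name == "detail":
        sa, sb = len(a.split()), len(b.split())
    else:  # treated as "politeness"
        sa, sb = int("please" in a.lower()), int("please" in b.lower())
    return (sa > sb) - (sa < sb)


def non_empty(response: str) -> bool:
    # Example of a pointwise, programmatically verifiable constraint.
    return len(response.strip()) > 0


CRITERIA = [Criterion("detail", 2.0), Criterion("politeness", 1.0)]

if __name__ == "__main__":
    score = pairwise_preference(
        "Restart the service and then check the logs",
        "Please restart",
        CRITERIA,
        toy_judge,
        checks=[non_empty],
    )
    print(f"preference for A: {score:+.3f}")
```

In this sketch the judge is never asked for an absolute score for either response. It only states which response wins on each criterion, and the combination happens in ordinary code that can be inspected and edited. The result could then serve as a pairwise reward signal.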

Related papers