Fugu-MT Paper Translation (Abstract): Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling

Paper overview: Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling

arxiv url: http://arxiv.org/abs/2510.17314v1
Date: Mon, 20 Oct 2025 09:01:37 GMT
Status: Translation complete
Last updated in system: 2025-10-25 00:56:39.377529
Title: Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling
Title (reference translation): Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling
Authors: Lipeng Xie, Sen Huang, Zhuo Zhang, Anni Zou, Yunpeng Zhai, Dingchao Ren, Kezun Zhang, Haoyuan Hu, Boyin Liu, Haoran Chen, Zhaoyang Liu, Bolin Ding
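Abstract summary: Reward models are essential for aligning large language models with human values, but their development is hampered by costly preference datasets and poor interpretability. The authors build a training-free framework that infers high-quality, query-specific rubrics using a validation-guided Propose-Evaluate-Revise pipeline. Using just 70 preference pairs (1.5% of the source data), even smaller models such as Qwen3-8B outperform specialized, fully trained counterparts.
Reference score (independently computed attention score): 37.237020102873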
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Reward models are essential for aligning Large Language Models (LLMs) with human values, yet their development is hampered by costly preference datasets and poor interpretability. While recent rubric-based approaches offer transparency, they often lack systematic quality control and optimization, creating a trade-off between scalability and reliability. We address these limitations with a novel, training-free framework built on a key assumption: evaluation rubrics underlying human preferences exhibit significant generalization ability across diverse queries, a property that enables remarkable data efficiency. Our two-stage approach first infers high-quality, query-specific rubrics using a validation-guided Propose-Evaluate-Revise pipeline. Second, it generalizes these granular rubrics into a compact, non-redundant core set by maximizing an information-theoretic coding rate. The final output is an interpretable, hierarchical "Theme-Tips" rubric set. Extensive experiments demonstrate the framework's exceptional data efficiency and performance. Critically, using just 70 preference pairs (1.5% of the source data), our method also empowers smaller models like Qwen3-8B to outperform specialized, fully-trained counterparts. This work pioneers a scalable, interpretable, and data-efficient path for reward modeling.
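Abstract (reference translation): Reward models are essential for aligning Large Language Models (LLMs) with human values, but their development is hampered by costly preference datasets and poor interpretability. Recent rubric-based approaches offer transparency but often lack systematic quality control and optimization, creating a trade-off between scalability and reliability. Evaluation rubrics underlying human preferences exhibit significant generalization ability across diverse queries, a property that markedly improves data efficiency. Our two-stage approach first infers high-quality, query-specific rubrics using a validation-guided Propose-Evaluate-Revise pipeline. Second, it generalizes these granular rubrics into a compact, non-redundant core set by maximizing an information-theoretic coding rate. The final output is an interpretable, hierarchical "Theme-Tips" rubric set. Extensive experiments demonstrate the framework's exceptional data efficiency and performance. Importantly, using just 70 preference pairs (1.5% of the source data), even smaller models such as Qwen3-8B can outperform specialized, fully trained counterparts. This work pioneers a scalable, interpretable, and data-efficient path for reward modeling.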
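Illustrative note: the abstract says the second stage selects a compact, non-redundant core set by maximizing an information-theoretic coding rate, but it does not give the formula or the selection procedure. The sketch below is therefore only an assumption-laden illustration, not the authors' implementation. It assumes each rubric is represented by an embedding vector, uses the coding rate as commonly defined in the rate-distortion literature, R(E) = 1/2 logdet(I + d/(n eps^2) E^T E), and picks rubrics greedily. The names coding_rate and greedy_select and the value of eps are hypothetical.

```python
import numpy as np

def coding_rate(E, eps=0.5):
    """Coding rate of n row vectors of dimension d (assumed form):
    R(E) = 1/2 * logdet(I + d / (n * eps^2) * E^T E)."""
    n, d = E.shape
    gram = E.T @ E
    # slogdet returns (sign, log|det|); the matrix here is positive definite.
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * gram)
    return 0.5 * logdet

def greedy_select(E, k, eps=0.5):
    """Greedily pick k row indices of E (k <= len(E)) so that the
    chosen subset has a large coding rate. Hypothetical selection rule."""
    selected = []
    remaining = list(range(len(E)))
    for _ in range(k):
        # Try each remaining candidate and keep the one that maximizes the rate.
        best = max(remaining, key=lambda i: coding_rate(E[selected + [i]], eps))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy embeddings: rows 0 and 1 are duplicates, row 2 points in a new direction.
E = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(greedy_select(E, 2))
```

In this toy case the duplicate row adds no new direction, so the greedy step prefers the orthogonal row, which is the non-redundancy behavior the abstract describes at a high level.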

Related papers