Fugu-MT Paper Translation (Summary): Robust Reward Modeling via Causal Rubrics

Paper summary: Robust Reward Modeling via Causal Rubrics

arxiv url: http://arxiv.org/abs/2506.16507v1
Date: Thu, 19 Jun 2025 17:59:47 GMT
Status: Translation complete
Last updated in system: 2025-06-23 19:00:05.207659
Title: Robust Reward Modeling via Causal Rubrics
Title (reference translation): Robust Reward Modeling Using Causal Rubrics
Authors: Pragya Srivastava, Harman Singh, Rahul Madhavan, Gandharv Patil, Sravanti Addepalli, Arun Suggala, Rengarajan Aravamudhan, Soumya Sharma, Anirban Laha, Aravindan Raghuveer, Karthikeyan Shanmugam, Doina Precup
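Abstract summary: Reward models (RMs) are fundamental to aligning Large Language Models (LLMs) via human feedback, yet they often suffer from reward hacking. Crome is a novel framework grounded in an explicit causal model designed to mitigate reward hacking. Crome significantly outperforms standard baselines on RewardBench, improving average accuracy by up to 5.4% and achieving gains of up to 13.2% and 7.2% in specific categories.
Reference score (attention score computed independently by this site): 46.35051816438772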
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reward models (RMs) are fundamental to aligning Large Language Models (LLMs) via human feedback, yet they often suffer from reward hacking. They tend to latch on to superficial or spurious attributes, such as response length or formatting, mistaking these cues learned from correlations in training data for the true causal drivers of quality (e.g., factuality, relevance). This occurs because standard training objectives struggle to disentangle these factors, leading to brittle RMs and misaligned policies. We introduce Crome (Causally Robust Reward Modeling), a novel framework grounded in an explicit causal model designed to mitigate reward hacking. Crome employs the following synthetic targeted augmentations during training: (1) Causal Augmentations, which are pairs that differ along specific causal attributes, to enforce sensitivity along each causal attribute individually, and (2) Neutral Augmentations, which are tie-label pairs varying primarily in spurious attributes, to enforce invariance along spurious attributes. Notably, our augmentations are produced without any knowledge of spurious factors, via answer interventions only along causal rubrics, that are identified by querying an oracle LLM. Empirically, Crome significantly outperforms standard baselines on RewardBench, improving average accuracy by up to 5.4% and achieving gains of up to 13.2% and 7.2% in specific categories. The robustness of Crome is further testified by the consistent gains obtained in a Best-of-N inference setting across increasing N, across various benchmarks, including the popular RewardBench (covering chat, chat-hard, safety, and reasoning tasks), the safety-focused WildGuardTest, and the reasoning-specific GSM8k.
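Abstract (reference translation): Reward models (RMs) are fundamental to aligning Large Language Models (LLMs) via human feedback, yet they often suffer from reward hacking. They tend to latch on to superficial or spurious attributes, such as response length or formatting, mistaking these cues, learned from correlations in the training data, for the true causal drivers of quality (e.g., factuality, relevance). This happens because standard training objectives struggle to disentangle these factors, leading to brittle RMs and misaligned policies. Crome (Causally Robust Reward Modeling) is a novel framework grounded in an explicit causal model designed to mitigate reward hacking. During training, Crome uses two kinds of synthetic targeted augmentations: (1) Causal Augmentations, pairs that differ along specific causal attributes, which enforce sensitivity along each causal attribute individually, and (2) Neutral Augmentations, tie-label pairs varying primarily in spurious attributes, which enforce invariance along spurious attributes. Notably, the augmentations are produced without any knowledge of spurious factors, through answer interventions only along causal rubrics, which are identified by querying an oracle LLM. Empirically, Crome significantly outperforms standard baselines on RewardBench, improving average accuracy by up to 5.4% and achieving gains of up to 13.2% and 7.2% in specific categories. Crome's robustness is further supported by consistent gains in a Best-of-N inference setting across increasing N and across various benchmarks, including the popular RewardBench (covering chat, chat-hard, safety, and reasoning tasks), the safety-focused WildGuardTest, and the reasoning-specific GSM8k.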
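
Illustrative sketch (not from the paper): the two augmentation types named in the abstract can be pictured as labeled response pairs fed to a pairwise reward-model loss. Everything below is an assumption made for illustration. The names `Pair`, `causal_pairs`, `neutral_pair`, and `pair_loss` are hypothetical, the `degrade` callback is a stand-in for the oracle LLM's rubric-guided rewrite, and the soft-label Bradley-Terry loss is one common way to handle ties, not necessarily the objective Crome uses. The abstract does not say how neutral pairs are constructed, so the sketch simply takes two responses as given.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Pair:
    prompt: str
    response_a: str
    response_b: str
    label: float  # 1.0 = A preferred, 0.0 = B preferred, 0.5 = tie


def causal_pairs(prompt: str, answer: str, rubrics: List[str],
                 degrade: Callable[[str, str], str]) -> List[Pair]:
    """One pair per causal attribute: the original answer versus a copy
    degraded along that single attribute (hypothetical LLM rewrite)."""
    return [Pair(prompt, answer, degrade(answer, r), 1.0) for r in rubrics]


def neutral_pair(prompt: str, response_a: str, response_b: str) -> Pair:
    """Tie-labeled pair: the two responses are assumed to be causally
    equivalent and to differ mainly in spurious attributes."""
    return Pair(prompt, response_a, response_b, 0.5)


def _softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pair_loss(score_a: float, score_b: float, label: float) -> float:
    """Cross-entropy against the Bradley-Terry probability
    P(A preferred) = sigmoid(score_a - score_b).
    A label of 0.5 is minimized when both scores are equal (assumed
    tie handling)."""
    d = score_a - score_b
    log_p_a = -_softplus(-d)  # log sigmoid(d)
    log_p_b = -_softplus(d)   # log sigmoid(-d)
    return -(label * log_p_a + (1.0 - label) * log_p_b)
```

Under these assumptions, a causal pair penalizes a reward model that fails to score the degraded answer lower, while a tie-labeled neutral pair penalizes any score gap between two responses that should be equally good.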


Related papers