Fugu-MT Paper Translation (Abstract): From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism

Paper abstract: From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism

arxiv url: http://arxiv.org/abs/2604.04648v1
Date: Mon, 06 Apr 2026 12:58:11 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:19.194201
Title: From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism
Authors: Zhuohao Yu, Zhiwei Steven Wu, Adam Block
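Abstract summary: We show that caution is a simple, computationally efficient approach that substantially mitigates reward hacking in BoN sampling. We also provide a theoretical analysis in a simplified linear setting, which shows that caution provably improves over the standard BoN approach.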
Reference score (independently computed attention score): 30.96634743446629
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Inference-time compute scaling has emerged as a powerful paradigm for improving language model performance on a wide range of tasks, but the question of how best to use the additional compute remains open. A popular approach is Best-of-N (BoN) sampling, where N candidate responses are generated, scored according to a reward model, and the highest-scoring response is selected. While this approach can improve performance, it is vulnerable to reward hacking, where performance degrades as N increases due to the selection of responses that exploit imperfections in the reward model instead of genuinely improving generation quality. Prior attempts to mitigate reward hacking, via stronger reward models or heavy-handed distributional regularization, either fail to fully address over-optimization or are too conservative to exploit additional compute. In this work, we explore the principle of pessimism in reinforcement learning (RL), which uses lower confidence bounds on value estimates to avoid out-of-distribution (OOD) actions with uncertain reward estimates. Our approach, termed caution, can be seen as the reverse of curiosity: where curiosity rewards prediction error as a signal of novelty, caution penalizes prediction error as a signal of distributional uncertainty. Practically, caution trains an error model on typical responses and uses its prediction error to lower reward estimates for atypical ones. Our extensive empirical evaluation demonstrates that caution is a simple, computationally efficient approach that substantially mitigates reward hacking in BoN sampling. We also provide a theoretical analysis in a simplified linear setting, which shows that caution provably improves over the standard BoN approach. Together, our results not only establish caution as a practical solution to reward hacking, but also provide evidence that curiosity-based approaches can be a general OOD detection technique in large language model (LLM) settings.
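The selection rule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it only assumes that each response has a single scalar feature (a stand-in for an embedding), that the error model is a line fit to a hypothetical fixed target function on typical responses, and that the pessimistic score is the proxy reward minus `penalty_weight` times the squared prediction error. The names `Response`, `fixed_target`, `fit_error_model`, `proxy_reward`, and `penalty_weight` are all illustrative assumptions.

```python
from typing import Callable, NamedTuple, Sequence


class Response(NamedTuple):
    text: str
    feature: float  # toy scalar stand-in for a response embedding (assumption)


def fixed_target(x: float) -> float:
    """Hypothetical fixed function that the error model tries to imitate."""
    return x * x


def fit_error_model(typical: Sequence[float]) -> Callable[[float], float]:
    """Fit a line to fixed_target on typical features (closed-form least squares).

    Returns a function giving the squared prediction error at a feature value.
    The error is small near the training data and grows far away from it.
    """
    n = len(typical)
    ys = [fixed_target(x) for x in typical]
    mean_x = sum(typical) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in typical)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(typical, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x

    def prediction_error(x: float) -> float:
        return (fixed_target(x) - (slope * x + intercept)) ** 2

    return prediction_error


def proxy_reward(response: Response) -> float:
    """Imperfect toy reward model: keeps increasing with the feature (assumption)."""
    return response.feature


def best_of_n(candidates: Sequence[Response],
              reward: Callable[[Response], float]) -> Response:
    """Standard BoN: pick the candidate with the highest proxy reward."""
    return max(candidates, key=reward)


def cautious_best_of_n(candidates: Sequence[Response],
                       reward: Callable[[Response], float],
                       prediction_error: Callable[[float], float],
                       penalty_weight: float = 1.0) -> Response:
    """Pessimistic BoN: subtract the error model's prediction error from the reward."""
    def pessimistic_score(c: Response) -> float:
        return reward(c) - penalty_weight * prediction_error(c.feature)
    return max(candidates, key=pessimistic_score)


# Typical responses have features in [0, 1]; the last candidate lies far outside.
prediction_error = fit_error_model([0.0, 0.25, 0.5, 0.75, 1.0])
candidates = [
    Response("typical answer", 0.9),
    Response("another typical answer", 0.5),
    Response("reward-hacking answer", 3.0),
]

print(best_of_n(candidates, proxy_reward).text)
print(cautious_best_of_n(candidates, proxy_reward, prediction_error).text)
```

In this toy setup, standard BoN selects the atypical candidate because its proxy reward is highest, while the penalized rule selects a typical one because the error model's prediction error is large away from its training data. Setting `penalty_weight` to 0 recovers standard BoN.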


Related papers