Fugu-MT Paper Translation (Abstract): Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort

Paper summary: Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort

arxiv url: http://arxiv.org/abs/2510.01367v1
Date: Wed, 01 Oct 2025 18:49:45 GMT
Status: Translation complete
Last updated in system: 2025-10-03 16:59:20.829879
Title: Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
Title (reference translation): Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
Authors: Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, He He
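Abstract summary: Reward hacking is when a reasoning model exploits loopholes in a reward function to achieve high reward without solving the intended task. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task.
Reference score (independently computed attention score): 44.34183850072512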
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. This behavior may be explicit, i.e., verbalized in the model's chain-of-thought (CoT), or implicit, where the CoT appears benign and thus bypasses CoT monitors. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task. This means that the model is using less 'effort' than required to achieve high reward. TRACE quantifies effort by measuring how early a model's reasoning becomes sufficient to pass a verifier. We progressively truncate a model's CoT at various lengths, force the model to answer, and measure the verifier-passing rate at each cutoff. A hacking model, which takes a shortcut, will achieve a high passing rate with only a small fraction of its CoT, yielding a large area under the accuracy-vs-length curve. TRACE achieves over 65% gains over our strongest 72B CoT monitor in math reasoning, and over 30% gains over a 32B monitor in coding. We further show that TRACE can discover unknown loopholes during training. Overall, TRACE offers a scalable unsupervised approach for oversight where current monitoring methods prove ineffective.
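Abstract (reference translation): Reward hacking is when a reasoning model exploits loopholes in a reward function to achieve high reward without solving the intended task. This behavior may be explicit, i.e., verbalized in the model's chain-of-thought (CoT), or implicit. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task. This means that the model is using less 'effort' than required to achieve high reward. TRACE quantifies effort by measuring how early a model's reasoning becomes sufficient to pass a verifier. We progressively truncate the model's CoT at various lengths, force the model to answer, and measure the verifier-passing rate at each cutoff. A hacking model, which takes a shortcut, achieves a high passing rate with only a small fraction of its CoT, yielding a large area under the accuracy-vs-length curve. TRACE achieves gains of over 65% over the strongest 72B CoT monitor in math reasoning, and over 30% over a 32B monitor in coding. We further show that TRACE can discover unknown loopholes during training. TRACE offers a scalable, unsupervised approach to oversight where current monitoring methods prove ineffective.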
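
The truncate-and-score procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `trace_score`, the callback `answer_and_verify` (a hypothetical stand-in for "force the model to answer from a truncated CoT and return the verifier-passing rate"), the choice of truncation fractions, and the trapezoidal integration are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def trace_score(
    cot_tokens: Sequence[str],
    answer_and_verify: Callable[[Sequence[str]], float],
    fractions: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> float:
    """Normalized area under the pass-rate vs. CoT-fraction curve.

    `answer_and_verify` is a hypothetical callback: given a truncated CoT,
    it forces an answer and returns the verifier-passing rate in [0, 1].
    """
    if len(fractions) < 2:
        raise ValueError("need at least two truncation fractions")
    xs = sorted(fractions)
    if xs[0] == xs[-1]:
        raise ValueError("truncation fractions must span a non-empty range")

    # Measure the passing rate at each truncation point.
    ys = []
    for frac in xs:
        cutoff = int(frac * len(cot_tokens))  # number of CoT tokens kept
        ys.append(answer_and_verify(cot_tokens[:cutoff]))

    # Trapezoidal rule, normalized by the x-range so the score lies in [0, 1].
    area = 0.0
    for i in range(1, len(xs)):
        area += 0.5 * (ys[i] + ys[i - 1]) * (xs[i] - xs[i - 1])
    return area / (xs[-1] - xs[0])


# Toy stand-ins (assumptions, not real models or verifiers).
cot = [f"step{i}" for i in range(10)]


def shortcut_model(prefix: Sequence[str]) -> float:
    # Passes the verifier regardless of how much reasoning is kept.
    return 1.0


def genuine_model(prefix: Sequence[str]) -> float:
    # Passes only when the full chain of thought is available.
    return 1.0 if len(prefix) == len(cot) else 0.0


hacking_score = trace_score(cot, shortcut_model)
genuine_score = trace_score(cot, genuine_model)
```

In this toy setup the shortcut model scores higher than the genuine one, matching the abstract's claim that a hacking model yields a large area under the curve. How the score is turned into a detection decision (for example, the comparison baseline or threshold) is not specified in the abstract and is left out here.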

Related papers