Fugu-MT Paper Translation (Abstract): Circuit-Aware Reward Training: A Mechanistic Framework for Longtail Robustness in RLHF

Paper overview: Circuit-Aware Reward Training: A Mechanistic Framework for Longtail Robustness in RLHF

arxiv url: http://arxiv.org/abs/2509.24713v1
Date: Mon, 29 Sep 2025 12:42:14 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.982596
Title: Circuit-Aware Reward Training: A Mechanistic Framework for Longtail Robustness in RLHF
Title (reference translation): Circuit-Aware Reward Training for Longtail Robustness in RLHF
Authors: Jing Liu
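Abstract summary: We propose a mechanistic interpretability framework that identifies specialized neural circuits responsible for rare-event processing in reward models. Our theoretical framework establishes formal connections between circuit specialization, reward generalization bounds, and longtail performance. This approach provides both theoretical insights into reward model failures and practical interventions for improving longtail robustness.
Reference score (independently computed attention score): 6.581088182267414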
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning from Human Feedback (RLHF) reward models exhibit systematic failures on longtail distributions, leading to reward hacking and misalignment. We propose a mechanistic interpretability framework that identifies specialized neural circuits responsible for rare-event processing in reward models. Drawing from recent advances showing distributed specialization for rare tokens in language models \citep{liu2025no, liu2025emergent}, we hypothesize that reward models also develop functionally distinct circuits for longtail scenarios. Our theoretical framework establishes formal connections between circuit specialization, reward generalization bounds, and longtail performance. We introduce Circuit-Aware Reward Training (CART), which uses circuit analysis to guide data augmentation, regularization, and ensemble strategies. This approach provides both theoretical insights into reward model failures and practical interventions for improving longtail robustness.
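Abstract (reference translation): Reinforcement Learning from Human Feedback (RLHF) reward models show systematic failures on longtail distributions, which leads to reward hacking and misalignment. We propose a mechanistic interpretability framework that identifies specialized neural circuits responsible for rare-event processing in reward models. Building on recent advances that show distributed specialization for rare tokens in language models, we hypothesize that reward models also develop functionally distinct circuits for longtail scenarios. Our theoretical framework establishes formal connections between circuit specialization, reward generalization bounds, and longtail performance. We introduce Circuit-Aware Reward Training (CART), which uses circuit analysis to guide data augmentation, regularization, and ensemble strategies. This approach provides both theoretical insights into reward model failures and practical interventions for improving longtail robustness.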

Related papers