Fugu-MT paper translation (abstract): EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms

Paper overview: EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms

arxiv url: http://arxiv.org/abs/2606.04145v1
Date: Tue, 02 Jun 2026 19:03:39 GMT
Status: translation complete
Last updated in system: 2026-06-04 20:44:18.334317
Title: EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms
Title (reference translation): EvalStop: Detecting and Correcting Reward Overoptimization with World Feedback in Multi-Tenant RLHF Platforms
Authors: Guilin Zhang, Chuanyi Sun, Shahryar Sarkani, John M. Fossaceca
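Abstract summary: Cloud fine-tuning platforms increasingly serve RLHF workloads, in which a learned reward model is optimized as a proxy for human quality. EvalStop terminates jobs on k consecutive eval-score declines, releases GPUs, preserves the best checkpoint, and delegates to a base scheduler. On RLHF-heavy workloads, EvalStop achieves precision 98% / recall 99% / FPR 1.5% while improving JCT by 9% and cutting wasted compute by 22% over SRTF-Est.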
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Cloud LLM fine-tuning platforms increasingly serve RLHF workloads, where a learned reward model is optimized as a proxy for human quality. As Gao et al. (2023) showed, this proxy diverges from world feedback (downstream eval metrics) under sustained optimization pressure, a phenomenon known as reward overoptimization. Existing platform schedulers ignore this divergence: non-clairvoyant schedulers optimize JCT without any quality signal, SLAQ-style quality-aware schedulers use training loss (a weaker proxy that drops monotonically through hacking), and classical per-job early stopping requires human monitoring and does not free shared GPUs. We propose EvalStop, a composable scheduling primitive that terminates jobs on k consecutive eval-score declines, releases GPUs, preserves the best checkpoint, and delegates to any base scheduler. We frame scheduler-level early stopping as a detection problem and evaluate it in a discrete-event simulator whose RLHF workload mixes reward-hacking and structurally healthy runs, with ground-truth labels hidden from schedulers. On RLHF-heavy workloads (80% RLHF, 64 GPUs), EvalStop achieves precision 98% / recall 99% / FPR 1.5% while improving JCT by 9% and cutting wasted compute by 22% over SRTF-Est (p<0.05). Trivial fixed-progress and loss-plateau competitors either incur 65% FPR on healthy RLHF or miss over half of true hacking cases. Gains compose across every base scheduler tested (9-25% JCT) and detection quality stays stable under eval noise (precision at least 91% at noise std <= 0.05) and hacking base rate (precision at least 89% across 20-80% hacking fractions).
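Abstract (reference translation): Cloud LLM fine-tuning platforms increasingly serve RLHF workloads, in which a learned reward model is optimized as a proxy for human quality. As Gao et al. (2023) showed, under sustained optimization pressure this proxy diverges from world feedback (downstream eval metrics), a phenomenon known as reward overoptimization. Existing platform schedulers ignore this divergence: non-clairvoyant schedulers optimize JCT with no quality signal, SLAQ-style quality-aware schedulers use training loss (a weaker proxy that keeps dropping monotonically under hacking), and classical per-job early stopping requires human monitoring and does not free shared GPUs. We propose EvalStop, a composable scheduling primitive that terminates jobs on k consecutive eval-score declines, releases GPUs, preserves the best checkpoint, and delegates to any base scheduler. We treat scheduler-level early stopping as a detection problem and evaluate it in a discrete-event simulator whose RLHF workload mixes reward-hacking runs with structurally healthy runs, with ground-truth labels hidden from the schedulers. On RLHF-heavy workloads (80% RLHF, 64 GPUs), EvalStop achieves precision 98% / recall 99% / FPR 1.5% while improving JCT by 9% and cutting wasted compute by 22% relative to SRTF-Est (p<0.05). Trivial fixed-progress and loss-plateau competitors either incur a 65% FPR on healthy RLHF runs or miss more than half of the true hacking cases. The gains compose across every base scheduler tested (9-25% JCT), and detection quality remains stable under eval noise (precision at least 91% at noise std <= 0.05) and across hacking base rates (precision at least 89% for hacking fractions of 20-80%).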
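
The stopping rule described in the abstract (terminate on k consecutive eval-score declines while preserving the best checkpoint) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `EvalStopMonitor`, the helper `run`, the default `k=3`, and the choice to count a decline as a strict drop relative to the previous eval score are all assumptions, since the abstract does not specify them. GPU release and delegation to a base scheduler are not modeled here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class EvalStopMonitor:
    """Hypothetical per-job monitor: flags k consecutive eval-score declines."""
    k: int = 3                         # assumed value; the abstract does not give k
    declines: int = 0                  # length of the current run of declines
    last_score: Optional[float] = None
    best_score: Optional[float] = None
    best_step: Optional[int] = None    # step whose checkpoint would be preserved

    def observe(self, step: int, score: float) -> bool:
        """Record one eval score; return True when the job should be stopped."""
        # Track the best checkpoint seen so far.
        if self.best_score is None or score > self.best_score:
            self.best_score, self.best_step = score, step
        # Assumption: a "decline" is a strict drop versus the previous score.
        if self.last_score is not None and score < self.last_score:
            self.declines += 1
        else:
            self.declines = 0
        self.last_score = score
        return self.declines >= self.k


def run(scores: Iterable[float], k: int = 3) -> Tuple[Optional[int], Optional[int]]:
    """Return (stop_step, best_step); stop_step is None if the job is never stopped."""
    monitor = EvalStopMonitor(k=k)
    for step, score in enumerate(scores):
        if monitor.observe(step, score):
            return step, monitor.best_step
    return None, monitor.best_step


# A run whose eval score peaks and then falls three times in a row is stopped.
hacking_like = [0.50, 0.55, 0.60, 0.58, 0.57, 0.55, 0.70]
# A noisy but improving run never accumulates three consecutive declines.
healthy_like = [0.50, 0.60, 0.58, 0.62, 0.61, 0.65]

print(run(hacking_like))
print(run(healthy_like))
```

In a scheduler, a `True` return from `observe` would be the point at which the job's GPUs are handed back and the checkpoint at `best_step` is kept; how noise tolerance or eval cadence interacts with k is left open in this sketch.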

Related papers