Fugu-MT Paper Translation (Summary): Compute-Optimal Scaling for Value-Based Deep RL

Paper summary: Compute-Optimal Scaling for Value-Based Deep RL

arxiv url: http://arxiv.org/abs/2508.14881v1
Date: Wed, 20 Aug 2025 17:54:21 GMT
Status: Translation complete
Last updated in system: 2025-08-21 16:52:41.543392
Title: Compute-Optimal Scaling for Value-Based Deep RL
Authors: Preston Fu, Oleh Rybkin, Zhiyuan Zhou, Michal Nauman, Pieter Abbeel, Sergey Levine, Aviral Kumar
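Abstract summary: We investigate compute scaling for online, value-based deep RL. Our analysis reveals a nuanced interplay between model size, batch size, and UTD. We provide a mental model for understanding this phenomenon and build guidelines for choosing batch size and UTD.
Reference score (independently computed attention score): 96.33386443664929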
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As models grow larger and training them becomes expensive, it becomes increasingly important to scale training recipes not just to larger models and more data, but to do so in a compute-optimal manner that extracts maximal performance per unit of compute. While such scaling has been well studied for language modeling, reinforcement learning (RL) has received less attention in this regard. In this paper, we investigate compute scaling for online, value-based deep RL. These methods present two primary axes for compute allocation: model capacity and the update-to-data (UTD) ratio. Given a fixed compute budget, we ask: how should resources be partitioned across these axes to maximize sample efficiency? Our analysis reveals a nuanced interplay between model size, batch size, and UTD. In particular, we identify a phenomenon we call TD-overfitting: increasing the batch quickly harms Q-function accuracy for small models, but this effect is absent in large models, enabling effective use of large batch size at scale. We provide a mental model for understanding this phenomenon and build guidelines for choosing batch size and UTD to optimize compute usage. Our findings provide a grounded starting point for compute-optimal scaling in deep RL, mirroring studies in supervised learning but adapted to TD learning.

Related papers