Fugu-MT paper translation (abstract): Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training

Paper summary: Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training

arxiv url: http://arxiv.org/abs/2606.05610v1
Date: Thu, 04 Jun 2026 02:32:11 GMT
Status: translation complete
Last updated in system: 2026-06-05 22:39:44.499052
Title: Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training
Title (reference translation): Predictable scaling laws of optimal hyperparameters in LLM continued pre-training
Authors: Yongwei Zhou, Juncheng Diao, Junlin Shang, Peiguang Li, Rongxiang Weng,
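Abstract summary: This paper proposes a new framework that establishes the relationship between compute budget and optimal hyperparameters for a given checkpoint. The method reduces hyperparameter search overhead by up to 90% while achieving comparable or superior performance relative to baselines. This model-agnostic framework generalizes across architectures and provides a principled, efficient methodology for diverse continued pre-training scenarios.
Reference score (independently computed attention score): 7.267441247692648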
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The efficacy of continued pre-training for Large Language Models (LLMs) hinges upon hyperparameter configurations, such as learning rate and batch size. However, current practices often rely on heuristics or grid searches, leading to training instability and excessive costs. In this work, we first empirically discover that optimal hyperparameters follow stable and predictable scaling laws throughout the continued pre-training process. Leveraging these insights, we propose a novel framework to establish quantitative relationships between compute budget and optimal hyperparameters for a given checkpoint. Our approach has two stages: (1) Empirical Law Discovery, where we train small-scale proxy models to derive functions mapping compute budget to optimal hyperparameters via standard loss-compute scaling laws; and (2) State-Aware Hyperparameter Prediction, where we evaluate an initial checkpoint's validation loss and use the inverse scaling law to estimate its equivalent pre-training compute: the compute needed to achieve the same loss from scratch. Combining this with the planned compute budget, we predict optimal hyperparameters for the target run. Empirical results demonstrate that our method reduces the hyperparameter search overhead by up to 90% while achieving comparable or superior performance relative to baselines. This model-agnostic framework generalizes across architectures, providing a principled and efficient methodology for diverse continued pre-training scenarios starting from any given point.
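Abstract (reference translation): The effectiveness of continued pre-training for Large Language Models (LLMs) depends on hyperparameter settings such as learning rate and batch size. However, current practice often relies on heuristics or grid search, which leads to training instability and excessive cost. In this work, we empirically find that optimal hyperparameters follow stable, predictable scaling laws throughout continued pre-training. Building on this finding, we propose a new framework that establishes a quantitative relationship between the compute budget and the optimal hyperparameters for a given checkpoint. The method has two stages: (1) training small-scale proxy models to derive functions that map compute budget to optimal hyperparameters via standard loss-compute scaling laws, and (2) evaluating the checkpoint's validation loss and using the inverse scaling law to estimate its equivalent pre-training compute. Combining this estimate with the planned compute budget yields predicted optimal hyperparameters for the target run. Experiments show that the method reduces hyperparameter search overhead by up to 90% while matching or exceeding baseline performance. This model-agnostic framework generalizes across architectures and provides a principled, efficient methodology for diverse continued pre-training scenarios starting from any given point.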
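
Illustrative sketch (not from the paper): the two-stage procedure described in the abstract could look roughly like the Python below. Everything specific here is an assumption made for illustration: the pure power-law forms L = a * C^b and lr = k * C^p, the synthetic proxy measurements, the checkpoint loss, and the choice to combine equivalent compute and planned budget by simple addition. The abstract does not state the functional forms or the combination rule, so treat this as one plausible reading rather than the authors' implementation.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(my - b * mx)
    return a, b

def equivalent_compute(val_loss, a, b):
    """Invert L = a * C**b: compute that would reach val_loss from scratch."""
    return (val_loss / a) ** (1.0 / b)

def predict_hyperparameter(total_compute, k, p):
    """Evaluate the fitted hyperparameter law hp = k * C**p."""
    return k * total_compute ** p

# Stage 1 (assumed data): synthetic proxy-model measurements, NOT real results.
proxy_compute = [1e17, 1e18, 1e19, 1e20]                         # FLOPs
proxy_loss = [3.2 * (c / 1e17) ** -0.05 for c in proxy_compute]  # validation loss
proxy_best_lr = [3e-3 * (c / 1e17) ** -0.1 for c in proxy_compute]

a, b = fit_power_law(proxy_compute, proxy_loss)     # loss-compute law
k, p = fit_power_law(proxy_compute, proxy_best_lr)  # compute -> best learning rate

# Stage 2: locate the checkpoint on the loss-compute curve, then predict.
checkpoint_loss = 2.6   # hypothetical measured validation loss
planned_budget = 5e19   # hypothetical continued pre-training budget (FLOPs)
c_eq = equivalent_compute(checkpoint_loss, a, b)
# Assumption: equivalent compute and planned budget are combined by addition.
predicted_lr = predict_hyperparameter(c_eq + planned_budget, k, p)

print(f"equivalent pre-training compute: {c_eq:.3e} FLOPs")
print(f"predicted learning rate: {predicted_lr:.3e}")
```

The same fit-and-evaluate pattern would apply to batch size by fitting a second hyperparameter law on the proxy runs; only the checkpoint's validation loss and the planned budget are needed at prediction time.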

Related papers