Fugu-MT Paper Translation (Abstract): Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining

Paper overview: Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining

arxiv url: http://arxiv.org/abs/2606.11387v1
Date: Tue, 09 Jun 2026 19:10:54 GMT
Status: Translation complete
Last updated in system: 2026-06-11 16:42:38.154762
Title: Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining
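Authors: Felipe Chavarro Polania
Abstract summary: We study an auditable staged-promotion protocol for a fixed micro-pretraining runner on two heterogeneous host blocks, Windows A100 and Linux L40S. We use staged budgets of 2 minutes, 5 minutes, 10 minutes, 60 minutes, and 12 hours.
Reference score (independently computed attention score): 0.0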
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Short pretraining runs can reduce experimental cost, but they can also over-promote configurations that only look strong at tiny budgets. We study an auditable staged-promotion protocol for a fixed micro-pretraining runner on two heterogeneous host blocks: Windows A100 and Linux L40S. Starting from twelve prior-screened configurations, we use staged budgets of 2 minutes, 5 minutes, 10 minutes, 60 minutes, and 12 hours, with frozen promotion rules before expensive continuations. The early screens are intentionally treated as unstable: the 5- and 10-minute rankings are host-sensitive, and the eventual 12-hour top-ranked condition is not the mean-best condition at the replicated 10-minute gate. Because seed ranges differ across stages, these changes are operational promotion evidence, not within-seed curves. A replicated 60-minute gate keeps the Staged Factorial Screening bridge reference in the promoted set, where it ranks first in all four 60-minute host-seed cells. In the final 12-hour confirmation package, the bridge condition ranks first in all four host-seed cells across two seeds; the greedy comparator does not meet the frozen 0.010 val_bpb near-equivalence rule; and the cheaper d8/ar48 (depth-8, aspect-48) sentinel does not meet the frozen 0.020 mean-gap rule. The executed 12-hour branch spends 144 GPU-hours, and the full staged protocol records 169.2 training GPU-hours including screening stages. Continuing all four 60-minute candidates would spend 192 GPU-hours, while continuing all nine replicated 10-minute candidates would spend 432 GPU-hours. The latter numbers are accounting counterfactuals for unrun continuations, not evidence that skipped candidates could not have overtaken the reference. The result is a bounded cost-allocation finding, not a claim of global optimality, capacity-normalized superiority, or superiority over adaptive hyperparameter optimization methods.
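The GPU-hour figures in the abstract can be reproduced with simple accounting. The sketch below is illustrative only and rests on assumptions not stated in the abstract: each run occupies one GPU, every continued condition runs on both hosts with two seeds (the four host-seed cells), and the executed 12-hour branch contains three conditions (bridge, greedy comparator, and d8/ar48 sentinel), which is inferred from 144 = 3 × 4 × 12. The function and variable names are hypothetical.

```python
# Hypothetical accounting sketch; names and the one-GPU-per-run assumption
# are illustrative and not taken from the paper's code.
HOSTS = ("Windows A100", "Linux L40S")
SEEDS_PER_HOST = 2          # two seeds per host -> four host-seed cells
CONTINUATION_HOURS = 12     # length of the final confirmation stage


def continuation_gpu_hours(n_conditions: int) -> int:
    """GPU-hours to run n_conditions for 12 hours in every host-seed cell."""
    cells = len(HOSTS) * SEEDS_PER_HOST
    return n_conditions * cells * CONTINUATION_HOURS


# Executed branch: three conditions (inferred from the 144 GPU-hour total).
executed = continuation_gpu_hours(3)

# Unrun counterfactuals quoted in the abstract.
all_60_minute_candidates = continuation_gpu_hours(4)
all_10_minute_candidates = continuation_gpu_hours(9)

# The abstract reports 169.2 training GPU-hours for the full protocol, so
# the stages before the 12-hour branch account for the remainder.
FULL_PROTOCOL_GPU_HOURS = 169.2
pre_confirmation = round(FULL_PROTOCOL_GPU_HOURS - executed, 1)

print(executed, all_60_minute_candidates, all_10_minute_candidates)
print(pre_confirmation)
```

Under these assumptions the totals match the abstract's 144, 192, and 432 GPU-hours, and the stages before the 12-hour confirmation account for the remaining 25.2 GPU-hours. As the abstract stresses, the two larger figures are accounting counterfactuals and say nothing about how the skipped candidates would have performed.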

Related papers