Fugu-MT paper translation (abstract): Parcae: Scaling Laws For Stable Looped Language Models

Paper overview: Parcae: Scaling Laws For Stable Looped Language Models

arxiv url: http://arxiv.org/abs/2604.12946v1
Date: Tue, 14 Apr 2026 16:43:37 GMT
Status: translation complete
Last updated in system: 2026-04-15 19:11:32.566035
Title: Parcae: Scaling Laws For Stable Looped Language Models
Title (reference translation): Parcae: Scaling Laws for Stable Looped Language Models
Authors: Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, Daniel Y. Fu
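Abstract summary: Traditional fixed-depth architectures scale quality by increasing training FLOPs, typically through increased parameterization, at the expense of a higher memory footprint, or through more data. A potential alternative is looped architectures, which instead increase FLOPs by sending activations through a block of layers in a loop. While promising, existing recipes for training looped architectures can be unstable, suffering from residual explosion and loss spikes. This paper proposes Parcae, a novel stable looped architecture that constrains the spectral norm of the injection parameters via discretization of a negative diagonal parameterization.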
Reference score (independently computed attention score): 35.9547796403241
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Traditional fixed-depth architectures scale quality by increasing training FLOPs, typically through increased parameterization (at the expense of a higher memory footprint) or through more data. A potential alternative is looped architectures, which instead increase FLOPs by sending activations through a block of layers in a loop. While promising, existing recipes for training looped architectures can be unstable, suffering from residual explosion and loss spikes. We address these challenges by recasting looping as a nonlinear time-variant dynamical system over the residual stream. Via a linear approximation to this system, we find that instability occurs in existing looped architectures as a result of large spectral norms in their injection parameters. To address these instability issues, we propose Parcae, a novel stable, looped architecture that constrains the spectral norm of the injection parameters via discretization of a negative diagonal parameterization. As a result, Parcae achieves up to 6.3% lower validation perplexity than prior large-scale looped models. Using our stable looped architecture, we investigate the scaling properties of looping as a medium to improve quality by increasing FLOPs at training and test time. For training, we derive predictable power laws to scale FLOPs while keeping parameter count fixed. Our initial scaling laws suggest that looping and data should be increased in tandem, given a fixed FLOP budget. At test time, we find that Parcae can use looping to scale compute, following a predictable, saturating exponential decay. When scaled up to 1.3B parameters, we find that Parcae improves CORE and Core-Extended quality by 2.99 and 1.18 points when compared to strong Transformer baselines under a fixed parameter and data budget, achieving a relative quality of up to 87.5% of a Transformer twice the size.
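The stability argument in the abstract can be illustrated with a toy linear recurrence. The sketch below is not the paper's implementation: it assumes the looped state follows h <- A h + x with a diagonal A, and it borrows the zero-order-hold form exp(step * a) with a < 0 from state space models as one plausible reading of "discretization of a negative diagonal parameterization". The function names, the step size, and the choice to constrain the coefficient on the state are all hypothetical.

```python
import math

def discretize_negative_diagonal(log_rates, step):
    """Map unconstrained parameters to diagonal entries in (0, 1).

    Assumed form (not taken from the paper):
      continuous-time rate a_i = -exp(log_rates[i]), which is always negative;
      discretized entry abar_i = exp(step * a_i), which lies in (0, 1) for step > 0.
    """
    return [math.exp(-step * math.exp(r)) for r in log_rates]

def run_loop(diag, state, injected, num_loops):
    """Iterate the elementwise linear recurrence h <- diag * h + injected."""
    for _ in range(num_loops):
        state = [a * h + x for a, h, x in zip(diag, state, injected)]
    return state

def max_abs(values):
    """Largest absolute entry of a vector."""
    return max(abs(v) for v in values)

# Unconstrained diagonal with one entry above 1: the state grows geometrically.
unstable = run_loop([1.2, 0.9], [1.0, 1.0], [0.1, 0.1], num_loops=50)

# Constrained diagonal: for a diagonal matrix the spectral norm is the largest
# absolute entry, which stays below 1 here, so the state remains bounded.
stable_diag = discretize_negative_diagonal([0.0, -2.0], step=0.5)
stable = run_loop(stable_diag, [1.0, 1.0], [0.1, 0.1], num_loops=50)

print("unconstrained max |h|:", max_abs(unstable))
print("constrained   max |h|:", max_abs(stable))
```

In this toy setting the constrained run settles toward the fixed point x / (1 - abar) in each coordinate, while the unconstrained run blows up along the coordinate whose coefficient exceeds 1. That mirrors, in simplified form, the residual explosion the abstract attributes to large spectral norms.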

Related papers