Fugu-MT Paper Translation (Abstract): The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry

Paper overview: The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry

arxiv url: http://arxiv.org/abs/2604.22778v1
Date: Fri, 03 Apr 2026 08:58:53 GMT
Status: Translation complete
Last updated in system: 2026-05-04 02:32:14.173555
Title: The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry
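Title (reference translation): The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry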
Authors: Yi Liu
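Abstract summary: This paper presents the first systematic study of weight matrix singular value spectra during transformer pretraining. (1) Transient Compression Waves: stable rank compression propagates as a traveling wave from early to late layers. (2) Persistent Spectral Gradients: the power-law exponent $α$ develops a permanent depth gradient that forms a non-monotonic inverted-U in deeper models.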
Reference score (site-computed attention metric): 4.28787537081191
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We present the first systematic study of weight matrix singular value spectra \emph{during} transformer pretraining, tracking full SVD decompositions of every weight matrix at 25-step intervals across three model scales (30M--285M parameters). We discover three phenomena: \textbf{(1)~Transient Compression Waves:} stable rank compression propagates as a traveling wave from early to late layers, creating a dramatic gradient that peaks early then \emph{reverses} -- late layers eventually over-compress past early layers. \textbf{(2)~Persistent Spectral Gradients:} the power-law exponent~$α$ develops a permanent depth gradient forming a non-monotonic inverted-U in deeper models, with peaks shifting toward earlier layers as depth increases. \textbf{(3)~Q/K--V Functional Asymmetry:} value/output projections compress uniformly while query/key projections carry the full depth-dependent dynamics. The dissociation between transient compression and persistent spectral shape reveals that \emph{rank and spectral shape encode fundamentally different information about training}. We formalize this as a two-timescale dynamical model and derive scaling laws ($Δα\propto L^{0.26}$, $R^2{=}0.99$). We validate on nine models across three families (custom, GPT-2, Pythia; 30M--1B parameters; 8--36 layers), demonstrate that $α$ predicts layer importance ($ρ{=}0.69$--$0.84$, $p{<}0.02$), and show that spectral-guided pruning outperforms Last-N heuristics by $1.1{\times}$--$3.6{\times}$ across seven models in two families (GPT-2 124M--774M, Pythia 160M--1B), with worst-vs-best gaps up to $23.7{\times}$ confirming the causal role of spectral structure.
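Abstract (reference translation): We present the first systematic study of weight matrix singular value spectra during transformer pretraining, tracking the full SVD of every weight matrix at 25-step intervals across three model scales (30M--285M parameters). We find three phenomena. (1) Transient Compression Waves: stable rank compression propagates as a traveling wave from early to late layers, creating a steep gradient that peaks early and then reverses, so that late layers eventually compress further than early layers. (2) Persistent Spectral Gradients: the power-law exponent $α$ develops a permanent depth gradient that forms a non-monotonic inverted-U in deeper models, with the peak shifting toward earlier layers as depth increases. (3) Q/K--V Functional Asymmetry: value/output projections compress uniformly, while query/key projections carry the full depth-dependent dynamics. The dissociation between transient compression and persistent spectral shape shows that rank and spectral shape encode fundamentally different information about training. We formalize this as a two-timescale dynamical model and derive scaling laws ($Δα \propto L^{0.26}$, $R^2 = 0.99$). We validate on nine models across three families (custom, GPT-2, Pythia; 30M--1B parameters; 8--36 layers), demonstrate that $α$ predicts layer importance ($ρ = 0.69$--$0.84$, $p < 0.02$), and show that spectral-guided pruning outperforms Last-N heuristics by $1.1\times$--$3.6\times$ across seven models in two families (GPT-2 124M--774M, Pythia 160M--1B), with worst-vs-best gaps up to $23.7\times$ confirming the causal role of spectral structure.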
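The two per-matrix quantities named in the abstract, stable rank and the power-law exponent $α$, can be illustrated with a short NumPy sketch. This is not the authors' code, and the abstract does not state their estimators. The sketch assumes the standard definition of stable rank (squared Frobenius norm divided by squared spectral norm) and, purely for illustration, estimates $α$ as the negative slope of a least-squares fit of log singular value against log rank index. The function names `stable_rank` and `power_law_exponent` are hypothetical.

```python
import numpy as np


def stable_rank(W):
    """Stable rank ||W||_F^2 / ||W||_2^2, computed from the singular values."""
    s = np.linalg.svd(W, compute_uv=False)  # sorted in descending order
    return float(np.sum(s ** 2) / s[0] ** 2)


def power_law_exponent(W, rel_tol=1e-12):
    """Illustrative estimate of alpha, assuming sigma_k ~ k^(-alpha).

    Assumption: alpha is taken as the negative slope of a log-log
    least-squares fit over the rank index k. The paper may use a
    different estimator.
    """
    s = np.linalg.svd(W, compute_uv=False)
    s = s[s > rel_tol * s[0]]          # drop numerically zero singular values
    k = np.arange(1, len(s) + 1)       # rank index 1..n
    slope, _intercept = np.polyfit(np.log(k), np.log(s), 1)
    return float(-slope)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 64))  # stand-in for a weight matrix
    print("stable rank:", stable_rank(W))
    print("alpha estimate:", power_law_exponent(W))
```

Applied to every weight matrix of a saved checkpoint at regular step intervals, these two numbers give per-layer curves of the kind the abstract describes. The checkpoint loading and layer naming depend on the training framework and are left out here.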

Related papers

The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression [0.0]
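On two sequence tasks, we decompose the spectral edge during grokking into its gradient and weight-decay components. We find a sharp two-phase lifecycle: before grokking, the edge is universality-driven and functionally active; at grokking, gradient and weight decay align and the edge becomes a compression axis.
Paper reference translation (metadata) (2026-04-08T01:57:04Z)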
The Spectral Geometry of Thought: Phase Transitions, Instruction Reversal, Token-Level Dynamics, and Perfect Correctness Prediction in How Transformers Reason [4.28787537081191]
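We find that phase transitions appear in the hidden activation space of large language models when they engage in reasoning versus factual recall. We establish a comprehensive theory of reasoning in transformers and show that the geometry of thought is universal in direction, architecture-specific, and predictive of outcomes.
Paper reference translation (metadata) (2026-04-03T09:18:57Z)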
Spectral Edge Dynamics of Training Trajectories: Signal--Noise Geometry Across Scales [0.0]
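We show that transformer training trajectories evolve only along coherent directions. In companion work, the same spectral geometry provides an early-warning signal for grokking.
Paper reference translation (metadata) (2026-03-14T04:46:05Z)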
Improved high-dimensional estimation with Langevin dynamics and stochastic weight averaging [54.172416732517156]
We show that Langevin dynamics succeeds with $n \gtrsim d^{k^\star/2}$ samples when one considers the averaged iterate rather than the last iterate.
Paper reference translation (metadata) (2026-03-06T08:23:24Z)
Latent Object Permanence: Topological Phase Transitions, Free-Energy Principles, and Renormalization Group Flows in Deep Transformer Manifolds [0.5729426778193398]
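We study the emergence of multi-step reasoning in deep transformer language models through the lenses of geometry and statistical physics. We formalize the forward pass as a discrete coarse-graining map and relate the emergence of stable "concept basins" to fixed points of these renormalization-like dynamics. The resulting low-entropy regime is characterized by spectral tail collapse and by the formation of transient, reusable object-like structures in representation space.
Paper reference translation (metadata) (2026-01-16T23:11:02Z)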
Robust Layerwise Scaling Rules by Proper Weight Decay Tuning [50.11170157029911]
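In modern scale-invariant architectures, training quickly enters a degraded regime. We introduce a weight-decay scaling rule for AdamW that preserves sublayer gain across widths. The result extends $\mu$P beyond the near-init regime by explicitly controlling the steady-state scales.
Paper reference translation (metadata) (2025-10-17T02:58:35Z)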
Understanding Transformers for Time Series: Rank Structure, Flow-of-ranks, and Compressibility [90.894232610821]
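We analyze transformers through the lens of rank structure. We show that time-series embeddings exhibit sharply decaying singular value spectra, and that the associated $Q/K/V$ projections admit accurate low-rank approximations.
Paper reference translation (metadata) (2025-10-02T23:56:17Z)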
KPZ scaling from the Krylov space [83.88591755871734]
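Superdiffusion exhibiting Kardar-Parisi-Zhang (KPZ) scaling in real-time correlators and autocorrelators has recently been reported. Inspired by these results, we study KPZ scaling of correlation functions using the Krylov operator basis.
Paper reference translation (metadata) (2024-06-04T20:57:59Z)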
On Biased Compression for Distributed Learning [55.89300593805943]
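We show for the first time that biased compressors lead to linear convergence rates in both single-node and distributed settings. We propose new biased compressors with promising theoretical guarantees and practical performance.
Paper reference translation (metadata) (2020-02-27T19:52:24Z)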
Minimum optical depth multiport interferometers for approximating arbitrary unitary operations and pure states [37.69303106863453]
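We address the problem of using multiport interferometers to prepare pure states and to approximate unitary operations to within a given infidelity. Numerical calculations show that pure states of arbitrary dimension $d$ can be prepared to within a given infidelity.
Paper reference translation (metadata) (2020-02-04T15:40:49Z)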

The related papers list is generated automatically from the titles and abstracts of papers on this site.
