Fugu-MT paper translation (abstract): Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training

Paper summary: Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training

arxiv url: http://arxiv.org/abs/2605.13652v2
Date: Tue, 19 May 2026 00:27:00 GMT
Status: Translation complete
System update date: 2026-05-20 15:03:08.279306
Title: Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training
Authors: Namrata Shivagunde, Vijeta Deshpande, Sherin Muckatira, Anna Rumshisky
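Abstract summary: We show that low-rank methods are not equivalent to full-rank training, nor to one another, even when validation perplexity is close. Low-rank activations diverge from full-rank in later layers as training progresses, with GaLore tracking full-rank most closely.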
Reference score (independently computed attention score): 11.118638230247951
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Pre-training large language models is dominated by the memory cost of storing full-rank weights, gradients, and optimizer states. Low-rank pre-training has emerged to address this, and the space of methods has grown rapidly. A central question remains open: do low-rank methods produce models that generalize comparably to full-rank training, or does the rank constraint fundamentally alter the solutions reached? Existing comparisons rely almost entirely on validation perplexity from single-seed runs, often carried forward from prior literature. Yet perplexity is a poor proxy for solution quality; two methods can match on perplexity while converging to different loss landscape regions and internal representations. We close this gap by characterizing the solutions found by five low-rank pre-training methods, GaLore and Fira (memory-efficient optimizers), CoLA and SLTrain (architecture reparameterizations), and ReLoRA (adapter-style updates with periodic resets), against full-rank training at three model scales (60M, 130M, 350M). We evaluate each along 16 metrics across four dimensions: 1-D loss landscape along random/top-K PCA directions, 1-D interpolation between checkpoints, spectral structure of the weights and learned updates, and activation similarity to full-rank training. We show that low-rank methods are not equivalent to full-rank training, nor to one another, even when validation perplexity is close. Full-rank training settles into a sharper basin than low-rank methods along random directions, while the reverse holds for the top-1 PCA direction. Each method converges to a geometrically distinct basin. Low-rank activations diverge from full-rank in later layers as training progresses, with GaLore tracking full-rank most closely. Further, validation perplexity does not translate to downstream performance at every scale. Adding geometric and spectral metrics improves the prediction.
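Two of the four dimensions named in the abstract, the 1-D loss landscape along a random direction and 1-D interpolation between checkpoints, can be illustrated with a minimal NumPy sketch. This is not the authors' code, and the abstract does not give the exact metric definitions. The toy quadratic loss, the helper names (`loss_along_direction`, `loss_along_interpolation`, `toy_loss`), and the parameter vectors are all hypothetical stand-ins for a real model and its validation loss.

```python
import numpy as np

def loss_along_direction(loss_fn, theta, direction, alphas):
    """Evaluate loss_fn at theta + alpha * direction for each alpha."""
    return np.array([loss_fn(theta + a * direction) for a in alphas])

def loss_along_interpolation(loss_fn, theta_a, theta_b, ts):
    """Evaluate loss_fn on the segment (1 - t) * theta_a + t * theta_b."""
    return np.array([loss_fn((1.0 - t) * theta_a + t * theta_b) for t in ts])

def toy_loss(theta):
    # Hypothetical stand-in for a validation loss: a quadratic bowl with
    # curvature 10 along the first coordinate and 0.1 along the others.
    curvature = np.full(theta.shape, 0.1)
    curvature[0] = 10.0
    return 0.5 * float(np.sum(curvature * theta ** 2))

rng = np.random.default_rng(0)
dim = 8
theta_star = np.zeros(dim)  # assumed "converged" parameters (the minimum)

# 1-D slice along a unit-norm random direction through theta_star.
direction = rng.standard_normal(dim)
direction /= np.linalg.norm(direction)
alphas = np.linspace(-1.0, 1.0, 21)
slice_losses = loss_along_direction(toy_loss, theta_star, direction, alphas)

# 1-D linear interpolation between two assumed checkpoints.
theta_a = np.ones(dim)
theta_b = -np.ones(dim)
ts = np.linspace(0.0, 1.0, 11)
path_losses = loss_along_interpolation(toy_loss, theta_a, theta_b, ts)
```

In a real setting, `theta` would be the flattened model weights and `loss_fn` would run a forward pass on held-out data. How quickly `slice_losses` rises away from `alpha = 0` gives a rough sense of basin sharpness along that direction, and a bump in `path_losses` between two checkpoints would suggest they do not share a linearly connected basin.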


Related papers