Fugu-MT Paper Translation (Abstract): Compressing Transformer Language Models via Matrix Product Operator Decomposition: A Case Study on PicoGPT

Paper overview: Compressing Transformer Language Models via Matrix Product Operator Decomposition: A Case Study on PicoGPT

arxiv url: http://arxiv.org/abs/2603.28534v1
Date: Mon, 30 Mar 2026 14:57:47 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:45.459763
Title: Compressing Transformer Language Models via Matrix Product Operator Decomposition: A Case Study on PicoGPT
Authors: Younes Javanmard, Tanmoy Pandit, Masoud Mardani
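Abstract summary: Transformer-based language models achieve strong performance across NLP tasks, but their quadratic parameter scaling makes deployment on resource-constrained hardware expensive. We study Matrix Product Operator (MPO) decomposition as a principled compression method for transformers. MPO factorises weight matrices into chains of low-rank cores, with approximation quality controlled by the bond dimension chi.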
Reference score (independently computed attention score): 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformer-based language models achieve strong performance across NLP tasks, but their quadratic parameter scaling with hidden dimension makes deployment on resource-constrained hardware expensive. We study Matrix Product Operator (MPO) decomposition as a principled compression method for transformers. MPO factorises weight matrices into chains of low-rank cores, with approximation quality controlled by the bond dimension chi. We replace every nn.Linear layer in PicoGPT, a GPT-2-style character-level language model with about 1M parameters, with an MPOLinear module parameterised as an MPO chain. Cores are initialised either by TT-SVD from pretrained dense weights or from random initialisation, and trained using standard PyTorch autograd without a custom backward pass. We derive balanced factorisation schemes for the five distinct weight shapes in PicoGPT and evaluate bond dimensions chi in {4, 8, 16, 32} on Tiny Shakespeare. MPO compression achieves up to 13x compression per transformer block at chi = 4. At chi = 16, the model uses 191,872 parameters instead of 1,020,224 while retaining 97.7% of baseline token accuracy (51.6% vs 52.8%). Reconstruction error follows the expected trend and is lower for three-site than two-site factorisations at the same bond dimension. The chi = 8 model gives the best accuracy per parameter, exceeding the dense baseline by 2.7x on this metric. These results show that MPO parameterisation is a practical and theoretically grounded alternative to low-rank methods and unstructured pruning for transformer compression.
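The factorisation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' MPOLinear module: it covers only the two-site case, where TT-SVD reduces to a single truncated SVD, and it does no training. The function names `mpo_two_site` and `mpo_to_dense`, the 64 x 64 random weight matrix, and the (8, 8) index split are all assumptions chosen for illustration; the abstract does not state PicoGPT's actual weight shapes or factorisation schemes.

```python
import numpy as np


def mpo_two_site(W, out_dims, in_dims, chi):
    """Split W (out x in) into two cores joined by a bond of size <= chi.

    Hypothetical helper: out_dims = (o1, o2) and in_dims = (i1, i2) must
    satisfy o1 * o2 == W.shape[0] and i1 * i2 == W.shape[1].
    """
    (o1, o2), (i1, i2) = out_dims, in_dims
    assert W.shape == (o1 * o2, i1 * i2)
    # Group (o1, i1) on the left site and (o2, i2) on the right site.
    T = W.reshape(o1, o2, i1, i2).transpose(0, 2, 1, 3)
    T = T.reshape(o1 * i1, o2 * i2)
    U, S, Vt = np.linalg.svd(T, full_matrices=False)
    r = min(chi, S.size)  # truncate the bond to at most chi
    A = (U[:, :r] * S[:r]).reshape(o1, i1, r)  # left core, singular values absorbed
    B = Vt[:r].reshape(r, o2, i2)              # right core
    return A, B


def mpo_to_dense(A, B):
    """Contract the bond index and return the dense (out x in) matrix."""
    o1, i1, _ = A.shape
    _, o2, i2 = B.shape
    T = np.einsum("aik,kbj->abij", A, B)  # axes: (o1, o2, i1, i2)
    return T.reshape(o1 * o2, i1 * i2)


# Illustrative data only: a random matrix, not a pretrained PicoGPT weight.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
dense_params = W.size

report = []  # (chi, number of MPO parameters, relative Frobenius error)
for chi in (4, 8, 16, 32, 64):
    A, B = mpo_two_site(W, (8, 8), (8, 8), chi)
    rel_err = np.linalg.norm(W - mpo_to_dense(A, B)) / np.linalg.norm(W)
    report.append((chi, A.size + B.size, rel_err))
    print(f"chi={chi:2d}  params={A.size + B.size:5d}/{dense_params}  rel_err={rel_err:.3f}")
```

In this sketch the parameter count grows linearly with chi (here 128 * chi), while the reconstruction error can only shrink as chi grows, which is the trade-off the bond dimension controls. At the full bond dimension of 64 the reconstruction is exact but uses more parameters than the dense matrix, so compression only happens at small chi. A random matrix has a nearly flat singular spectrum, so the errors printed here say nothing about the errors reported for pretrained weights.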


Related papers