Fugu-MT Paper Translation (Abstract): A Practical Tensor-Network Compression Pipeline for Production-Scale Large Language Models

Paper overview: A Practical Tensor-Network Compression Pipeline for Production-Scale Large Language Models

arxiv url: http://arxiv.org/abs/2602.01613v1
Date: Mon, 02 Feb 2026 04:03:39 GMT
Status: Translation complete
Last updated in system: 2026-02-03 19:28:33.894582
Title: A Practical Tensor-Network Compression Pipeline for Production-Scale Large Language Models
Authors: Sergii Kozyrev, Davyd Maiboroda
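Abstract summary: Minima is a production compression pipeline that learns where and how to structurally compress a Transformer. On Qwen3-32B at an 8k-token context window, Minima reduces peak VRAM from 64 GiB to 40 GiB. For a single active request, throughput increases from 40 tokens per second (baseline) to 50 tokens per second (Minima) and 75 tokens per second (Minima with speculative decoding).
Reference score (independently computed attention score): 0.0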
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models are limited in deployment by GPU memory and inference latency. We present Minima, a production compression pipeline that learns where and how to structurally compress a Transformer and turns that compression into real serving gains. Minima trains a lightweight convolutional predictor to estimate layer- and patch-level sensitivity, applies a mixture of Tucker, tensor-train, and tensor-ring decompositions to low-sensitivity regions, performs a short healing fine-tune, and executes the resulting operators with custom Triton and CUDA kernels. The reduced memory footprint enables speculative decoding with a small draft model and a larger verifier. On Qwen3-32B at an 8k-token context window, Minima reduces peak VRAM from 64 GiB to 40 GiB. For a single active request, throughput increases from 40 tokens per second (baseline) to 50 tokens per second (Minima) and 75 tokens per second (Minima with speculative decoding). Under 50 parallel requests, throughput is 34, 44, and 53 tokens per second respectively, showing that Minima remains effective under high concurrency even when speculative decoding gains compress. We position Minima relative to recent tensor-network, low-rank plus quantization, and cross-layer sharing methods, and argue that it is a practical step toward more aggressive structural compression via shared tensor backbones with tiny per-layer adapters.
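The abstract names tensor-train decomposition as one of the factorizations Minima applies to low-sensitivity regions. As a rough illustration of what such a factorization does to a weight matrix, the sketch below implements the textbook TT-SVD procedure in NumPy. It is not the authors' implementation: the function names `tt_svd` and `tt_reconstruct`, the 64 x 64 matrix, the reshape to (8, 8, 8, 8), and the rank cap of 4 are all assumptions chosen for the example, and the sensitivity predictor, healing fine-tune, and custom kernels described in the abstract are not modeled.

```python
import numpy as np


def tt_svd(tensor, max_rank):
    """Textbook TT-SVD: split `tensor` into tensor-train cores.

    Each core has shape (rank_in, mode_size, rank_out), and every
    internal rank is capped at `max_rank` (hypothetical parameter).
    """
    dims = tensor.shape
    cores = []
    rank_prev = 1
    # Unfold so the first mode indexes the rows.
    mat = tensor.reshape(rank_prev * dims[0], -1)
    for k in range(len(dims) - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        rank = min(max_rank, len(s))
        # Left singular vectors become the k-th core.
        cores.append(u[:, :rank].reshape(rank_prev, dims[k], rank))
        # Carry the truncated remainder (S @ Vt) to the next step.
        mat = s[:rank, None] * vt[:rank]
        rank_prev = rank
        if k + 1 < len(dims) - 1:
            mat = mat.reshape(rank_prev * dims[k + 1], -1)
    cores.append(mat.reshape(rank_prev, dims[-1], 1))
    return cores


def tt_reconstruct(cores):
    """Contract the cores back into a dense tensor."""
    result = cores[0]
    for core in cores[1:]:
        result = np.tensordot(result, core, axes=([-1], [0]))
    # Drop the boundary ranks of size 1 at both ends.
    return result.squeeze(axis=(0, -1))


# Illustrative only: a random 64 x 64 "weight matrix" viewed as a 4-way tensor.
rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 64))
tensor = weight.reshape(8, 8, 8, 8)

cores = tt_svd(tensor, max_rank=4)
n_params = sum(core.size for core in cores)
rel_error = np.linalg.norm(tt_reconstruct(cores) - tensor) / np.linalg.norm(tensor)

print("core shapes:", [core.shape for core in cores])
print("parameters:", n_params, "vs dense:", tensor.size)
print("relative reconstruction error:", rel_error)
```

A random matrix has no low-rank structure, so the reconstruction error here is large. The example only shows the mechanics and the parameter saving; in a pipeline such as the one the abstract describes, the rank and the regions to compress would presumably be chosen from the predicted sensitivity rather than fixed by hand.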

